Thank you, that works as promised and removes the warning. I'm still hoping
to find a resource that better explains the pieces of PDFBox and how they
work together. Unfortunately, most posts on the internet cover only the how
and not the why.

Appreciate it!

-Aaron


On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr <thaush...@t-online.de>
wrote:

> Hi,
>
> That didn't happen to me, but maybe it did happen to you with another file.
>
> Another solution would be to pass your own properties file, and it should
> have this content:
>
> =======================
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #      http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
>
> # This table maps PDF stream operators to concrete OperatorProcessor
> # subclasses that are used by the PDFStreamEngine class to interpret the
> # PDF document. The classes configured here allow the PDFTextStripper
> # subclass of PDFStreamEngine to extract text content of the document.
>
> BT = org.apache.pdfbox.util.operator.BeginText
> cm = org.apache.pdfbox.util.operator.Concatenate
> Do = org.apache.pdfbox.util.operator.Invoke
> ET = org.apache.pdfbox.util.operator.EndText
> gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
> q  = org.apache.pdfbox.util.operator.GSave
> Q  = org.apache.pdfbox.util.operator.GRestore
> T* = org.apache.pdfbox.util.operator.NextLine
> Tc = org.apache.pdfbox.util.operator.SetCharSpacing
> Td = org.apache.pdfbox.util.operator.MoveText
> TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
> Tf = org.apache.pdfbox.util.operator.SetTextFont
> Tj = org.apache.pdfbox.util.operator.ShowText
> TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
> TL = org.apache.pdfbox.util.operator.SetTextLeading
> Tm = org.apache.pdfbox.util.operator.SetMatrix
> Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
> Ts = org.apache.pdfbox.util.operator.SetTextRise
> Tw = org.apache.pdfbox.util.operator.SetWordSpacing
> Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
> w  = org.apache.pdfbox.util.operator.SetLineWidth
> \' = org.apache.pdfbox.util.operator.MoveAndShow
> \" = org.apache.pdfbox.util.operator.SetMoveAndShow
>
> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
> SC=org.apache.pdfbox.util.operator.SetStrokingColor
> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
>
> # The following operators are not relevant to text extraction,
> # so we can silently ignore them.
>
> b
> B
> b*
> B*
> BDC
> BI
> BMC
> BX
> c
> d
> d0
> d1
> DP
> EI
> EMC
> EX
> f
> F
> f*
> h
> i
> ID
> j
> J
> l
> m
> M
> MP
> n
> re
> ri
> s
> S
> sh
> v
> W
> W*
> y
>
> =======================
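
A note on how a table like the one above gets read, for anyone following along: the file is in ordinary java.util.Properties format. A line such as "BT = ..." becomes a key with a class name as its value, while a bare line such as "b" becomes a key with an empty value. As far as I understand the 1.8 engine, an empty value is what marks an operator as one to skip silently, but treat that as my reading rather than documented behaviour. Here is a minimal sketch using only the JDK and a short excerpt of the table; OperatorTableSketch and its methods are names I made up for illustration and are not part of PDFBox:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a PDFBox class: parses an operator table excerpt.
final class OperatorTableSketch {
    // A few lines copied from the table above (without the mail quote markers).
    static final String EXCERPT =
          "BT = org.apache.pdfbox.util.operator.BeginText\n"
        + "q  = org.apache.pdfbox.util.operator.GSave\n"
        + "T* = org.apache.pdfbox.util.operator.NextLine\n"
        + "\\' = org.apache.pdfbox.util.operator.MoveAndShow\n"
        + "rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor\n"
        + "# operators with no processor\n"
        + "b\n"
        + "re\n";

    // Load the text exactly as java.util.Properties would load the file.
    static Properties parse(String text) {
        Properties table = new Properties();
        try {
            table.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    // Keys that appear on a line of their own end up with an empty value.
    static Set<String> ignoredOperators(Properties table) {
        Set<String> ignored = new TreeSet<>();
        for (String operator : table.stringPropertyNames()) {
            if (table.getProperty(operator).isEmpty()) {
                ignored.add(operator);
            }
        }
        return ignored;
    }
}
```

Note that the backslash in front of the quote operators is only Properties escaping; the key that comes out is the bare quote character. To use a full custom table, the file would be loaded into a Properties object in the same way and passed in as described above.
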
>
> Tilman
>
> On 27.07.2014 15:54, -A wrote:
>
>  Tilman;
>>
>> That is somewhat embarrassing. At one point I brought this to the mailing
>> list (because of the following warning) and was told to remove that line
>> because the TextStripper wasn't actually a PageDrawer. The functionality
>> still worked after that, however.
>>
>> Is there a way to do this without the warning, perhaps something within
>> PageDrawer?
>>
>>
>> Thank you,
>> -Aaron
>>
>>
>> WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>> java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>   at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
>>   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>>   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>   at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
>>   at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
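
On the "why" of this trace: the top frame is a processor from the pagedrawer package, and the message says it tried to treat the engine that called it as a PageDrawer. A text stripper is a stream engine but not a PageDrawer, so the cast fails as soon as a drawing-only operator such as "re" is dispatched. The sketch below reproduces that shape with stand-in classes only; Engine, Drawer, Stripper and appendRectangle are hypothetical names, not the PDFBox types:

```java
// Toy model of the failure in the trace; none of these are PDFBox classes.
final class CastSketch {
    static class Engine {}
    static class Drawer extends Engine { int rectangles; }
    static class Stripper extends Engine {}

    // Stand-in for a drawing-only operator processor: it assumes that
    // whatever engine invoked it is a Drawer.
    static void appendRectangle(Engine context) {
        Drawer drawer = (Drawer) context; // ClassCastException for a Stripper
        drawer.rectangles++;
    }

    // Reports whether dispatching the drawing operator fails for this engine.
    static boolean failsFor(Engine context) {
        try {
            appendRectangle(context);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

Seen this way, a table that lists only processors which do not assume a PageDrawer (the text and colour operators) and leaves the path operators as bare keys would avoid the cast altogether.
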
>>
>>
>>
>>
>> On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr <thaush...@t-online.de>
>> wrote:
>>
>>  It is even easier than I thought - replace super() with this:
>>>
>>> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>>>
>>> Tilman
>>>
>>> On 27.07.2014 13:03, Tilman Hausherr wrote:
>>>
>>>   After having written the text below, I tested by including the "rg"
>>>> operator in the properties list and now it worked. I also tested deleting
>>>> your println and instead adding this if the text is red:
>>>>
>>>>      System.out.print (textPos.getCharacter());
>>>>
>>>> and so I got this output:
>>>>
>>>> 21_Key .1295 R~Wall Prof LinP 0.003             0.004     0.000 true
>>>>
>>>> which is exactly what is red in the PDF.
>>>>
>>>> Another way (probably better) to do it would be to derive not from
>>>> PDFTextStripper but from PDFStreamEngine, and to construct it with
>>>>
>>>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>>>
>>>>
>>>> see also http://stackoverflow.com/a/9157714/535646
>>>>
>>>> Tilman
>>>>
>>>>
>>>> On 27.07.2014 12:14, Tilman Hausherr wrote:
>>>>
>>>>  Hi,
>>>>>
>>>>> Do you still have the code that worked?
>>>>>
>>>>> I'm not the text extraction specialist here, but what I did was to look
>>>>> in the uncompressed source of the PDF. The stream has code like this:
>>>>>
>>>>> 0 0 0 rg
>>>>> 0 0.5019 0 rg
>>>>> 1 0 0 rg
>>>>>
>>>>> The first line sets the color to black, the second to green, the third
>>>>> to red. And from what I saw, it can't work at all, because the "rg"
>>>>> operator isn't processed when extracting text, because
>>>>> PDFTextStripper.properties doesn't contain the "rg" operator. (The
>>>>> operator is in another list, which is used when rendering.)
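
To make that reasoning concrete: an engine driven by an operator table only updates its state for operators that are in the table, so if "rg" is missing, the colour stays at its initial value no matter what the stream says. That would match the DeviceGray reported for every character. The toy interpreter below is a sketch under that assumption; ToyStreamEngine is a made-up class that understands nothing but "rg", and it is not how PDFBox is implemented:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical miniature of a table-driven content stream engine.
final class ToyStreamEngine {
    private final Set<String> registered;
    // Initial non-stroking colour: black in DeviceGray.
    String colorSpace = "DeviceGray";
    float[] color = {0f};

    ToyStreamEngine(String... operators) {
        registered = new HashSet<>(Arrays.asList(operators));
    }

    // Numbers are collected as operands; any other token is an operator.
    void process(String stream) {
        List<Float> operands = new ArrayList<>();
        for (String token : stream.trim().split("\\s+")) {
            if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.add(Float.parseFloat(token));
                continue;
            }
            if (registered.contains(token)) {
                apply(token, operands);
            }
            // Unregistered operators fall through here: operands are dropped
            // and the graphics state is left untouched.
            operands.clear();
        }
    }

    private void apply(String operator, List<Float> operands) {
        if (operator.equals("rg") && operands.size() == 3) {
            colorSpace = "DeviceRGB";
            color = new float[] {operands.get(0), operands.get(1), operands.get(2)};
        }
    }

    boolean isRed() {
        return colorSpace.equals("DeviceRGB")
            && color[0] == 1f && color[1] == 0f && color[2] == 0f;
    }
}
```

Feeding the three "rg" lines quoted above to an engine built without "rg" leaves it in DeviceGray, while an engine built with "rg" ends up red after the last line.
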
>>>>>
>>>>> So that is what puzzles me. I think it can't work at all. But you said
>>>>> it did work at a time.
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>>> On 27.07.2014 07:43, Tilman Hausherr wrote:
>>>>>
>>>>>  Hi,
>>>>>>
>>>>>> Please upload the PDF somewhere and post the URL; PDF files are removed
>>>>>> from the mailing list.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 27.07.2014 02:35, -A wrote:
>>>>>>
>>>>>>  Hello again. I've been trying to figure out this issue that has come
>>>>>>> up for me and in my research I found someone posting on StackOverflow
>>>>>>> (http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
>>>>>>> a similar issue where they could not read any colors from a PDF. The
>>>>>>> user posted the code and someone else took it, ran it, and reported
>>>>>>> that it worked. The user's approach was different than mine, but alas.
>>>>>>>
>>>>>>> I'm not sure at this point what is going on. I have stepped through
>>>>>>> each individual character and checked the PDGraphicsState object, and
>>>>>>> even when I am looking at an open file with visibly red text (attached)
>>>>>>> the debugger only reports DeviceGray. If I print out the ColorSpace name
>>>>>>> from the PDGraphicsState this is what is printed - for every character.
>>>>>>>
>>>>>>> I would appreciate it if someone could perhaps run the attached text
>>>>>>> stripper with the attached PDF file and report back if it actually
>>>>>>> prints true instead of false, as it does for me. Since I saw this
>>>>>>> occurrence elsewhere I'd like to rule that out - in case an IDE setting
>>>>>>> of some sort may be causing this?
>>>>>>>
>>>>>>> It should be noted that I began using PDFBox with 1.8.5 and had this
>>>>>>> code working fine. Still with 1.8.5 yesterday it was failing. Upgrading
>>>>>>> to 1.8.6 yielded the same results.
>>>>>>>
>>>>>>> If this is an actual issue I do not mind attempting to solve it if
>>>>>>> someone has a general idea where to point me, so as to prevent needless
>>>>>>> meddling with graphics state objects. Or, if this should be reported, I
>>>>>>> can do that as well.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Previous Message:*
>>>>>>>
>>>>>>> I've attached an updated stripper file with the only addition being a
>>>>>>> main function to test the class specifically.
>>>>>>>
>>>>>>> When run with the PDF I have also attached, it indeed does not
>>>>>>> recognize the red text.
>>>>>>>
>>>>>>> At this point it seems that this issue is solely dependent on PDFBox.
>>>>>>> I'll stay tuned for some insight hopefully. If any other information is
>>>>>>> needed, let me know!
>>>>>>>
