Thank you, that works as promised and removes the warning. I'm still hoping to find a resource that better explains the pieces of PDFBox and how they work together. Unfortunately, most posts on the internet cover only the how and not the why.
Appreciate it!

-Aaron

On Sun, Jul 27, 2014 at 8:00 AM, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
>
> That didn't happen to me, but maybe it did happen to you with another file.
>
> Another solution would be to pass your own properties file, and it should
> have this content:
>
> =======================
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements. See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License. You may obtain a copy of the License at
> #
> # http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
>
> # This table maps PDF stream operators to concrete OperatorProcessor
> # subclasses that are used by the PDFStreamEngine class to interpret the
> # PDF document. The classes configured here allow the PDFTextStripper
> # subclass of PDFStreamEngine to extract text content of the document.
>
> BT = org.apache.pdfbox.util.operator.BeginText
> cm = org.apache.pdfbox.util.operator.Concatenate
> Do = org.apache.pdfbox.util.operator.Invoke
> ET = org.apache.pdfbox.util.operator.EndText
> gs = org.apache.pdfbox.util.operator.SetGraphicsStateParameters
> q = org.apache.pdfbox.util.operator.GSave
> Q = org.apache.pdfbox.util.operator.GRestore
> T* = org.apache.pdfbox.util.operator.NextLine
> Tc = org.apache.pdfbox.util.operator.SetCharSpacing
> Td = org.apache.pdfbox.util.operator.MoveText
> TD = org.apache.pdfbox.util.operator.MoveTextSetLeading
> Tf = org.apache.pdfbox.util.operator.SetTextFont
> Tj = org.apache.pdfbox.util.operator.ShowText
> TJ = org.apache.pdfbox.util.operator.ShowTextGlyph
> TL = org.apache.pdfbox.util.operator.SetTextLeading
> Tm = org.apache.pdfbox.util.operator.SetMatrix
> Tr = org.apache.pdfbox.util.operator.SetTextRenderingMode
> Ts = org.apache.pdfbox.util.operator.SetTextRise
> Tw = org.apache.pdfbox.util.operator.SetWordSpacing
> Tz = org.apache.pdfbox.util.operator.SetHorizontalTextScaling
> w = org.apache.pdfbox.util.operator.SetLineWidth
> \' = org.apache.pdfbox.util.operator.MoveAndShow
> \" = org.apache.pdfbox.util.operator.SetMoveAndShow
>
> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
> SC=org.apache.pdfbox.util.operator.SetStrokingColor
> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
>
> # The following operators are not
> # relevant to text extraction,
> # so we can silently ignore them.
>
> b
> B
> b*
> B*
> BDC
> BI
> BMC
> BX
> c
> d
> d0
> d1
> DP
> EI
> EMC
> EX
> f
> F
> f*
> h
> i
> ID
> j
> J
> l
> m
> M
> MP
> n
> re
> ri
> s
> S
> sh
> v
> W
> W*
> y
>
> =======================
>
> Tilman
>
> On 27.07.2014 15:54, -A wrote:
>
>> Tilman;
>>
>> That is somewhat embarrassing. At one point I brought this to the mailing
>> list (because of the following warning) and was told to remove that line
>> because the TextStripper wasn't actually a PageDrawer. The functionality
>> still worked after that, however.
>>
>> Is there a way to do this without the warning, perhaps something within
>> PageDrawer?
>>
>> Thank you,
>> -Aaron
>>
>> WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>> java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
>>     at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>>     at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
>>     at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)
>>
>> On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr <thaush...@t-online.de>
>> wrote:
>>
>>> It is even easier than I thought - replace super() with this:
>>>
>>> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>>>
>>> Tilman
>>>
>>> On 27.07.2014 13:03, Tilman Hausherr wrote:
>>>
>>>> After having written the text below, I tested
>>>> by including the "rg" operator in the properties list and now it
>>>> worked. I also tested deleting your println and instead adding this
>>>> if the text is red:
>>>>
>>>> System.out.print(textPos.getCharacter());
>>>>
>>>> and so I got this output:
>>>>
>>>> 21_Key .1295 R~Wall Prof LinP 0.003 0.004 0.000 true
>>>>
>>>> which is exactly what is red in the PDF.
>>>>
>>>> Another way (probably better) would be to derive not from
>>>> PDFTextStripper but from PDFStreamEngine, and construct it with
>>>>
>>>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>>>
>>>> see also http://stackoverflow.com/a/9157714/535646
>>>>
>>>> Tilman
>>>>
>>>> On 27.07.2014 12:14, Tilman Hausherr wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Do you still have the code that worked?
>>>>>
>>>>> I'm not the text extraction specialist here, but what I did was to look
>>>>> in the uncompressed source of the PDF. The stream has code like this:
>>>>>
>>>>> 0 0 0 rg
>>>>> 0 0.5019 0 rg
>>>>> 1 0 0 rg
>>>>>
>>>>> The first line sets the color to black, the second to green, the third
>>>>> to red. And from what I saw, it can't work at all, because the "rg"
>>>>> operator isn't processed when extracting text, because
>>>>> PDFTextStripper.properties doesn't contain the "rg" operator. (The
>>>>> operator is in another list, which is used when rendering.)
>>>>>
>>>>> So that is what puzzles me. I think it can't work at all. But you said
>>>>> it did work at one time.
>>>>>
>>>>> Tilman
>>>>>
>>>>> On 27.07.2014 07:43, Tilman Hausherr wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Please upload the PDF somewhere and post the URL; PDF files are
>>>>>> removed from the mailing list.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> On 27.07.2014 02:35, -A wrote:
>>>>>>
>>>>>>> Hello again.
>>>>>>> I've been trying to figure out this issue that has come up for me,
>>>>>>> and in my research I found someone posting on StackOverflow
>>>>>>> (http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
>>>>>>> a similar issue where they could not read any colors from a PDF.
>>>>>>> The user posted the code and someone else took it, ran it, and
>>>>>>> reported that it worked. The user's approach was different than
>>>>>>> mine, but alas.
>>>>>>>
>>>>>>> I'm not sure at this point what is going on. I have stepped through
>>>>>>> each individual character and checked the PDGraphicsState object,
>>>>>>> and even when I am looking at an open file with visibly red text
>>>>>>> (attached) the debugger only reports DeviceGray. If I print out the
>>>>>>> ColorSpace name from the PDGraphicsState this is what is printed -
>>>>>>> for every character.
>>>>>>>
>>>>>>> I would appreciate it if someone could perhaps run the attached text
>>>>>>> stripper with the attached PDF file and report back if it actually
>>>>>>> prints true instead of false, as it does for me. Since I saw this
>>>>>>> occurrence elsewhere I'd like to rule that out - in case an IDE
>>>>>>> setting of some sort may be causing this?
>>>>>>>
>>>>>>> It should be noted that I began using PDFBox with 1.8.5 and had this
>>>>>>> code working fine. Still with 1.8.5 yesterday it was failing.
>>>>>>> Upgrading to 1.8.6 yielded the same results.
>>>>>>>
>>>>>>> If this is an actual issue I do not mind attempting to solve it if
>>>>>>> someone may have a general idea where to point me, so as to prevent
>>>>>>> needless meddling with graphics state objects. Or, if this should be
>>>>>>> reported I can do that as well.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>> *Previous Message:*
>>>>>>>
>>>>>>> I've attached an updated stripper file with the only addition being a
>>>>>>> main function to test the class specifically.
>>>>>>>
>>>>>>> When run with the PDF I have also attached, it indeed does not
>>>>>>> recognize the red text.
>>>>>>>
>>>>>>> At this point it seems that this issue is solely dependent on PDFBox.
>>>>>>> I'll stay tuned for some insight hopefully. If any other information
>>>>>>> is needed, let me know!
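
For anyone reading this thread for the "why": the properties file acts as a lookup table from operator to handler, and an operator with no handler is skipped, so its effect never reaches the graphics state. The sketch below is a toy model of that idea in plain Java. It is not PDFBox code. The names OperatorTableSketch, Handler, State, KNOWN and colorSpaceAfter are hypothetical, the table values are short labels rather than real class names, and it assumes a content stream made only of numbers and operators separated by whitespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/** Toy model of an operator table (hypothetical, not the PDFBox implementation). */
final class OperatorTableSketch {

    /** Handles one operator; operands are the numbers that preceded it. */
    interface Handler {
        void process(List<Double> operands, State state);
    }

    /** Minimal stand-in for a graphics state: only the fill colour space. */
    static final class State {
        String colorSpace = "DeviceGray"; // initial colour space in a PDF page
    }

    /** Hypothetical registry standing in for the classes a properties file names. */
    static final Map<String, Handler> KNOWN = new HashMap<>();
    static {
        KNOWN.put("SetNonStrokingRGBColor",
                (operands, state) -> state.colorSpace = "DeviceRGB");
    }

    /** Runs the content stream against the table and reports the final colour space. */
    static String colorSpaceAfter(String table, String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(table));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        State state = new State();
        List<Double> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.matches("-?\\d*\\.?\\d+")) {
                operands.add(Double.parseDouble(token)); // collect operands
                continue;
            }
            // Operators that are missing, or listed without a value, have no handler.
            Handler handler = KNOWN.get(props.getProperty(token, ""));
            if (handler != null) {
                handler.process(operands, state);
            }
            operands.clear();
        }
        return state.colorSpace;
    }

    public static void main(String[] args) {
        String textTable = "Tj = ShowText\nre\n"; // no "rg" entry
        String renderTable = textTable + "rg = SetNonStrokingRGBColor\n";
        System.out.println(colorSpaceAfter(textTable, "1 0 0 rg"));   // DeviceGray
        System.out.println(colorSpaceAfter(renderTable, "1 0 0 rg")); // DeviceRGB
    }
}
```

With the table that lacks "rg", the red fill command is dropped and the state stays at DeviceGray, which matches the symptom reported at the start of the thread. Adding the "rg" entry is the toy equivalent of loading a properties file that maps the colour operators.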