Tilman,

That is somewhat embarrassing. At one point I brought this to the mailing list (because of the following warning) and was told to remove that line because the TextStripper wasn't actually a PageDrawer. The functionality still worked after that, however.

Is there a way to do this without the warning, perhaps something within PageDrawer?

Thank you,
-Aaron

WARNING: java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
java.lang.ClassCastException: IncrementalPDFStripper cannot be cast to org.apache.pdfbox.pdfviewer.PageDrawer
    at org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath.process(AppendRectangleToPath.java:46)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at IncrementalPDFStripper.containsRed(IncrementalPDFStripper.java:90)
    at IncrementalPDFStripper.main(IncrementalPDFStripper.java:56)

On Sun, Jul 27, 2014 at 5:47 AM, Tilman Hausherr <thaush...@t-online.de> wrote:

> It is even easier than I thought - replace super() with this:
>
> super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties", true));
>
> Tilman
>
> On 27.07.2014 13:03, Tilman Hausherr wrote:
>
>> After having written the text below, I tested by including the "rg"
>> operator in the properties list and now it worked. I also tested deleting
>> your println and instead adding this if the text is red:
>>
>> System.out.print (textPos.getCharacter());
>>
>> and so I got this output:
>>
>> 21_Key .1295 R~Wall Prof LinP 0.003 0.004 0.000 true
>>
>> which is exactly what is red in the PDF.
>>
>> Another way (probably better) to do it would be to derive not from
>> PDFTextStripper but from PDFStreamEngine, and construct it with
>>
>> ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties")
>>
>> see also http://stackoverflow.com/a/9157714/535646
>>
>> Tilman
>>
>> On 27.07.2014 12:14, Tilman Hausherr wrote:
>>
>>> Hi,
>>>
>>> Do you still have the code that worked?
>>>
>>> I'm not the text extraction specialist here, but what I did was to look
>>> in the uncompressed source of the PDF. The stream has code like this:
>>>
>>> 0 0 0 rg
>>> 0 0.5019 0 rg
>>> 1 0 0 rg
>>>
>>> The first line sets the color to black, the second to green, the third
>>> to red. And from what I saw, it can't work at all, because the "rg"
>>> operator isn't processed when extracting text, because
>>> PDFTextStripper.properties doesn't contain the "rg" operator. (The
>>> operator is in another list, which is used when rendering.)
>>>
>>> So that is what puzzles me. I think it can't work at all. But you said
>>> it did work at one time.
>>>
>>> Tilman
>>>
>>> On 27.07.2014 07:43, Tilman Hausherr wrote:
>>>
>>>> Hi,
>>>>
>>>> Please upload the PDF somewhere and post the URL; PDF files are removed
>>>> from the mailing list.
>>>>
>>>> Tilman
>>>>
>>>> On 27.07.2014 02:35, -A wrote:
>>>>
>>>>> Hello again. I've been trying to figure out this issue that has come
>>>>> up for me, and in my research I found someone posting on StackOverflow
>>>>> (http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
>>>>> about a similar issue where they could not read any colors from a PDF.
>>>>> The user posted the code and someone else took it, ran it, and reported
>>>>> that it worked. The user's approach was different from mine, but alas.
>>>>>
>>>>> I'm not sure at this point what is going on.
>>>>> I have stepped through
>>>>> each individual character and checked the PDGraphicsState object, and even
>>>>> when I am looking at an open file with visibly red text (attached) the
>>>>> debugger only reports DeviceGray. If I print out the ColorSpace name from
>>>>> the PDGraphicsState this is what is printed - for every character.
>>>>>
>>>>> I would appreciate it if someone could run the attached text
>>>>> stripper with the attached PDF file and report back whether it actually
>>>>> prints true instead of false, as it does for me. Since I saw this
>>>>> occurrence elsewhere I'd like to rule that out - in case an IDE setting
>>>>> of some sort may be causing this?
>>>>>
>>>>> It should be noted that I began using PDFBox with 1.8.5 and had this
>>>>> code working fine. Yesterday, still on 1.8.5, it was failing. Upgrading
>>>>> to 1.8.6 yielded the same results.
>>>>>
>>>>> If this is an actual issue I do not mind attempting to solve it if
>>>>> someone has a general idea where to point me, so as to prevent needless
>>>>> meddling with graphics state objects. Or, if this should be reported, I
>>>>> can do that as well.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Aaron
>>>>>
>>>>> *Previous Message:*
>>>>>
>>>>> I've attached an updated stripper file with the only addition being a
>>>>> main function to test the class specifically.
>>>>>
>>>>> When run with the PDF I have also attached, it indeed does not
>>>>> recognize the red text.
>>>>>
>>>>> At this point it seems that this issue is solely dependent on PDFBox.
>>>>> I'll stay tuned for some insight hopefully. If any other information is
>>>>> needed, let me know!
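
Tilman's reading of the uncompressed content stream can be illustrated without PDFBox. The following is only a toy sketch in plain Java, not PDFBox code: the class name RgColorTracker and its methods are hypothetical, and the tokenizer assumes a stream made only of whitespace-separated numbers and operator names (no strings, arrays or dictionaries). It shows the idea behind the "rg" operator: the three numbers before it become the current non-stroking (fill) color, so any engine that skips "rg" never sees the change from black to red.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy tracker (hypothetical, not part of PDFBox) for "rg" in an uncompressed content stream. */
final class RgColorTracker {

    /** Returns every fill color set by an "rg" operator, in stream order, as {r, g, b}. */
    static List<float[]> fillColors(String contentStream) {
        List<float[]> colors = new ArrayList<>();
        Deque<Float> operands = new ArrayDeque<>();
        for (String token : contentStream.trim().split("\\s+")) {
            try {
                // Numeric tokens are operands waiting for the next operator.
                operands.addLast(Float.parseFloat(token));
            } catch (NumberFormatException notANumber) {
                // Anything else is treated as an operator name.
                if (token.equals("rg") && operands.size() >= 3) {
                    float b = operands.removeLast();
                    float g = operands.removeLast();
                    float r = operands.removeLast();
                    colors.add(new float[] {r, g, b});
                }
                // Every operator consumes its pending operands.
                operands.clear();
            }
        }
        return colors;
    }

    /** Assumed definition of "red" for this sketch: full red, no green, no blue. */
    static boolean isRed(float[] rgb) {
        return rgb[0] == 1f && rgb[1] == 0f && rgb[2] == 0f;
    }

    public static void main(String[] args) {
        // The three lines quoted from the PDF's content stream.
        String stream = "0 0 0 rg\n0 0.5019 0 rg\n1 0 0 rg\n";
        for (float[] c : fillColors(stream)) {
            System.out.println(c[0] + " " + c[1] + " " + c[2] + " red=" + isRed(c));
        }
    }
}
```

In PDFBox itself this bookkeeping is done by the operator processors named in the properties file passed to the engine's constructor, which is why loading a list that includes "rg" makes the color visible to the stripper.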