After writing the text below, I tested by adding the "rg" operator
to the properties list, and now it worked. I also tested deleting
your println and instead adding this if the text is red:
System.out.print(textPos.getCharacter());
and so I got this output:
21_Key .1295 R~Wall Prof LinP 0.003 0.004 0.000 true
which is exactly what is red in the PDF.
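To illustrate what "including the rg operator in the properties list" amounts to, here is a minimal sketch that uses only java.util.Properties. The handler class name is my assumption about PDFBox 1.8.x (check it against PageDrawer.properties in your jar), and OperatorTableSketch and withRgOperator are hypothetical names. The resulting Properties object would then be handed to the stripper in place of its default operator table.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class OperatorTableSketch {
    // ASSUMPTION: handler class for "rg" in PDFBox 1.8.x; verify against your jar.
    static final String ASSUMED_RG_HANDLER =
            "org.apache.pdfbox.util.operator.SetNonStrokingRGBColor";

    /** Returns a copy of the given operator table with an "rg" entry added. */
    static Properties withRgOperator(Properties base) {
        Properties extended = new Properties();
        extended.putAll(base);
        extended.setProperty("rg", ASSUMED_RG_HANDLER);
        return extended;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the text-extraction operator table, which lacks "rg".
        // The single entry here is illustrative only.
        Properties base = new Properties();
        base.load(new StringReader("BT=org.apache.pdfbox.util.operator.BeginText\n"));

        Properties extended = withRgOperator(base);
        System.out.println("rg in base table:     " + base.containsKey("rg"));
        System.out.println("rg in extended table: " + extended.containsKey("rg"));
    }
}
```

The copy is deliberate: the original table stays untouched, so the same base can still be used for plain text extraction.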
Another way (probably better) to do it would be to derive not from
PDFTextStripper but from PDFStreamEngine, and to construct it with
ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties");
see also http://stackoverflow.com/a/9157714/535646
Tilman
On 27.07.2014 12:14, Tilman Hausherr wrote:
Hi,
Do you still have the code that worked?
I'm not the text extraction specialist here, but what I did was to
look in the uncompressed source of the PDF. The stream has code like
this:
0 0 0 rg
0 0.5019 0 rg
1 0 0 rg
The first line sets to black, the second to green, the third to red.
And from what I saw, it can't work at all, because the "rg" operator
isn't processed when extracting text, because
PDFTextStripper.properties doesn't contain the "rg" operator. (The
operator is in another list, which is used when rendering)
So that is what puzzles me. I think it can't work at all. But you said
it did work at a time.
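As a rough illustration of how the three "rg" lines quoted above map to colors, here is a small sketch in plain Java that does not use PDFBox at all. RgOperatorSketch, parseRg and describe are hypothetical names, and the "dominant channel" naming rule is only a simplification for this example.

```java
final class RgOperatorSketch {
    /**
     * Parses a content-stream line such as "0 0.5019 0 rg" into {r, g, b}.
     * Returns null if the line is not a non-stroking RGB ("rg") operator.
     */
    static double[] parseRg(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != 4 || !tokens[3].equals("rg")) {
            return null;
        }
        try {
            return new double[] {
                Double.parseDouble(tokens[0]),
                Double.parseDouble(tokens[1]),
                Double.parseDouble(tokens[2])
            };
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Simplified naming: all zero is black, otherwise the dominant channel wins. */
    static String describe(double[] rgb) {
        double r = rgb[0], g = rgb[1], b = rgb[2];
        if (r == 0 && g == 0 && b == 0) return "black";
        if (r > g && r > b) return "red";
        if (g > r && g > b) return "green";
        if (b > r && b > g) return "blue";
        return "other";
    }

    public static void main(String[] args) {
        String[] lines = { "0 0 0 rg", "0 0.5019 0 rg", "1 0 0 rg" };
        for (String line : lines) {
            System.out.println(line + " -> " + describe(parseRg(line)));
        }
    }
}
```

Note that the match on "rg" is case-sensitive: the uppercase "RG" operator sets the stroking color and is intentionally rejected here.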
Tilman
On 27.07.2014 07:43, Tilman Hausherr wrote:
Hi,
Please upload the PDF somewhere and post the URL, PDF files are
removed from the mailing list.
Tilman
On 27.07.2014 02:35, -A wrote:
Hello again. I've been trying to figure out this issue that has come
up for me and in my research I found someone posting on
StackOverflow
(http://stackoverflow.com/questions/10844271/how-to-get-font-color-using-pdfbox)
a similar issue where they could not read any colors from a PDF. The
user posted the code and someone else took it, ran it, and reported
that it worked. The user's approach was different from mine, but alas.
I'm not sure at this point what is going on. I have stepped through
each individual character and checked the PDGraphicsState object,
and even when I am looking at an open file with visibly red text
(attached) the debugger only reports DeviceGray. If I print out the
ColorSpace name from the PDGraphicsState, DeviceGray is what is
printed, for every character.
I would appreciate if someone could perhaps run the attached text
stripper with the attached PDF file and report back if it actually
prints true instead of false, as it does for me. Since I saw this
occurrence elsewhere I'd like to rule that out - in case an IDE
setting of some sort may be causing this?
It should be noted that I began using PDFBox with 1.8.5 and had this
code working fine. Still with 1.8.5 yesterday it was failing.
Upgrading to 1.8.6 yielded the same results.
If this is an actual issue, I do not mind attempting to solve it if
someone has a general idea of where to point me, so as to prevent
needless meddling with graphics state objects. Or, if this should be
reported, I can do that as well.
Thanks!
-Aaron
*Previous Message:*
I've attached an updated stripper file with the only addition being
a main function to test the class specifically.
When run with the PDF I have also attached, it indeed does not
recognize the red text.
At this point it seems that this issue is solely dependent on
PDFBox. I'll stay tuned for some insight hopefully. If any other
information is needed, let me know!