John,
I’m on it (tracking it down). I don’t think I changed anything related to what
PDFBox was doing, but of course I could be wrong.
My first instinct was to download the new 1.8.6 (was using 1.8.5) but I get the
same result. I am currently looking at the other extended TextStripper classes
for some insight - but given that this stripper was working previously I’m not
sure what outside of this class could be affecting its result.
I have attached my extended class in a text document. If there is anything
glaring within there please let me know - I am going to start tracing the usage
paths to that class.
Thanks!
-Aaron
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class IncrementalPDFStripper extends PDFTextStripper
{
    /**
     * boolean to denote if a parsed file has red text in it
     */
    private boolean hasRed;

    /**
     * IncrementalPDFStripper constructor
     *
     * @throws java.io.IOException
     */
    public IncrementalPDFStripper() throws IOException
    {
        super();
        super.setSortByPosition(true);
        this.hasRed = false; // initialize to no red
    }

    /**
     * Method to parse a PDF document. Note that the document is closed
     * before this method returns.
     *
     * @param doc <code>PDDocument</code> of the PDF to be checked for red.
     * @return true if red text was found on any page.
     * @throws IOException
     */
    public boolean containsRed(PDDocument doc) throws IOException
    {
        // Reset hasRed in case the method is run again on the same object
        this.hasRed = false;

        // Get a list of pages within the document
        List<PDPage> pages = doc.getDocumentCatalog().getAllPages();

        // FOR every page in the document
        for (PDPage page : pages) {
            processStream(page, page.getResources(),
                    page.getContents().getStream()); // process the page
        }

        doc.close();
        return hasRed;
    }

    /**
     * Overridden method with simple functionality added to set a flag
     * if a desired color is found.
     *
     * @param textPos <code>TextPosition</code> representing the current
     *                position in the page's text.
     */
    @Override
    protected void processTextPosition(TextPosition textPos)
    {
        try
        {
            PDGraphicsState graphicsState = getGraphicsState();

            // IF the current non-stroking color has a full red channel
            if (graphicsState.getNonStrokingColor().getJavaColor().getRed()
                    == 255)
            {
                this.hasRed = true;
            }
        }
        catch (IOException ioe)
        {
            ioe.printStackTrace();
        }
    }
}
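
A side note on the color test in processTextPosition, as a minimal sketch
that is not part of the attached class: the check only looks at the red
channel, so it passes for white as well as for red, and it fails for any
grey below full intensity (which would be consistent with the grey seen
in the IDE). The class name RedCheckSketch and the method isPureRed are
hypothetical names used only for illustration, the null guard is an
addition, and whether "pure red" is the right definition for these PDFs
is an assumption.

```java
import java.awt.Color;

final class RedCheckSketch {
    // Same comparison as the attached class (red channel only),
    // with a null guard added for the sketch.
    static boolean redChannelIsFull(Color c) {
        return c != null && c.getRed() == 255;
    }

    // Hypothetical stricter test: full red with no green or blue.
    static boolean isPureRed(Color c) {
        return c != null
                && c.getRed() == 255
                && c.getGreen() == 0
                && c.getBlue() == 0;
    }

    public static void main(String[] args) {
        Color[] samples = {
            Color.RED, Color.WHITE, Color.BLACK, new Color(128, 128, 128)
        };
        // Print both results side by side for each sample color.
        for (Color c : samples) {
            System.out.println(c
                    + " channel-only=" + redChannelIsFull(c)
                    + " pure-red=" + isPureRed(c));
        }
    }
}
```

Printing the java.awt.Color obtained inside processTextPosition in the
same way may help show which color the graphics state actually reports
at the position where red text is expected.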
On Jul 25, 2014, at 2:33 PM, John Hewson <[email protected]> wrote:
> Hi Aaron
>
> You’re probably going to have to track down the change that caused your
> code to stop functioning. Are you working against the 2.0 trunk? There have
> been a number of changes recently which affect graphics state and text
> extraction.
>
> If you are working against the trunk then try checking out the latest version
> and setting a conditional breakpoint where you expect the red colour in
> processTextPosition and see if it gets hit: if not then it could be a new bug
> in PDFBox or some internal quirk of how you’re detecting red, in which case
> you might want to share the relevant line(s) of code.
>
> Cheers
>
> -- John
>
> On 25 Jul 2014, at 13:15, -A <[email protected]> wrote:
>
>> Hi again, everyone-
>>
>> I am finishing up this program and heading back to the testing phase, and
>> suddenly my program is not detecting red text within PDFs. The old method
>> was just to extend the TextStripper class and implement a containsRed
>> method that basically loops through every page and processes the stream.
>> I overrode the processTextPosition method to check for red non-stroking
>> colors at the given position.
>>
>> This was working. I also had to use a plain TextStripper class, as my
>> extended version would for some reason error out when getting all of the
>> text from the file. For background: in the PDF class that I created I am
>> using two TextStrippers (I thought they might be conflicting), one to get
>> all of the text and the other to see if there is red within the text.
>>
>> I am trying to debug this. I have stepped through the file's text
>> positions to some actual red text, and the color just shows up in the IDE
>> as System Grey, I believe (or some variant of that).
>>
>> It is perfectly plausible that I changed something inadvertently - but by
>> chance would any of you have any clue as to why it may not be seeing the
>> red text now?
>>
>>
>> Thank you for your guys' time!
>>
>> Sincerely,
>> Aaron
>>
>>
>> P.S. If John Hewson ends up responding to this feel free to write me
>> directly if it is more convenient.
>