John,
I’m on it (tracking it down). I don’t think I changed anything related to what 
PDFBox was doing, but of course I could be wrong.

My first instinct was to download the new 1.8.6 (I was using 1.8.5), but I get 
the same result. I am currently looking at the other extended TextStripper 
classes for some insight, but given that this stripper was working previously, 
I’m not sure what outside of this class could be affecting its result.

I have attached my extended class in a text document.  If there is anything 
glaring within there please let me know - I am going to start tracing the usage 
paths to that class.


Thanks!

-Aaron
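// Imports as packaged in PDFBox 1.8.x
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class IncrementalPDFStripper extends PDFTextStripper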
{

    /**
     * boolean to denote if a parsed file has red text in it
     */
    private boolean hasRed;


    /**
     * IncrementalPDFStripper constructor
     *
     * @throws java.io.IOException
     */
    public IncrementalPDFStripper() throws IOException
    {

        super();

        super.setSortByPosition(true);

        this.hasRed = false;    // initialize to no red

    }

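    /**
     * Checks every page of a PDF document for red text.
     * Note that this method closes the document before returning.
     *
     * @param doc <code>PDDocument</code> of the PDF to be checked for red.
     * @return true if red text was found on any page
     * @throws IOException if a page's content stream cannot be processed
     */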
    public boolean containsRed(PDDocument doc) throws IOException
    {


        // Reset hasRed in case the method is run again on the same object
        this.hasRed = false;

        // Get a list of pages within the document
        List<PDPage> pages = doc.getDocumentCatalog().getAllPages();

        // FOR every page in the document
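        for (PDPage page : pages) {
            // Skip pages without a content stream to avoid a NullPointerException
            if (page.getContents() != null) {
                processStream(page, page.getResources(),
                        page.getContents().getStream());   // process the page
            }
        }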

        doc.close();


        return hasRed;

    }

    /**
     * Overridden method with simple functionality added to set a flag
     * if a desired color is found.
     *
     * @param textPos <code>TextPosition</code> representing the current
     *                position in the page's text.
     */
    @Override
    protected void processTextPosition(TextPosition textPos)
    {
        try
        {
            PDGraphicsState graphicsState = getGraphicsState();

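            // IF the current text contains RED. Note: a red component of 255
            // also matches white, yellow and magenta.
            java.awt.Color color =
                    graphicsState.getNonStrokingColor().getJavaColor();
            if (color != null && color.getRed() == 255)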
            {
                this.hasRed = true;
            }

        }
        catch (IOException ioe)
        {
            ioe.printStackTrace();
        }

    }


}
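
One thing worth checking in the class above: a test on the red component alone is also satisfied by white, yellow and magenta, which all have a red value of 255. The following is a minimal sketch of a stricter test and is not part of the attached class. The RedCheck class, the isRed method and both threshold values are hypothetical, chosen only for illustration; java.awt.Color is the only real API used.

```java
import java.awt.Color;

// Hypothetical helper, not part of PDFBox or of the attached class.
final class RedCheck {

    // Assumed thresholds; tune them for the documents being checked.
    static final int MIN_RED = 200;
    static final int MAX_OTHER = 60;

    private RedCheck() {
    }

    // Returns true only when red dominates and green and blue stay low.
    static boolean isRed(Color c) {
        if (c == null) {
            return false;   // no colour available, so not red
        }
        return c.getRed() >= MIN_RED
                && c.getGreen() <= MAX_OTHER
                && c.getBlue() <= MAX_OTHER;
    }
}
```

Inside processTextPosition, the colour obtained from getNonStrokingColor().getJavaColor() could be passed to RedCheck.isRed in place of the comparison of getRed() with 255.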

On Jul 25, 2014, at 2:33 PM, John Hewson <[email protected]> wrote:

> Hi Aaron
> 
> You’re probably going to have to track down the change that caused your
> code to stop functioning. Are you working against the 2.0 trunk? There have
> been a number of changes recently which affect graphics state and text
> extraction.
> 
> If you are working against the trunk then try checking out the latest version
> and setting a conditional breakpoint where you expect the red colour in 
> processTextPosition and see if it gets hit: if not then it could be a new bug
> in PDFBox or some internal quirk of how you’re detecting red, in which case
> you might want to share the relevant line(s) of code.
> 
> Cheers
> 
> -- John
> 
> On 25 Jul 2014, at 13:15, -A <[email protected]> wrote:
> 
>> Hi again, everyone-
>> 
>> Finishing up this program I am working on and heading back to the testing
>> phase - and suddenly my program is not detecting red text within PDFs. The
>> old method was just to extend the TextStripper class and implement a
>> containsRed method that basically loops through every page and processes
>> the stream. I overrode the processTextPosition method to check for red
>> non-stroking colors at the given position.
>> 
>> This was working. I also had to use a plain TextStripper, because my
>> extended version would, for some reason, error out when getting all of the
>> text from the file. Just wanted to give some background: in the PDF class
>> that I created I am using two TextStrippers (I thought they might be
>> conflicting), one to get all of the text and the other to see if there is
>> red within the text.
>> 
>> I am trying to debug this: I have stepped through the file's text
>> positions up to some actual red text, and it just shows up in the IDE as
>> System Grey, I believe (or some variant of that).
>> 
>> It is perfectly plausible that I changed something inadvertently - but by
>> chance would any of you have any clue as to why it may not be seeing the
>> red text now?
>> 
>> 
>> Thank you for your guys' time!
>> 
>> Sincerely,
>> Aaron
>> 
>> 
>> P.S. If John Hewson ends up responding to this feel free to write me
>> directly if it is more convenient.
> 
