Andy Phillips created PDFBOX-1804:
-------------------------------------

             Summary: PDFTextStripper Issue related to word positions not correctly being parsed
                 Key: PDFBOX-1804
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1804
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.3
            Reporter: Andy Phillips


While pulling text from a PDF with a custom PDFTextStripper subclass that
overrides writeString(String text, List<TextPosition> textPositions), I was
getting textPositions that did not line up with the text. The text positions
of every “word” in a line always come through as the text positions of the
last word in that line. I traced the issue to the PDFTextStripper class.

Here is the code in question:

    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            normalized.add(createWord(lineBuilder.toString(), wordPositions));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }


In the normalizeAdd method, a new word is created by passing wordPositions to
createWord. The new WordWithTextPositions stored in the normalized linked list
holds a reference to that same wordPositions list, not a copy of it. On the
next line the list is cleared with clear(), and because the list was passed by
reference, that also empties the positions of the WordWithTextPositions that
was just created. Every word in the line ends up sharing one list.

So I would suggest passing a copy of the list instead:
    /**
     * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
     * @return The StringBuilder that must be used when calling this method.
     */
    private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
            StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
    {
        if (text instanceof WordSeparator)
        {
            // Pass a copy so that the clear() below does not empty the
            // positions held by the word that was just created.
            normalized.add(createWord(lineBuilder.toString(),
                    new ArrayList<TextPosition>(wordPositions)));
            lineBuilder = new StringBuilder();
            wordPositions.clear();
        }
        else
        {
            lineBuilder.append(text.getCharacter());
            wordPositions.add(text);
        }
        return lineBuilder;
    }
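
The shared-list behaviour described above can be reproduced without PDFBox.
The following is only a minimal sketch: AliasingDemo, Word and split are
hypothetical names, plain strings stand in for TextPosition objects, and the
last word is assumed to be flushed after the loop without a clear(), which is
an assumption about how the caller of normalizeAdd behaves.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class AliasingDemo {
    // Hypothetical stand-in for WordWithTextPositions: it keeps the list it is given.
    public static final class Word {
        public final String text;
        public final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    // Splits tokens into words at " ". When copy is false, every word stores
    // the same list instance; when true, each word stores its own snapshot.
    public static List<Word> split(String[] tokens, boolean copy) {
        List<Word> words = new LinkedList<Word>();
        StringBuilder builder = new StringBuilder();
        List<String> positions = new ArrayList<String>();
        for (String token : tokens) {
            if (token.equals(" ")) {
                List<String> stored = copy ? new ArrayList<String>(positions) : positions;
                words.add(new Word(builder.toString(), stored));
                builder = new StringBuilder();
                positions.clear(); // also empties 'stored' when it is the same list
            } else {
                builder.append(token);
                positions.add("pos(" + token + ")");
            }
        }
        // Assumed behaviour: the trailing word is added without clearing the list.
        if (builder.length() > 0) {
            words.add(new Word(builder.toString(), positions));
        }
        return words;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", " ", "c"};
        for (boolean copy : new boolean[] {false, true}) {
            System.out.println(copy ? "with copy:" : "shared list:");
            for (Word w : split(tokens, copy)) {
                System.out.println("  " + w.text + " -> " + w.positions);
            }
        }
    }
}
```

With the shared list, both "ab" and "c" report [pos(c)], which matches the
symptom of every word carrying the last word's positions. With the copy, "ab"
keeps [pos(a), pos(b)].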




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
