Andy Phillips created PDFBOX-1804:
-------------------------------------
Summary: PDFTextStripper Issue related to word positions not
correctly being parsed
Key: PDFBOX-1804
URL: https://issues.apache.org/jira/browse/PDFBOX-1804
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.3
Reporter: Andy Phillips
While extracting text from a PDF with a custom PDFTextStripper subclass that
overrides writeString(String text, List<TextPosition> textPositions), I was
getting textPositions that did not line up with the text. The text positions
of every “word” in a line always come through as the text positions of the
last word in that line. I traced the issue to the PDFTextStripper class.
Here is the code at issue:
/**
 * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
 * @return The StringBuilder that must be used when calling this method.
 */
private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
        StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
{
    if (text instanceof WordSeparator)
    {
        normalized.add(createWord(lineBuilder.toString(), wordPositions));
        lineBuilder = new StringBuilder();
        wordPositions.clear();
    }
    else
    {
        lineBuilder.append(text.getCharacter());
        wordPositions.add(text);
    }
    return lineBuilder;
}
In the normalizeAdd method, you create a new word, passing wordPositions. A
reference to wordPositions is stored in the new WordWithTextPositions in the
normalized linked list, but two lines later you call wordPositions.clear().
Because the new word holds the same list object rather than a copy, that call
also clears the positions inside the WordWithTextPositions you just created.
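The shared-list behaviour described above can be reproduced without PDFBox. The following is only a minimal sketch: SharedListDemo, Word and splitWords are hypothetical stand-ins (Word plays the role of WordWithTextPositions, and single-character strings stand in for TextPosition objects). Flushing the final word after the loop without clearing the buffer is an assumption of the sketch, chosen to mirror the reported symptom.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedListDemo {
    /** Hypothetical stand-in for WordWithTextPositions: keeps the list reference it is given. */
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions; // no copy: shares the caller's list
        }
    }

    /** Splits a line on spaces, reusing one buffer list for every word. */
    static List<Word> splitWords(String line) {
        List<Word> words = new ArrayList<Word>();
        List<String> buffer = new ArrayList<String>();
        StringBuilder text = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (c == ' ') {
                words.add(new Word(text.toString(), buffer));
                text = new StringBuilder();
                buffer.clear(); // also empties the list held by the word just added
            } else {
                text.append(c);
                buffer.add(String.valueOf(c));
            }
        }
        // Assumption of this sketch: the last word is added without a clear().
        words.add(new Word(text.toString(), buffer));
        return words;
    }

    public static void main(String[] args) {
        for (Word w : splitWords("Hi yo")) {
            System.out.println(w.text + " -> " + w.positions);
        }
    }
}
```

Running main prints "Hi -> [y, o]" and "yo -> [y, o]": every word ends up reporting the positions of the last word, because all of them hold the same list.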
So I would suggest the following:
/**
 * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
 * @return The StringBuilder that must be used when calling this method.
 */
private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
        StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
{
    if (text instanceof WordSeparator)
    {
        // Pass a copy so that the clear() below does not empty the stored word's positions.
        normalized.add(createWord(lineBuilder.toString(),
                new ArrayList<TextPosition>(wordPositions)));
        lineBuilder = new StringBuilder();
        wordPositions.clear();
    }
    else
    {
        lineBuilder.append(text.getCharacter());
        wordPositions.add(text);
    }
    return lineBuilder;
}
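The effect of copying the list before it is stored can be checked in isolation. This is a sketch under stated assumptions rather than PDFBox code: CopiedListDemo, Word and splitWords are hypothetical names, Word stands in for WordWithTextPositions, and single-character strings stand in for TextPosition objects.

```java
import java.util.ArrayList;
import java.util.List;

public class CopiedListDemo {
    /** Hypothetical stand-in for WordWithTextPositions. */
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    /** Splits a line on spaces, handing each word its own copy of the buffer. */
    static List<Word> splitWords(String line) {
        List<Word> words = new ArrayList<Word>();
        List<String> buffer = new ArrayList<String>();
        StringBuilder text = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (c == ' ') {
                // Copy first, so clearing the buffer leaves the stored word intact.
                words.add(new Word(text.toString(), new ArrayList<String>(buffer)));
                text = new StringBuilder();
                buffer.clear();
            } else {
                text.append(c);
                buffer.add(String.valueOf(c));
            }
        }
        // Assumption of this sketch: the last word is flushed after the loop.
        words.add(new Word(text.toString(), new ArrayList<String>(buffer)));
        return words;
    }

    public static void main(String[] args) {
        for (Word w : splitWords("Hi yo")) {
            System.out.println(w.text + " -> " + w.positions);
        }
    }
}
```

Running main prints "Hi -> [H, i]" and "yo -> [y, o]", so each word keeps its own positions. The copy is shallow, which is enough here because only the list is reused, not the elements in it.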