[
https://issues.apache.org/jira/browse/PDFBOX-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler resolved PDFBOX-1804.
----------------------------------------
Resolution: Fixed
Fix Version/s: 2.0.0, 1.8.4
Assignee: Andreas Lehmkühler
I added the proposed fix in revisions 1554774 (trunk) and 1554775 (1.8 branch).
Thanks for the contribution!
> PDFTextStripper Issue related to word positions not correctly being parsed
> --------------------------------------------------------------------------
>
> Key: PDFBOX-1804
> URL: https://issues.apache.org/jira/browse/PDFBOX-1804
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Reporter: Andy Phillips
> Assignee: Andreas Lehmkühler
> Fix For: 1.8.4, 2.0.0
>
> Attachments: PDFBOX-1804.patch
>
>
> While pulling text from a PDF with a custom PDFTextStripper subclass that
> overrides writeString(String text, List<TextPosition> textPositions), I was
> getting textPositions that were not lined up with the text. The text
> positions of all “words” in a line always came over as the text positions
> of the “last” word in the line.
> I found the issue in the PDFTextStripper class.
> Here is the code at issue:
> /**
>  * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>  * @return The StringBuilder that must be used when calling this method.
>  */
> private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>         StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
> {
>     if (text instanceof WordSeparator)
>     {
>         normalized.add(createWord(lineBuilder.toString(), wordPositions));
>         lineBuilder = new StringBuilder();
>         wordPositions.clear();
>     }
>     else
>     {
>         lineBuilder.append(text.getCharacter());
>         wordPositions.add(text);
>     }
>     return lineBuilder;
> }
> In the normalizeAdd method, you create a new word by passing in
> wordPositions. The new WordWithTextPositions in the normalized linked list
> stores a reference to that same list, but shortly afterwards you call
> wordPositions.clear(). Since wordPositions was passed by reference, the
> positions held by the WordWithTextPositions you just created are cleared too.
> So I would suggest you do the following:
> /**
>  * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>  * @return The StringBuilder that must be used when calling this method.
>  */
> private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>         StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
> {
>     if (text instanceof WordSeparator)
>     {
>         normalized.add(createWord(lineBuilder.toString(),
>                 new ArrayList<TextPosition>(wordPositions)));
>         lineBuilder = new StringBuilder();
>         wordPositions.clear();
>     }
>     else
>     {
>         lineBuilder.append(text.getCharacter());
>         wordPositions.add(text);
>     }
>     return lineBuilder;
> }
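
The aliasing problem described in the report can be reproduced without PDFBox. The following is a minimal sketch, not the PDFBox code: AliasingDemo, Word, split and positionCounts are hypothetical names, Word stands in for WordWithTextPositions, and plain strings stand in for TextPosition objects. It assumes every word is followed by a separator, so the shared list ends up empty here; in the real stripper the leftover contents depend on where the last clear() happens.

```java
import java.util.ArrayList;
import java.util.List;

public class AliasingDemo {

    // Hypothetical stand-in for WordWithTextPositions: it keeps the list
    // reference it is given and does not copy it.
    static final class Word {
        final String text;
        final List<String> positions;

        Word(String text, List<String> positions) {
            this.text = text;
            this.positions = positions;
        }
    }

    // Splits a line on spaces, mimicking the add-then-clear pattern from the report.
    // With copy == false every Word shares the one working list; with copy == true
    // each Word receives its own snapshot before the working list is cleared.
    static List<Word> split(String line, boolean copy) {
        List<Word> words = new ArrayList<Word>();
        StringBuilder builder = new StringBuilder();
        List<String> positions = new ArrayList<String>();
        for (char c : (line + " ").toCharArray()) {
            if (c == ' ') {
                List<String> stored = copy ? new ArrayList<String>(positions) : positions;
                words.add(new Word(builder.toString(), stored));
                builder = new StringBuilder();
                positions.clear(); // also empties every Word that shares this list
            } else {
                builder.append(c);
                positions.add("pos(" + c + ")");
            }
        }
        return words;
    }

    // Returns how many positions each word still holds after splitting.
    static List<Integer> positionCounts(List<Word> words) {
        List<Integer> counts = new ArrayList<Integer>();
        for (Word w : words) {
            counts.add(w.positions.size());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(positionCounts(split("ab cde", false))); // prints [0, 0]
        System.out.println(positionCounts(split("ab cde", true)));  // prints [2, 3]
    }
}
```

The word text is unaffected in both runs because toString() produces a new String each time; only the list, which is handed over by reference, needs the defensive copy.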