[jira] [Comment Edited] (PDFBOX-1804) PDFTextStripper Issue related to word positions not correctly being parsed

Tilman Hausherr (JIRA) Thu, 12 Dec 2013 21:07:06 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13847168#comment-13847168
 ]


Tilman Hausherr edited comment on PDFBOX-1804 at 12/13/13 5:05 AM:
-------------------------------------------------------------------

@Andy: you don't. Just attach your diff patches here, and a committer will 
decide whether to pick them up. Even if not, some developers might take them up 
for their local source copy of pdfbox.

See also:
https://www.apache.org/foundation/how-it-works.html#roles


was (Author: tilman):
@Andy: you don't. Just attached your a diff patches here, and a developer will 
decide whether to pick them up. Even if not, some users might take them up for 
themselves.

See also here
https://www.apache.org/foundation/how-it-works.html#roles

> PDFTextStripper Issue related to word positions not correctly being parsed
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-1804
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1804
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>            Reporter: Andy Phillips
>         Attachments: PDFBOX-1804.patch
>
>
> I am pulling text from a PDF with a custom PDFTextStripper subclass that
> overrides writeString(String text, List<TextPosition> textPositions), and
> the textPositions I receive do not line up with the text. The text
> positions of every “word” in a line always come through as the text
> positions of the last word in the line.
> I traced the issue to the PDFTextStripper class.
> Here is the code at issue:
>     /**
>      * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>      * @return The StringBuilder that must be used when calling this method.
>      */
>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>             StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
>     {
>         if (text instanceof WordSeparator)
>         {
>             normalized.add(createWord(lineBuilder.toString(), wordPositions));
>             lineBuilder = new StringBuilder();
>             wordPositions.clear();
>         }
>         else
>         {
>             lineBuilder.append(text.getCharacter());
>             wordPositions.add(text);
>         }
>         return lineBuilder;
>     }
> When normalizeAdd encounters a WordSeparator, it creates a new word and
> passes wordPositions to it. The new WordWithTextPositions in the normalized
> linked list stores a reference to that same list, and the next statement
> calls wordPositions.clear(). Because the list was passed by reference
> rather than copied, the positions held by the WordWithTextPositions just
> created are cleared as well.
> So I would suggest the following:
>     /**
>      * Used within {@link #normalize(List, boolean, boolean)} to handle a {@link TextPosition}.
>      * @return The StringBuilder that must be used when calling this method.
>      */
>     private StringBuilder normalizeAdd(LinkedList<WordWithTextPositions> normalized,
>             StringBuilder lineBuilder, List<TextPosition> wordPositions, TextPosition text)
>     {
>         if (text instanceof WordSeparator)
>         {
>             normalized.add(createWord(lineBuilder.toString(),
>                     new ArrayList<TextPosition>(wordPositions)));
>             lineBuilder = new StringBuilder();
>             wordPositions.clear();
>         }
>         else
>         {
>             lineBuilder.append(text.getCharacter());
>             wordPositions.add(text);
>         }
>         return lineBuilder;
>     }
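
The list-aliasing behaviour described in the issue can be reproduced without
PDFBox. The following is only a minimal sketch under stated assumptions: the
class AliasingDemo, its nested Word type, and the split method are
hypothetical stand-ins (not PDFBox code), and character indices are used in
place of TextPosition objects. With copy set to false the shared list is
stored directly, mirroring the reported code; with copy set to true a
defensive copy is stored, mirroring the suggested change.

```java
import java.util.ArrayList;
import java.util.List;

public class AliasingDemo {

    /** Hypothetical stand-in for a word holder: it keeps the list it is handed. */
    static final class Word {
        final String text;
        final List<Integer> positions;

        Word(String text, List<Integer> positions) {
            this.text = text;
            this.positions = positions; // stores the reference, no copy made here
        }
    }

    /**
     * Splits a line on spaces and records the character indices of each word.
     * If copy is false, every Word shares the single working list (the bug);
     * if copy is true, each Word receives its own snapshot (the fix).
     */
    static List<Word> split(String line, boolean copy) {
        List<Word> words = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                words.add(new Word(builder.toString(),
                        copy ? new ArrayList<>(positions) : positions));
                builder = new StringBuilder();
                positions.clear(); // also empties any Word that shares this list
            } else {
                builder.append(c);
                positions.add(i);
            }
        }
        words.add(new Word(builder.toString(),
                copy ? new ArrayList<>(positions) : positions));
        return words;
    }

    public static void main(String[] args) {
        // Shared list: the first word reports the last word's positions.
        System.out.println(split("ab cd", false).get(0).positions); // [3, 4]
        // Defensive copy: the first word keeps its own positions.
        System.out.println(split("ab cd", true).get(0).positions);  // [0, 1]
    }
}
```

An alternative to copying at the call site would be to allocate a fresh list
for each word instead of calling clear(); either way, the point is that the
stored word and the working buffer must not share one list instance.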



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
