Sebastien,

I totally agree that this would be a good change, having run into the same
problem when working out my own mods to the text extraction some time ago.

Please create a JIRA issue proposing this at:
https://issues.apache.org/jira/browse/PDFBOX

Mel


-----Original Message-----
From: Sébastien Dailly [mailto:[email protected]] 
Sent: Tuesday, November 08, 2011 4:27 AM
To: [email protected]
Subject: PDFTextStripper : can't change the default TextPositionComparator

Hello,

I'm trying to use the PDFTextStripper class, but sortByPosition does 
not seem to work correctly when the characters on the same line are 
not at exactly the same y position.

There is no way to replace the TextPositionComparator used in the class 
with my own, even by subclassing PDFTextStripper (see the note below).

One solution is to use a getter instead of a hard-coded dependency between the classes:

>             List<TextPosition> textList = charactersByArticle.get( i );
>             if( getSortByPosition() )
>             {
>                 TextPositionComparator comparator = new TextPositionComparator();
>                 Collections.sort( textList, comparator );
>             }

becomes:

>             List<TextPosition> textList = charactersByArticle.get( i );
>             if( getSortByPosition() )
>             {
>                 Comparator comparator = getTextPositionComparator();
>                 Collections.sort( textList, comparator );
>             }

with getTextPositionComparator defined as follows:

> private Class<? extends Comparator> textPositionComparator = TextPositionComparator.class;

> […]

>       /**
>        *
>        * @return The comparator for ordering text positions.
>        */
>       public Comparator getTextPositionComparator() {
>               try {
>                       return textPositionComparator.newInstance();
>               } catch (final InstantiationException e) {
>                       return null;
>               } catch (final IllegalAccessException e) {
>                       return null;
>               }
>       }

(with the appropriate setter).
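
As a rough illustration of the same idea, here is a minimal, self-contained
sketch that stores a comparator instance rather than a Class, which avoids
the reflection and the null return on failure. Nothing in it is PDFBox code:
Glyph, SketchStripper and BandedComparator are hypothetical stand-ins, and
the default ordering (strict y, then x) is only an assumption made to
reproduce the symptom described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for TextPosition: only text and coordinates.
class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical stand-in for the stripper, reduced to the sorting step.
class SketchStripper {
    // Assumed default: strict y ordering first, then x.
    private Comparator<Glyph> glyphComparator = new Comparator<Glyph>() {
        @Override
        public int compare(Glyph a, Glyph b) {
            int byY = Float.compare(a.y, b.y);
            return byY != 0 ? byY : Float.compare(a.x, b.x);
        }
    };

    // The getter used by the sorting code; subclasses could also override it.
    Comparator<Glyph> getGlyphComparator() {
        return glyphComparator;
    }

    // The setter rejects null so sorting never fails later on.
    void setGlyphComparator(Comparator<Glyph> comparator) {
        this.glyphComparator = Objects.requireNonNull(comparator);
    }

    // Sorts a copy of the input with the configured comparator and joins the text.
    String sortAndJoin(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<Glyph>(glyphs);
        Collections.sort(sorted, getGlyphComparator());
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }
}

// Example replacement: group glyphs into horizontal bands of fixed height,
// then order by x inside a band. Band indices keep the ordering transitive.
class BandedComparator implements Comparator<Glyph> {
    private final float bandHeight;

    BandedComparator(float bandHeight) {
        this.bandHeight = bandHeight;
    }

    @Override
    public int compare(Glyph a, Glyph b) {
        int byBand = Integer.compare(band(a), band(b));
        return byBand != 0 ? byBand : Float.compare(a.x, b.x);
    }

    private int band(Glyph g) {
        return (int) Math.floor(g.y / bandHeight);
    }
}
```

A caller would then write something like
stripper.setGlyphComparator(new BandedComparator(2f)) before extracting text.
Fixed bands are only one possible tolerance strategy: two glyphs that sit on
either side of a band boundary are still split, so a real replacement
comparator would probably need something smarter.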

Note:

Although PDFTextStripper.writePage is protected, it calls the 
getTextPosition method of the PositionWrapper class, which is itself a 
protected method, without subclassing that class. This only works 
because both classes belong to the same package. (I think this can be 
considered a bug in the project architecture.)

>                 // Resets the average character width when we see a change in font
>                 // or a change in the font size
>                 if(lastPosition != null && ((position.getFont() != lastPosition.getTextPosition().getFont())
>                         || (position.getFontSize() != lastPosition.getTextPosition().getFontSize())))
>                 {
>                     previousAveCharWidth = -1;
>                 }

Thank you,

-- 
Sébastien
