Hello,
I'm trying to use the PDFTextStripper class, but sortByPosition does
not seem to work correctly when the characters on the same line are
not at exactly the same y position.
There is no way to replace the TextPositionComparator used in the class
with my own, even by subclassing PDFTextStripper (see the note below).
One solution is to use a getter instead of a hard-coded link between the classes:
    List<TextPosition> textList = charactersByArticle.get( i );
    if( getSortByPosition() )
    {
        TextPositionComparator comparator = new TextPositionComparator();
        Collections.sort( textList, comparator );
    }
becomes:
    List<TextPosition> textList = charactersByArticle.get( i );
    if( getSortByPosition() )
    {
        Comparator comparator = getTextPositionComparator();
        Collections.sort( textList, comparator );
    }
with getTextPositionComparator defined as follows:
    private Class<? extends Comparator> textPositionComparator =
            TextPositionComparator.class;
[…]
    /**
     * @return The comparator used for ordering text positions.
     */
    public Comparator getTextPositionComparator() {
        try {
            return textPositionComparator.newInstance();
        } catch (final InstantiationException e) {
            // Fail loudly: a null comparator would make Collections.sort
            // fall back to natural ordering.
            throw new IllegalStateException(e);
        } catch (final IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
(with the appropriate setter).
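
To illustrate what such a hook would allow, here is a minimal, self-contained sketch. It does not use the PDFBox classes: Glyph and SketchStripper are hypothetical stand-ins for TextPosition and PDFTextStripper, and the hook is an overridable method returning a Comparator instance rather than the Class-based field proposed above. The replacement comparator groups y values into bands of an assumed height before comparing x, which keeps the ordering consistent; two characters that straddle a band boundary would still be split, so the band height would need tuning.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for TextPosition: only the fields the sort needs.
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical stand-in for the stripper, showing only the comparator hook.
class SketchStripper {
    // Default ordering: strict comparison on y, then on x.
    protected Comparator<Glyph> getTextPositionComparator() {
        return Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x);
    }

    String sortAndJoin(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        Collections.sort(sorted, getTextPositionComparator());
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }
}

// Subclass that swaps in a line-tolerant ordering through the hook.
class BandedStripper extends SketchStripper {
    private final float bandHeight; // assumed tolerance, in the same unit as y

    BandedStripper(float bandHeight) {
        this.bandHeight = bandHeight;
    }

    @Override
    protected Comparator<Glyph> getTextPositionComparator() {
        // Glyphs whose y rounds to the same band count as one line.
        return Comparator.comparingInt((Glyph g) -> Math.round(g.y / bandHeight))
                .thenComparingDouble(g -> g.x);
    }
}

class ComparatorHookDemo {
    // Two lines of two characters each, with slightly uneven baselines.
    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph("a", 10f, 100.4f),
                new Glyph("b", 20f, 100.0f),
                new Glyph("c", 10f, 112.1f),
                new Glyph("d", 20f, 111.8f));
    }

    public static void main(String[] args) {
        System.out.println(new SketchStripper().sortAndJoin(sample()));   // prints "badc"
        System.out.println(new BandedStripper(5f).sortAndJoin(sample())); // prints "abcd"
    }
}
```

With the strict ordering, the small y offsets reverse the characters within each line; with the banded ordering, the reading order is preserved.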
Note:
Although PDFTextStripper.writePage is protected, it calls the
getTextPosition method of the PositionWrapper class, which is a
protected method, without subclassing that class! This only works
because the two classes belong to the same package, so a subclass of
PDFTextStripper in another package cannot make the same call. (I think
this can be considered a bug in the project architecture.)
    // Resets the average character width when we see a change in font
    // or a change in the font size
    if(lastPosition != null
            && ((position.getFont() != lastPosition.getTextPosition().getFont())
            || (position.getFontSize() != lastPosition.getTextPosition().getFontSize())))
    {
        previousAveCharWidth = -1;
    }
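
The access rule behind this can be shown with a small sketch. The class names are hypothetical, not the PDFBox ones. In Java, a protected member is also accessible to any class in the same package, so the call below compiles even though SamePackageCaller does not extend PositionHolder. If the caller were moved to another package, the same call would be rejected by the compiler.

```java
// Hypothetical stand-in for PositionWrapper.
class PositionHolder {
    private final String position;

    PositionHolder(String position) {
        this.position = position;
    }

    // Protected: visible to subclasses AND to classes in the same package.
    protected String getTextPosition() {
        return position;
    }
}

// Not a subclass of PositionHolder, but declared in the same package.
class SamePackageCaller {
    static String read(PositionHolder holder) {
        // Allowed only because both classes share a package.
        return holder.getTextPosition();
    }

    public static void main(String[] args) {
        System.out.println(read(new PositionHolder("x=10,y=100")));
    }
}
```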
Thank you,
--
Sébastien