Hi Sebastien,
It might be more flexible to inject a Comparator instance rather than the
Comparator class. Your current solution won't work for comparators whose
constructors take parameters, because newInstance() needs a no-argument
constructor. In other words, you would have:
private Comparator textPositionComparator= new TextPositionComparator();
public Comparator getTextPositionComparator() {
return textPositionComparator;
}
public void setTextPositionComparator(Comparator comparator) {
textPositionComparator = comparator;
}
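To make the difference concrete, here is a rough sketch rather than actual PDFBox code. Glyph, BandedPositionComparator and StripperSketch are made-up names standing in for TextPosition, a custom comparator and PDFTextStripper, and the band height of 5 is an arbitrary value. The comparator takes a constructor parameter, so it could not be created through Class.newInstance(), but it can be passed to the setter as an instance. It groups glyphs into horizontal bands so that small differences in y do not split a line:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for TextPosition: only the fields needed for sorting.
class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical comparator with a constructor parameter.
// Glyphs whose y values fall in the same band are treated as one line
// and ordered by x; bands are ordered by increasing y.
class BandedPositionComparator implements Comparator<Glyph> {
    private final float bandHeight;

    BandedPositionComparator(float bandHeight) {
        this.bandHeight = bandHeight;
    }

    public int compare(Glyph a, Glyph b) {
        int bandA = (int) Math.floor(a.y / bandHeight);
        int bandB = (int) Math.floor(b.y / bandHeight);
        if (bandA != bandB) {
            return Integer.compare(bandA, bandB);
        }
        return Float.compare(a.x, b.x);
    }
}

// Hypothetical stand-in for the stripper: it holds a comparator instance
// that callers can replace through the setter.
class StripperSketch {
    private Comparator<Glyph> textPositionComparator = new BandedPositionComparator(1.0f);

    public Comparator<Glyph> getTextPositionComparator() {
        return textPositionComparator;
    }

    public void setTextPositionComparator(Comparator<Glyph> comparator) {
        textPositionComparator = comparator;
    }

    public void sort(List<Glyph> glyphs) {
        Collections.sort(glyphs, getTextPositionComparator());
    }
}

class Demo {
    public static void main(String[] args) {
        StripperSketch stripper = new StripperSketch();
        // Inject a configured instance; no reflection is involved.
        stripper.setTextPositionComparator(new BandedPositionComparator(5.0f));

        List<Glyph> glyphs = new ArrayList<Glyph>(Arrays.asList(
                new Glyph("b", 20f, 100.4f),
                new Glyph("c", 5f, 120f),
                new Glyph("a", 10f, 100.0f)));
        stripper.sort(glyphs);

        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            out.append(g.text);
        }
        System.out.println(out); // prints abc
    }
}
```

I used bands instead of a plain "within some tolerance" check because a tolerance check is not transitive, and Collections.sort can reject a comparator that breaks its contract.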
What do you think?
Regards,
Raimi
On Tue, Nov 8, 2011 at 10:24 AM, Martinez, Mel - 1004 - MITLL <
[email protected]> wrote:
> Sebastien,
>
> I totally agree that this would be a good change, having run into the same
> problem when working out my own mods to the text extraction some time ago.
>
> Please create a JIRA issue proposing this at:
> https://issues.apache.org/jira/browse/PDFBOX
>
> Mel
>
>
> -----Original Message-----
> From: Sébastien Dailly [mailto:[email protected]]
> Sent: Tuesday, November 08, 2011 4:27 AM
> To: [email protected]
> Subject: PDFTextStripper : can't change the default TextPositionComparator
>
> Hello,
>
> I'm trying to use the PDFTextStripper class, but sortByPosition does
> not seem to work correctly when the characters on the same line are
> not at exactly the same y position.
>
> There is no way to replace the TextPositionComparator used in the class
> with my own, even by subclassing the PDFTextStripper class (see the note
> below).
>
> One solution is to use a getter instead of hard-coding the comparator class:
>
> > List<TextPosition> textList = charactersByArticle.get( i );
> > if( getSortByPosition() )
> > {
> > TextPositionComparator comparator = new TextPositionComparator();
> > Collections.sort( textList, comparator );
> > }
>
> becomes:
>
> > List<TextPosition> textList = charactersByArticle.get( i );
> > if( getSortByPosition() )
> > {
> > Comparator comparator = getTextPositionComparator();
> > Collections.sort( textList, comparator );
> > }
>
> with getTextPositionComparator defined as follows:
>
> > private Class<? extends Comparator> textPositionComparator = TextPositionComparator.class;
>
> > […]
>
> > /**
> > *
> > * @return The comparator for ordering text positions.
> > */
> > public Comparator getTextPositionComparator() {
> > try {
> > return textPositionComparator.newInstance();
> > } catch (final InstantiationException e) {
> > return null;
> > } catch (final IllegalAccessException e) {
> > return null;
> > }
> > }
>
> (with the appropriate setter).
>
> Note :
>
> Although PDFTextStripper.writePage is protected, it calls the
> getTextPosition method of the PositionWrapper class, which is also a
> protected method, without subclassing that class! This only works
> because both classes belong to the same package. (I think this can be
> considered a bug in the project architecture.)
>
> > // Resets the average character width when we see a change in font
> > // or a change in the font size
> > if(lastPosition != null && ((position.getFont() != lastPosition.getTextPosition().getFont())
> > || (position.getFontSize() != lastPosition.getTextPosition().getFontSize())))
> > {
> > previousAveCharWidth = -1;
> > }
>
> Thank you,
>
> --
> Sébastien
>
--
«To develop software is to build a machine simply by describing it.»
(Michael A. Jackson -- not the singer)