[ https://issues.apache.org/jira/browse/PDFBOX-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-5545:
---------------------------------------
Affects Version/s: 3.0.0 PDFBox
(was: 3.0.4 JBIG2)
> PDFTextStripper - Expose a setter for the TextPositionComparator
> ----------------------------------------------------------------
>
> Key: PDFBOX-5545
> URL: https://issues.apache.org/jira/browse/PDFBOX-5545
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 3.0.0 PDFBox
> Reporter: Owen McGovern
> Priority: Major
>
> I process a lot of medical-related PDF files with many superscripts,
> subscripts, out-of-order characters, etc.
> We tend to have trouble with the sortByPosition flag in PDFTextStripper.
> If it's not enabled, we end up with characters that are out of order in some
> PDFs.
> If we do enable it, it sometimes messes up superscript and subscript positions.
> Can you expose a setter for the comparator instance, so that I can try to
> correct it? E.g.
>
> {code:java}
> private Comparator<TextPosition> textPositionComparator = new TextPositionComparator();
>
> /**
>  * Sets the comparator used to sort text positions.
>  *
>  * @param newTextPositionComparator the comparator to use
>  */
> public void setTextPositionComparator(final Comparator<TextPosition> newTextPositionComparator) {
>     this.textPositionComparator = newTextPositionComparator;
> }
> {code}
> Then in the writePage() method, just use that comparator?
>
> Users can then potentially inject their own comparator implementation.
> I want to try to implement a comparator that fixes sorting with
> subscript/superscript tolerances, e.g. something like this (in Kotlin and
> completely untested so far...)
>
> {code:java}
> import org.apache.pdfbox.text.TextPosition
> import kotlin.math.abs
>
> class TextPositionSubscriptComparator : Comparator<TextPosition> {
>     override fun compare(pos1: TextPosition, pos2: TextPosition): Int {
>         val textDir = pos1.dir.compareTo(pos2.dir)
>         return if (textDir != 0) {
>             textDir
>         } else {
>             val x1 = pos1.xDirAdj
>             val x2 = pos2.xDirAdj
>             val pos1YBottom = pos1.yDirAdj
>             val pos2YBottom = pos2.yDirAdj
>             val pos1YTop = pos1YBottom - pos1.heightDir
>             val pos2YTop = pos2YBottom - pos2.heightDir
>             val yDifference = abs(pos1YBottom - pos2YBottom)
>             // Superscript / subscript tolerance by ratio of the character height
>             val overlap = if (pos1.heightDir > pos2.heightDir)
>                 pos1.heightDir * INV_SIZE_RATIO_DIFFERENCE
>             else
>                 pos2.heightDir * INV_SIZE_RATIO_DIFFERENCE
>             if ((yDifference.toDouble() < overlap || pos2YBottom >= pos1YTop)
>                 && pos2YBottom <= pos1YBottom || pos1YBottom in pos2YTop..pos2YBottom) {
>                 x1.compareTo(x2)
>             } else {
>                 if (pos1YBottom < pos2YBottom) -1 else 1
>             }
>         }
>     }
>
>     companion object {
>         private const val SIZE_RATIO_DIFFERENCE = 0.85f
>         private const val INV_SIZE_RATIO_DIFFERENCE = 1f - SIZE_RATIO_DIFFERENCE
>     }
> }
> {code}
>
> It could greatly help if the sorting comparator was configurable.
>
> regards,
> Owen
>
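
The request quoted above can be illustrated without PDFBox on the classpath. What follows is a minimal Java sketch under stated assumptions, not PDFBox code: Glyph stands in for TextPosition, SketchStripper stands in for PDFTextStripper with the kind of setter the issue asks for, and the 0.15 ratio is taken from the Kotlin draft (1 - 0.85). All of these names are hypothetical, and the same-line test is a simplified reading of the draft rather than a copy of it. Coordinates follow the draft's convention that y grows downward, so a superscript has a smaller baseline value than its neighbours.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-ins only: none of these types are PDFBox classes.
class ToleranceSortSketch {

    // Share of the taller glyph's height tolerated between baselines (1 - 0.85 in the draft).
    static final float TOLERANCE_RATIO = 0.15f;

    /** Stand-in for TextPosition; y grows downward and yBottom is the baseline. */
    static final class Glyph {
        final String text;
        final float x;
        final float yBottom;
        final float height;

        Glyph(String text, float x, float yBottom, float height) {
            this.text = text;
            this.x = x;
            this.yBottom = yBottom;
            this.height = height;
        }

        float yTop() {
            return yBottom - height;
        }
    }

    /** Any baseline difference starts a new line; ties are ordered by x. */
    static final Comparator<Glyph> STRICT = (a, b) -> {
        int byBaseline = Float.compare(a.yBottom, b.yBottom);
        return byBaseline != 0 ? byBaseline : Float.compare(a.x, b.x);
    };

    /** Glyphs judged to share a line are ordered by x, otherwise by baseline. */
    static final Comparator<Glyph> TOLERANT = (a, b) ->
            sameLine(a, b) ? Float.compare(a.x, b.x) : Float.compare(a.yBottom, b.yBottom);

    /** Same line if baselines are close, or one baseline lies within the other glyph's extent. */
    static boolean sameLine(Glyph a, Glyph b) {
        float tolerance = Math.max(a.height, b.height) * TOLERANCE_RATIO;
        boolean closeBaselines = Math.abs(a.yBottom - b.yBottom) < tolerance;
        boolean bBaselineInsideA = b.yBottom >= a.yTop() && b.yBottom <= a.yBottom;
        boolean aBaselineInsideB = a.yBottom >= b.yTop() && a.yBottom <= b.yBottom;
        return closeBaselines || bBaselineInsideA || aBaselineInsideB;
    }

    /** Stand-in for the stripper, holding a replaceable comparator. */
    static final class SketchStripper {
        private Comparator<Glyph> comparator = STRICT;

        void setComparator(Comparator<Glyph> newComparator) {
            this.comparator = Objects.requireNonNull(newComparator);
        }

        /** Sorts a copy of the glyphs with the current comparator and joins their text. */
        String sortedText(List<Glyph> glyphs) {
            List<Glyph> copy = new ArrayList<>(glyphs);
            copy.sort(comparator);
            StringBuilder out = new StringBuilder();
            for (Glyph g : copy) {
                out.append(g.text);
            }
            return out.toString();
        }
    }

    /** "x", a raised "2", then "y", deliberately supplied out of order. */
    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph("y", 10f, 100f, 10f),
                new Glyph("2", 6f, 95f, 6f),
                new Glyph("x", 0f, 100f, 10f));
    }

    public static void main(String[] args) {
        SketchStripper stripper = new SketchStripper();
        System.out.println(stripper.sortedText(sample())); // prints "2xy"
        stripper.setComparator(TOLERANT);
        System.out.println(stripper.sortedText(sample())); // prints "x2y"
    }
}
```

With the strict ordering the raised "2" sorts ahead of its own line; with the tolerant ordering it stays between "x" and "y". A tolerance-based comparison like this is not guaranteed to be transitive across a whole page, so a real comparator would need testing against the sort implementation it is used with.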
--
This message was sent by Atlassian Jira
(v8.20.10#820010)