Owen McGovern created PDFBOX-5545:
-------------------------------------

             Summary: PDFTextStripper - Expose a setter for the 
TextPositionComparator
                 Key: PDFBOX-5545
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5545
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 3.0.4 JBIG2
            Reporter: Owen McGovern


I process a lot of medical related PDF files with a lot of superscripts, 
subscripts, out of order characters etc.

We tend to have trouble with the sortByPosition flag in PDFTextStripper.

If it's not enabled, we end up with characters which are out of order in some 
PDFs.

If we do enable it it sometimes messes up superscript and subscript positions.

Can you expose a setter for the comparator instance, so that I can try to 
correct it ? E.g. 

 
{code:java}
private Comparator<TextPosition> textPositionComparator = new 
TextPositionComparator();

    /**
     *
     * @param newTextPositionComparator
     */
    public void setTextPositionComparator(final Comparator<TextPosition> 
newTextPositionComparator) {
        this.textPositionComparator = newTextPositionComparator;
    }
 {code}
Then in the writePage() method, just use that comparator?

I want to try to implement a comparator that fixes this eg. something like this 
(in Kotlin)

 
{code:java}
import org.apache.pdfbox.text.TextPosition
import kotlin.math.abs

class TextPositionSubscriptComparator : Comparator<TextPosition> {

    override fun compare(pos1: TextPosition, pos2: TextPosition): Int {

        val textDir = pos1.dir.compareTo(pos2.dir)
        return if (textDir != 0) {
            textDir
        } else {
            val x1 = pos1.xDirAdj
            val x2 = pos2.xDirAdj
            val pos1YBottom = pos1.yDirAdj
            val pos2YBottom = pos2.yDirAdj
            val pos1YTop = pos1YBottom - pos1.heightDir
            val pos2YTop = pos2YBottom - pos2.heightDir
            val yDifference = abs(pos1YBottom - pos2YBottom)

            // Superscript / subscript detection - when height differences 
between non-blank characters
            val overlap = if (pos1.heightDir > pos2.heightDir)
                pos1.heightDir * INV_SIZE_RATIO_DIFFERENCE
            else
                pos2.heightDir * INV_SIZE_RATIO_DIFFERENCE

            if ((yDifference.toDouble() < overlap || pos2YBottom >= pos1YTop) 
&& pos2YBottom <= pos1YBottom || pos1YBottom in pos2YTop..pos2YBottom) {
                x1.compareTo(x2)
            } else {
                if (pos1YBottom < pos2YBottom) -1 else 1
            }
        }
    }

    companion object {
        private const val SIZE_RATIO_DIFFERENCE = 0.85f
        private const val INV_SIZE_RATIO_DIFFERENCE = 1f - SIZE_RATIO_DIFFERENCE
    }

}

 {code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to