[jira] [Comment Edited] (PDFBOX-2998) Enhance the text extraction capabilities

Andreas Meier (JIRA) Tue, 06 Oct 2015 06:14:57 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944999#comment-14944999
 ]


Andreas Meier edited comment on PDFBOX-2998 at 10/6/15 1:14 PM:
----------------------------------------------------------------

The question is: when does a group of TextPosition objects form a word?
My first thought is the location of each TextPosition, but it also depends on 
the font and the font size.

In my opinion we can achieve a lot just by enhancing the current code with 
checks for font type and font size. There may be many other edge cases; 
if you know of any, let me know.

So one possible enhancement would be to add such a check to the sorting algorithm:


{code:title=TextPositionComparator.java|borderStyle=solid}
...

public class TextPositionComparator implements Comparator<TextPosition>
{
    @Override
    public int compare(TextPosition pos1, TextPosition pos2)
    {
        // only compare text that is in the same direction
        if (pos1.getDir() < pos2.getDir())
        {
            return -1;
        }
        else if (pos1.getDir() > pos2.getDir())
        {
            return 1;
        }
        
        
        // FONT TYPE AND FONT SIZE CHECK START
        if (pos1.getFontSize() != pos2.getFontSize() ||
                !pos1.getFont().getName().equals(pos2.getFont().getName()))
        {
            return -1;
        }
        // FONT TYPE AND FONT SIZE CHECK END
        
        
        // get the text direction adjusted coordinates
        float x1 = pos1.getXDirAdj();
        float x2 = pos2.getXDirAdj();
        
        float pos1YBottom = pos1.getYDirAdj();
        float pos2YBottom = pos2.getYDirAdj(); 

        ...
{code}

(BTW, in case you wonder why the code snippet compares the font name and not 
the font itself: some blanks are not represented in the ToUnicode tables of the 
PDF. In that case the PDFBox fallback solution is used, which uses other fonts 
for the missing characters.)

Please correct me if I am wrong. This is just a simple-minded idea that might 
work for some cases but break others.
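
One case worth checking in the snippet above: the font check returns -1 
whenever the fonts differ, so compare(pos1, pos2) and compare(pos2, pos1) 
would both be negative. The Comparator contract expects the sign to flip when 
the arguments are swapped, and the JDK sort may throw an 
IllegalArgumentException when it detects a violation. Below is a minimal 
sketch of a symmetric variant. It does not use PDFBox: Glyph is a hypothetical 
stand-in for TextPosition, FontAwareGlyphComparator is an illustrative name, 
and font names are assumed to be non-null. Whether font should rank ahead of 
position at all is a separate question that this sketch does not answer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for PDFBox's TextPosition. It carries only the
// properties the comparator looks at and is not part of PDFBox.
final class Glyph
{
    final float dir;
    final float fontSize;
    final String fontName; // assumed non-null in this sketch
    final float x;

    Glyph(float dir, float fontSize, String fontName, float x)
    {
        this.dir = dir;
        this.fontSize = fontSize;
        this.fontName = fontName;
        this.x = x;
    }
}

// Orders by direction, then font size, then font name, then x.
// Every key is compared symmetrically, so swapping the arguments
// flips the sign of the result.
final class FontAwareGlyphComparator implements Comparator<Glyph>
{
    @Override
    public int compare(Glyph g1, Glyph g2)
    {
        int byDir = Float.compare(g1.dir, g2.dir);
        if (byDir != 0)
        {
            return byDir;
        }

        int bySize = Float.compare(g1.fontSize, g2.fontSize);
        if (bySize != 0)
        {
            return bySize;
        }

        int byName = g1.fontName.compareTo(g2.fontName);
        if (byName != 0)
        {
            return byName;
        }

        return Float.compare(g1.x, g2.x);
    }
}

final class FontAwareGlyphComparatorDemo
{
    public static void main(String[] args)
    {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph(0f, 12f, "Helvetica", 10f));
        glyphs.add(new Glyph(0f, 10f, "Helvetica", 5f));
        glyphs.add(new Glyph(0f, 10f, "Helvetica", 2f));

        glyphs.sort(new FontAwareGlyphComparator());

        for (Glyph g : glyphs)
        {
            System.out.println(g.fontSize + " " + g.fontName + " x=" + g.x);
        }
    }
}
```

With this ordering, glyphs that share direction, font size and font name end 
up next to each other and are ordered by x within that group.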


> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: TextBehindText.pdf
>
>
> PDFBox will need some -document layout analysis tools- enhancement to the 
> current text extraction to extract text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, due to the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we 
> could use to compare our current output.
> Possible enhancements are:
> - enhance matching of text to a certain line i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets


