[jira] [Comment Edited] (PDFBOX-2998) Enhance the text extraction capabilities

Andreas Meier (JIRA) Tue, 06 Oct 2015 23:03:55 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946324#comment-14946324 ]


Andreas Meier edited comment on PDFBOX-2998 at 10/7/15 6:03 AM:
----------------------------------------------------------------

I just wanted to fuel the discussion with my snippet.
My intention is not to provide code that breaks an already great extraction 
engine ;)

{quote}
I'd even start a step before that
{quote}
That depends on what is possible at the lower levels...

I don't know if I am the right person to take part in that discussion any 
further, but I will try to offer a "simple view" from a higher level to 
address the problem:
 
- Might it be useful to keep some information, such as the original string 
"(Hello World)", in a (meta-)information store, so that PDFBox can later take 
the single characters and form the words again? (No font type or size needed, 
just simple character matching based on position and rotation...)
- Would it make sense to check the font type and size and handle only special 
cases such as chemical names? [~tboehme], do you know of any other reasons for 
a different font or size within a word?
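
To make the first idea more concrete, here is a minimal sketch in plain Java of 
character matching based on position and rotation only. It does not use the 
PDFBox API: the Glyph class, its fields, the groupIntoWords method and the 
tolerance parameter are all hypothetical names made up for illustration. It 
also assumes the glyphs already arrive in reading order and only follows the 
horizontal advance, so rotated runs would need their own advance direction.

```java
import java.util.ArrayList;
import java.util.List;

public class WordGrouper {

    /** Hypothetical holder for one extracted character (not a PDFBox class). */
    public static final class Glyph {
        final char ch;
        final float x;        // left edge of the glyph
        final float y;        // baseline position
        final float width;    // advance width
        final int rotation;   // text direction in degrees

        public Glyph(char ch, float x, float y, float width, int rotation) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.rotation = rotation;
        }
    }

    /**
     * Joins glyphs (assumed to be in reading order) into words.
     * A glyph continues the current word only if it has the same rotation,
     * sits on the same baseline, and starts where the previous glyph ended,
     * each within the given tolerance. Font name and size are ignored.
     */
    public static List<String> groupIntoWords(List<Glyph> glyphs, float tolerance) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            boolean sameWord = prev != null
                    && g.rotation == prev.rotation
                    && Math.abs(g.y - prev.y) <= tolerance
                    && Math.abs(g.x - (prev.x + prev.width)) <= tolerance;
            if (!sameWord && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        float x = 0f;
        for (char c : "Hello".toCharArray()) {
            glyphs.add(new Glyph(c, x, 100f, 5f, 0));
            x += 5f;
        }
        x += 3f; // gap larger than the tolerance, so a new word starts
        for (char c : "World".toCharArray()) {
            glyphs.add(new Glyph(c, x, 100f, 5f, 0));
            x += 5f;
        }
        System.out.println(groupIntoWords(glyphs, 1.0f)); // prints [Hello, World]
    }
}
```

Because the font is never consulted, a subscript in a chemical name would only 
split a word here if its baseline shift exceeds the tolerance, which is where 
the second question comes in.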





> Enhance the text extraction capabilities
> ----------------------------------------
>
>                 Key: PDFBOX-2998
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2998
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Andreas Meier
>         Attachments: TextBehindText.pdf
>
>
> PDFBox will need some enhancements to the current text extraction to extract 
> text correctly.
> At the moment the text of a document is extracted using the position of 
> single characters.
> This may lead to wrong results, depending on the format of the file.
> There are good tools such as https://code.google.com/p/lapdftext which we 
> could use to compare against our current output.
> Possible enhancements are:
> - enhance matching of text to a certain line, i.e. don't mix up text from 
> different lines
> - better handling of rotated text
> - handling of vertical text
> - ability to get additional text properties such as font, font size ...
> Some of these are already logged as individual tickets.


