[ 
https://issues.apache.org/jira/browse/PDFBOX-4054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313919#comment-16313919
 ] 

Tilman Hausherr commented on PDFBOX-4054:
-----------------------------------------

After reading your text a second time, it seems that what you want is a 
heuristic that identifies structural hierarchies in a text. As mkl told the 
other person, "Recognizing paragraphs in that extract would be your job." We do 
have some paragraph detection that does not keep track of positions, but it 
isn't perfect and doesn't support columns.
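
To illustrate what such a heuristic might look like on the caller's side, here 
is a minimal sketch that groups positioned words into lines and computes a 
bounding box per line. It is not PDFBox code: the LineGrouper and Box names, 
the yTolerance parameter, and the assumption that words arrive in reading 
order are all made up for the example, and the field names simply mirror the 
xMin/yMin/xMax/yMax attributes that pdftotext -bbox-layout reports.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: groups positioned words into lines; not part of PDFBox. */
final class LineGrouper {

    /** Hypothetical box type; fields mirror poppler's xMin/yMin/xMax/yMax. */
    static final class Box {
        final String text;
        final double xMin, yMin, xMax, yMax;

        Box(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }

        /** Smallest box covering both boxes, with the texts joined by a space. */
        Box merge(Box other) {
            return new Box(text + " " + other.text,
                    Math.min(xMin, other.xMin), Math.min(yMin, other.yMin),
                    Math.max(xMax, other.xMax), Math.max(yMax, other.yMax));
        }
    }

    /**
     * Assumes words are already in reading order. A word joins the current
     * line if its top edge is within yTolerance of the line's top edge;
     * otherwise it starts a new line. Columns are NOT handled.
     */
    static List<Box> groupIntoLines(List<Box> words, double yTolerance) {
        List<Box> lines = new ArrayList<>();
        Box current = null;
        for (Box word : words) {
            if (current != null && Math.abs(word.yMin - current.yMin) <= yTolerance) {
                current = current.merge(word);
            } else {
                if (current != null) {
                    lines.add(current);
                }
                current = word;
            }
        }
        if (current != null) {
            lines.add(current);
        }
        return lines;
    }
}
```

Calling groupIntoLines with a list of word boxes and a tolerance of about one 
unit would return one merged Box per detected line. The same merge step could 
be repeated on lines to form blocks, but choosing the thresholds and dealing 
with columns is exactly the part that remains the caller's job.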

> allow to access positions of text extracted by PDFTextStripper
> --------------------------------------------------------------
>
>                 Key: PDFBOX-4054
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4054
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 1.8.13
>         Environment: any
>            Reporter: Wolfgang Fahl
>            Priority: Critical
>
> https://stackoverflow.com/questions/25109969/how-to-extract-a-paragraph-from-a-pdf-file-and-store-its-position/48119163?noredirect=1#comment83218312_48119163
> describes a need that pdftotext -bbox-layout fulfills by supplying structural 
> information for the text extraction.
> There has been no PDFBox answer for a while, so I assume such a feature is 
> missing.
> A similar approach would be a useful improvement to PDFBox and much wanted 
> for certain applications - e.g. when the position of a text on a page is 
> important for its meaning.
> The poppler xhtml approach supplies for example:
> <flow>
>   <block xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
>     <line xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
>       <word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
>       <word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
>     </line>
>   </block>
> </flow>
> flow/block/line/word is a hierarchy and you get position information for block 
> and line.
> PDFBox could supply similar information via callbacks.


