[jira] [Created] (PDFBOX-4054) allow to access positions of text extracted by

Wolfgang Fahl (JIRA) Fri, 05 Jan 2018 13:01:39 -0800

Wolfgang Fahl created PDFBOX-4054:
-------------------------------------

             Summary: allow to access positions of text extracted by 
                 Key: PDFBOX-4054
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4054
             Project: PDFBox
          Issue Type: Improvement
    Affects Versions: 1.8.13
         Environment: any
            Reporter: Wolfgang Fahl
            Priority: Critical



https://stackoverflow.com/questions/25109969/how-to-extract-a-paragraph-from-a-pdf-file-and-store-its-position/48119163?noredirect=1#comment83218312_48119163

describes a need that pdftotext -bbox-layout fulfills by supplying structural 
information for the text extraction. 

There has been no PDFBox answer for a while so I assume such a feature is 
missing.

A similar approach would be a useful improvement to PDFBox and much wanted for 
certain applications - e.g. when the position of a text on a page is important 
for its meaning.

The poppler xhtml approach supplies, for example:
<flow>
  <block xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
    <line xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
      <word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
      <word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
    </line>
  </block>
</flow>
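
To illustrate how an application might consume such output, here is a minimal 
Java sketch that reads a fragment like the one above with the JDK's DOM parser 
and collects each word with its bounding box. The class and method names 
(BBoxWords, Word, parseWords) are made up for this example and are not part of 
poppler or PDFBox. It assumes a bare <flow> fragment; a complete file written 
by pdftotext would probably need extra handling for its DOCTYPE.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of poppler or PDFBox.
class BBoxWords {

    /** One <word> element: its text plus the bounding box attributes. */
    static final class Word {
        final String text;
        final double xMin, yMin, xMax, yMax;

        Word(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }
    }

    /** Parses an XML fragment and returns all <word> elements in document order. */
    static List<Word> parseWords(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("word");
            List<Word> words = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                words.add(new Word(
                        e.getTextContent(),
                        Double.parseDouble(e.getAttribute("xMin")),
                        Double.parseDouble(e.getAttribute("yMin")),
                        Double.parseDouble(e.getAttribute("xMax")),
                        Double.parseDouble(e.getAttribute("yMax"))));
            }
            return words;
        } catch (Exception ex) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("could not parse bbox fragment", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<flow><block><line>"
                + "<word xMin=\"333.000000\" yMin=\"270.150000\" "
                + "xMax=\"342.896500\" yMax=\"275.150000\">Your</word>"
                + "</line></block></flow>";
        for (Word w : parseWords(xml)) {
            System.out.println(w.text + " @ " + w.xMin + "," + w.yMin
                    + " - " + w.xMax + "," + w.yMax);
        }
    }
}
```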

flow/block/line/word is a hierarchy, and you get position information for block 
and line, not just for each word.
PDFBox could supply similar information via callbacks.
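
To make the callback idea concrete, here is a rough Java sketch of what such 
an interface could look like. Everything in it (LayoutCallbacks, Box, 
PositionedWord, LayoutListener, emitBlock) is hypothetical and only meant to 
show the shape of the proposal; none of it is existing PDFBox API. The driver 
takes the words of one block, already grouped into lines, reports each word, 
and derives the line and block boxes as the union of the boxes they contain.

```java
import java.util.List;

// Hypothetical sketch of the proposed callbacks; not existing PDFBox API.
class LayoutCallbacks {

    /** Axis-aligned bounding box, using the same four values as the bbox output. */
    static final class Box {
        final double xMin, yMin, xMax, yMax;

        Box(double xMin, double yMin, double xMax, double yMax) {
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }

        /** Smallest box that contains both this box and the other one. */
        Box union(Box o) {
            return new Box(Math.min(xMin, o.xMin), Math.min(yMin, o.yMin),
                           Math.max(xMax, o.xMax), Math.max(yMax, o.yMax));
        }
    }

    /** A word together with its position on the page. */
    static final class PositionedWord {
        final String text;
        final Box box;

        PositionedWord(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Callbacks an application would implement to receive structure and positions. */
    interface LayoutListener {
        void onWord(String text, Box box);
        void onLine(Box box);
        void onBlock(Box box);
    }

    /** Reports every word, then each line box, and finally the box of the whole block. */
    static void emitBlock(List<List<PositionedWord>> lines, LayoutListener listener) {
        Box blockBox = null;
        for (List<PositionedWord> line : lines) {
            Box lineBox = null;
            for (PositionedWord w : line) {
                listener.onWord(w.text, w.box);
                lineBox = (lineBox == null) ? w.box : lineBox.union(w.box);
            }
            if (lineBox != null) {
                listener.onLine(lineBox);
                blockBox = (blockBox == null) ? lineBox : blockBox.union(lineBox);
            }
        }
        if (blockBox != null) {
            listener.onBlock(blockBox);
        }
    }
}
```

An application would implement LayoutListener and decide for itself what to do 
with the boxes, e.g. write them out in the same form as the XML above.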




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
