[jira] [Commented] (PDFBOX-4054) allow to access positions of text extracted by PDFTextStripper

Wolfgang Fahl (JIRA) Fri, 05 Jan 2018 13:17:19 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-4054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313898#comment-16313898
 ]


Wolfgang Fahl commented on PDFBOX-4054:
---------------------------------------

[poppler][1]'s

    pdftotext -bbox-layout

creates XHTML output with a flow/block/line/word structure, e.g.

    <flow>
      <block xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
        <line xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
          <word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
          <word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
        </line>
      </block>
    </flow>
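
To illustrate how this hierarchy can be consumed, here is a minimal Python sketch that walks flow/block/line/word and yields each line's text together with its bounding box. It uses only the standard library. The `<html>` wrapper around the fragment, the `SAMPLE` constant and the helper names `bbox` and `lines_with_boxes` are assumptions made for the example; they are not part of poppler or PDFBox.

```python
import xml.etree.ElementTree as ET

# Fragment from the example above. The <html>/<body> wrapper is an assumption
# so that the snippet parses on its own with the XHTML namespace applied.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<flow>
  <block xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
    <line xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
      <word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
      <word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
    </line>
  </block>
</flow>
</body></html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}


def bbox(element):
    """Return (xMin, yMin, xMax, yMax) of an element as floats."""
    return tuple(float(element.get(k)) for k in ("xMin", "yMin", "xMax", "yMax"))


def lines_with_boxes(xhtml):
    """Yield (text, bbox) for every line, joining its words with spaces."""
    root = ET.fromstring(xhtml)
    for flow in root.iterfind(".//x:flow", NS):
        for block in flow.iterfind("x:block", NS):
            for line in block.iterfind("x:line", NS):
                words = [w.text or "" for w in line.iterfind("x:word", NS)]
                yield " ".join(words), bbox(line)


for text, box in lines_with_boxes(SAMPLE):
    print(text, box)
```

The same loop could collect block-level boxes instead of line-level ones by calling `bbox(block)`.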

You can select the details, e.g. with [xmlstarlet][2]:

    xmlstarlet sel \
      -N x="http://www.w3.org/1999/xhtml" \
      -t \
      -m "//x:word" \
      -v "concat(.,'-',./@xMin,'-',./@yMin,'-',./@xMax,'-',./@yMax)" \
      -n \
        input.html

which prints each word with its position, one word per output line 
(adjust the expression to your needs):

    Your-333.000000-270.150000-343.503500-275.150000
    Bank-347.707500-271.150000-360.991000-276.150000
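
Where xmlstarlet is not available, roughly the same selection can be sketched in Python with the standard library. This is only an illustration: the function name `word_positions` and the wrapped `SAMPLE` fragment are assumptions for the example, and the printed values follow that fragment, so they can differ from the output quoted above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical minimal input; real pdftotext output contains more elements.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body><flow><block><line>
<word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
<word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
</line></block></flow></body></html>"""


def word_positions(xhtml):
    """Return one 'word-xMin-yMin-xMax-yMax' string per <word> element."""
    root = ET.fromstring(xhtml)
    rows = []
    # iter() with the namespace in braces matches <word> at any depth,
    # like the //x:word match in the xmlstarlet call.
    for word in root.iter(f"{{{XHTML_NS}}}word"):
        coords = [word.get(k, "") for k in ("xMin", "yMin", "xMax", "yMax")]
        rows.append("-".join([word.text or ""] + coords))
    return rows


for row in word_positions(SAMPLE):
    print(row)
```

The coordinates are kept as strings here so the output preserves the formatting of the attributes; convert them with `float()` if you need to compute with them.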


  [1]: https://poppler.freedesktop.org/ "poppler"
  [2]: http://manpages.ubuntu.com/manpages/zesty/man1/xmlstarlet.1.html "xmlstarlet"

> allow to access positions of text extracted by PDFTextStripper
> --------------------------------------------------------------
>
>                 Key: PDFBOX-4054
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4054
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 1.8.13
>         Environment: any
>            Reporter: Wolfgang Fahl
>            Priority: Critical
>
> https://stackoverflow.com/questions/25109969/how-to-extract-a-paragraph-from-a-pdf-file-and-store-its-position/48119163?noredirect=1#comment83218312_48119163
> describes a need that pdftotext -bbox-layout fulfills by supplying structural 
> information 
> for the text extraction. 
> There has been no PDFBox answer for a while so I assume such a feature is 
> missing.
> A similar approach would be a useful improvement to PDFBox and much wanted 
> for certain applications - e.g. when the position of a text on a page is 
> important for its meaning.
> The poppler xhtml approach supplies for example:
> <flow>
>   <block xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
>     <line xMin="333.000000" yMin="270.150000" xMax="360.004000" yMax="275.150000">
>       <word xMin="333.000000" yMin="270.150000" xMax="342.896500" yMax="275.150000">Your</word>
>       <word xMin="347.047500" yMin="270.150000" xMax="360.004000" yMax="275.150000">Bank</word>
>     </line>
>   </block>
> </flow>
> flow/block/line/word is a hierarchy and you get position information for block 
> and line.
> PDFBox could supply similar information via callbacks.



