[jira] [Commented] (PDFBOX-4376) Get text within pdf by paragraphs.

Tilman Hausherr (JIRA) Mon, 12 Nov 2018 09:18:08 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684116#comment-16684116
 ]


Tilman Hausherr commented on PDFBOX-4376:
-----------------------------------------

You can delimit paragraphs by using {{stripper.setParagraphStart("XXXX")}} and 
{{stripper.setParagraphEnd("YYYY")}}. I tested this with your file. However, 
this won't work the way you wish, because what you consider to be one paragraph 
is really five. I managed to solve this by calling 
{{stripper.setDropThreshold(5)}}. However, it is possible that other, separate 
paragraphs will now be merged together.
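
As a rough illustration of how the marked-up output could be consumed: assuming the text was obtained from {{stripper.getText(document)}} after setting the two markers as described, a small helper might cut it into paragraphs. This is only a sketch. The class and method names ({{ParagraphSplitter}}, {{splitParagraphs}}) are hypothetical and not part of PDFBox, and the sample string in {{main}} is made up rather than taken from the attached file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: splits text that was extracted
// with paragraph start/end markers into one string per paragraph.
class ParagraphSplitter {

    /**
     * Returns the text found between each start/end marker pair, in order.
     * Text outside a marker pair is ignored; line breaks inside a paragraph
     * are collapsed to single spaces.
     */
    static List<String> splitParagraphs(String text, String start, String end) {
        if (start.isEmpty() || end.isEmpty()) {
            throw new IllegalArgumentException("markers must not be empty");
        }
        List<String> paragraphs = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = text.indexOf(start, from);
            if (s < 0) {
                break; // no further paragraph start
            }
            int bodyStart = s + start.length();
            int e = text.indexOf(end, bodyStart);
            if (e < 0) {
                break; // unmatched start marker: stop
            }
            String body = text.substring(bodyStart, e).trim().replaceAll("\\s+", " ");
            if (!body.isEmpty()) {
                paragraphs.add(body);
            }
            from = e + end.length();
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up stand-in for the string returned by the text stripper.
        String extracted = "XXXXfirst line\nsecond lineYYYY\nXXXXanother paragraphYYYY\n";
        for (String paragraph : splitParagraphs(extracted, "XXXX", "YYYY")) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

The markers should be strings that cannot occur in the document text itself, otherwise the split will be wrong.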

Another alternative might be to use the structure information in the PDF. It 
is present in your PDF, but it is poorly supported by PDFBox, and nobody on 
the core team is a specialist in this topic.

You can open your file with PDFDebugger, then choose "View", "Choose internal 
structure" in the menu, then go to Root/StructTreeRoot and find out what's 
there.

> Get text within pdf by paragraphs.
> ----------------------------------
>
>                 Key: PDFBOX-4376
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4376
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Kaushlendra Singh
>            Priority: Major
>         Attachments: Sample.pdf
>
>
> There is a scenario in which I have to fetch the text within a PDF page 
> paragraph by paragraph, not line by line. All these text paragraphs are 
> built from text frames created by the InDesign editor.
> For example: in the attached PDF document, my requirement is to fetch the 
> complete text of a bounding box all at once, along with its coordinates, 
> starting from "For your card ending in: XXXX" and ending at "purchases into 
> gateways."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
