[ 
https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre updated PDFBOX-3804:
------------------------------
    Issue Type: Bug  (was: Improvement)

> Detect end of paragraphs
> ------------------------
>
>                 Key: PDFBOX-3804
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3804
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.6, 2.0.7, 3.0.0
>            Reporter: Alexandre
>              Labels: extraction, paragraph
>         Attachments: example.pdf
>
>
> Hi,
> Extracting text by paragraphs is probably the improvement most eagerly 
> awaited by PDFBox users.
> *The current text extraction approach correctly detects the end of lines, but 
> it does not correctly detect the end of paragraphs.*
> *What is a paragraph?* A paragraph is a block of text that contains one or 
> several sentences. It can start with a tab indent, but this is not mandatory. 
> A paragraph spans one or more lines but contains no carriage return (except 
> the one at its very end). A paragraph can end before the very end of a line, 
> but some paragraphs end exactly at the end of the line. If a paragraph ends 
> exactly at the end of the line, no further lines containing words follow it.
> *So the last line of a paragraph ends before reaching the very end of the 
> line, unless no further lines containing words follow it.* Do you 
> follow me? +An algorithm could use that pattern to detect paragraphs 
> properly.+ 
> In my opinion, the algorithm should use the following information:
> (*) the +width of the block+ containing the paragraph;
> (*) the precomputed width of the +first word in the next line+.
> The +width of a block+ refers to the width of the area containing the line 
> that holds the character the algorithm is evaluating at any given step.
> The algorithm runs over every character, and when it reaches the +last 
> character of a line+, it precomputes +the first word of the next line+ to 
> obtain its width.
> If +this word+ would fit on the current line after the +last character+, then 
> the algorithm concludes that the paragraph ends there (*case 1*).
> If there is no +next word+, then this is also the end of the paragraph (*case 
> 2*).
> If the +last character+ is far from the end of the block, we automatically 
> conclude that the paragraph ends there (*case 3, optional*).
> Cheers,
> A.
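
The three cases described above can be sketched in plain Java. This is only an illustration of the proposed heuristic, not PDFBox code: the `ParagraphSketch` class, the `Line` type, and the `blockRightX`, `spaceWidth`, and `farGap` parameters are all hypothetical names, and the sketch assumes each line has already been measured (where its last character ends and how wide its first word is).

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSketch {

    /** Hypothetical pre-measured line of text; not a PDFBox type. */
    static final class Line {
        final String text;
        final float endX;           // x position just after the last character
        final float firstWordWidth; // width of the first word on this line

        Line(String text, float endX, float firstWordWidth) {
            this.text = text;
            this.endX = endX;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /**
     * Decides whether `current` is the last line of a paragraph.
     * `next` is the following line, or null when there is none.
     * Pass Float.MAX_VALUE as `farGap` to disable the optional case 3.
     */
    static boolean endsParagraph(Line current, Line next, float blockRightX,
                                 float spaceWidth, float farGap) {
        if (next == null) {
            return true; // case 2: no next word at all
        }
        float remaining = blockRightX - current.endX;
        if (remaining >= farGap) {
            return true; // case 3 (optional): line stops far from the block edge
        }
        // case 1: the next line's first word (plus a space) would have fit here,
        // so the line break was not forced by lack of room
        return next.firstWordWidth + spaceWidth <= remaining;
    }

    /** Joins consecutive lines into paragraphs using endsParagraph. */
    static List<String> groupIntoParagraphs(List<Line> lines, float blockRightX,
                                            float spaceWidth, float farGap) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line line = lines.get(i);
            Line next = (i + 1 < lines.size()) ? lines.get(i + 1) : null;
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            if (endsParagraph(line, next, blockRightX, spaceWidth, farGap)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

The sketch takes the block's right edge rather than its width, which is equivalent once the block's left edge is known. Adding `spaceWidth` to the first word's width is an assumption not stated in the description: the word would need a separating space to fit on the same line.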



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
