[jira] [Created] (PDFBOX-3804) Detect end of paragraphs

Alexandre (JIRA) Mon, 22 May 2017 07:51:05 -0700

Alexandre created PDFBOX-3804:
---------------------------------

             Summary: Detect end of paragraphs
                 Key: PDFBOX-3804
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3804
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 2.0.6, 3.0.0
            Reporter: Alexandre
            Priority: Minor
         Attachments: example.pdf


Hi,

Extracting text by paragraph is probably the improvement most eagerly awaited 
by PDFBox users.


*The current text extraction approach correctly detects the end of each line, 
but it does not correctly detect the end of a paragraph.* Indeed, a carriage 
return character is added after every line.

*What is a paragraph?* A paragraph is a piece of text that contains one or 
more sentences. It may start with a tab (an indentation), but this is not 
mandatory. A paragraph spans one or more lines but contains no carriage return 
(except the one at the very end). A paragraph can end before the very end of a 
line, although some paragraphs end exactly at the end of a line.

*So, in most cases, the last line of a paragraph ends before reaching the very 
end of the line.* +We could use that pattern to detect paragraphs properly.+

In my opinion, the algorithm should never add a carriage return except when an 
end of paragraph is detected.

In my opinion, the algorithm should use the following information:
(*) the width of the block containing the paragraph;
(*) the padding and margins of that block;
(*) the precomputed width of the next word.

The width of a block is the width of the page minus the left and right margins 
if the PDF has a single column, or the width of one column if the PDF has 
several columns (two, for example).
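
To make the block width concrete, here is a minimal sketch in plain Java. It 
is not PDFBox code: the class and method names are hypothetical, and it 
assumes equal-width columns separated by a fixed gutter, which is something 
the description above does not specify.

```java
// Hypothetical helper for illustration only; not part of the PDFBox API.
final class BlockWidthSketch {
    private BlockWidthSketch() {}

    /**
     * Width available to one text block, in the same units as pageWidth.
     * Assumption: all columns have the same width and are separated by
     * a fixed gutter. With a single column the gutter is ignored.
     */
    static float blockWidth(float pageWidth, float leftMargin, float rightMargin,
                            int columns, float gutter) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        // Space left once both page margins are removed.
        float usable = pageWidth - leftMargin - rightMargin;
        // Remove the gutters between columns, then split evenly.
        float width = (usable - gutter * (columns - 1)) / columns;
        if (width <= 0) {
            throw new IllegalArgumentException("margins and gutters leave no room for text");
        }
        return width;
    }
}
```

For example, a page 595 units wide with 72-unit margins gives a block width of 
451 units with one column, and 215 units per column with two columns and a 
21-unit gutter.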

The algorithm runs over every character. When it reaches the last character of 
a line, it precomputes the width of the first word of the next line. If this 
word would have fit at the end of the current line, then we conclude that the 
current line ends a paragraph.
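
The check described above could be sketched as follows. This is only an 
illustration in plain Java, not a patch against PDFTextStripper: the class and 
method names are hypothetical, lines are assumed to be already extracted as 
strings, and the width function is supplied by the caller (in a real 
implementation it would come from the font metrics).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch for illustration only; not part of the PDFBox API.
final class ParagraphEndSketch {
    private ParagraphEndSketch() {}

    /** First whitespace-delimited word of a line, or "" if the line is blank. */
    static String firstWord(String line) {
        String t = line.trim();
        int sp = t.indexOf(' ');
        return sp < 0 ? t : t.substring(0, sp);
    }

    /**
     * True if the next line's first word (plus a separating space) would
     * have fit in the space left on the current line, which suggests the
     * line break was intentional, i.e. an end of paragraph.
     */
    static boolean endsParagraph(String line, String nextWord, double blockWidth,
                                 ToDoubleFunction<String> widthOf) {
        if (nextWord.isEmpty()) {
            return true; // a blank line follows: treat as a paragraph break
        }
        double remaining = blockWidth - widthOf.applyAsDouble(line);
        double needed = widthOf.applyAsDouble(" " + nextWord);
        return needed <= remaining;
    }

    /** Joins consecutive lines into paragraphs using the check above. */
    static List<String> joinParagraphs(List<String> lines, double blockWidth,
                                       ToDoubleFunction<String> widthOf) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // blank lines carry no text
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line);
            if (i + 1 < lines.size()
                    && endsParagraph(line, firstWord(lines.get(i + 1)), blockWidth, widthOf)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString()); // flush the last paragraph
        }
        return paragraphs;
    }
}
```

As noted earlier, a paragraph whose last line fills the whole block cannot be 
detected this way, so such a paragraph would be merged with the following one.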

Best,
A.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

