[ 
https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre updated PDFBOX-3804:
------------------------------
    Description: 
Hi,

Extracting text by paragraph is probably the improvement most eagerly awaited 
by PDFBox users.


*The current text extraction approach correctly detects the end of each line, 
but it does not correctly detect the end of a paragraph.* Indeed, a carriage 
return character is added after every line.

*What is a paragraph?* A paragraph is a block of text that contains one or 
more sentences. It may start with a tab (indentation), but this is not 
mandatory. A paragraph spans one or more lines and contains no carriage return 
(except the one at its very end). The last line of a paragraph usually ends 
before the right edge of the block, although some paragraphs end exactly at 
that edge. Detecting the paragraphs whose last line ends early is a reasonable 
goal.

*So, the last line of a paragraph usually ends before reaching the very end of 
the line.* +An algorithm could use that pattern to detect paragraphs properly.+ 

In my opinion, the algorithm should never add a carriage return except when an 
end of paragraph is detected.

In my opinion, the algorithm should use the following information:
(*) the width of the block containing the paragraph;
(*) the padding and margin of that block;
(*) the precomputed width of the next word.

The width of a block is the width of the page minus the left and right margins 
if the PDF has a single column, or the width of one column if the PDF has 
several columns (two, for example).
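
As a rough illustration of the block width described here, the following Java 
sketch computes it from page geometry. Everything in it is an assumption made 
for the example: the class and method names are hypothetical, the columns are 
assumed to be of equal width, and the gutter between columns is a parameter I 
introduce because a real multi-column layout needs some spacing. In PDFBox the 
page width could be read from the page's media box, but detecting margins and 
the column count is left open.

```java
final class BlockWidthSketch {

    /**
     * Width available to one text block (hypothetical helper).
     * Assumes equal-width columns separated by a fixed gutter.
     */
    static float blockWidth(float pageWidth, float leftMargin, float rightMargin,
                            int columns, float gutter) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        // Space left once both page margins are removed.
        float usable = pageWidth - leftMargin - rightMargin;
        // Remove the gutters between columns, then share the rest equally.
        return (usable - gutter * (columns - 1)) / columns;
    }
}
```

With a single column the gutter term vanishes and the result is simply the 
page width minus the two margins.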

The algorithm runs over every character. When it reaches the last character of 
a line, it precomputes the width of the first word of the next line. If this 
word would have fit in the space remaining on the current line, the algorithm 
concludes that the current line ends a paragraph.
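
A minimal sketch of this rule in plain Java, not tied to the PDFBox API, could 
look like the code below. The `Line` class, the `joinParagraphs` method and the 
`spaceWidth` parameter are hypothetical names introduced only for illustration. 
The sketch assumes that line widths and first-word widths have already been 
measured in the same units as the block width, and it works line by line 
rather than character by character to stay short.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSketch {

    /** One extracted line: its text, its width, and the width of its first word. */
    static final class Line {
        final String text;
        final float width;
        final float firstWordWidth;

        Line(String text, float width, float firstWordWidth) {
            this.text = text;
            this.width = width;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /** Joins lines into paragraphs using the "next word would have fit" rule. */
    static List<String> joinParagraphs(List<Line> lines, float blockWidth, float spaceWidth) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line line = lines.get(i);
            if (current.length() > 0) {
                current.append(' '); // a wrapped line break becomes a plain space
            }
            current.append(line.text);

            boolean isLastLine = (i == lines.size() - 1);
            // If the first word of the next line (plus a space) would have fit
            // in the room left on this line, the break was not forced by
            // wrapping, so this line is taken to end a paragraph.
            boolean endsParagraph = isLastLine
                    || line.width + spaceWidth + lines.get(i + 1).firstWordWidth <= blockWidth;

            if (endsParagraph) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

As noted above, this rule cannot detect a paragraph whose last line happens to 
fill the whole block width; such paragraphs are merged with the following one.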

Cheers,
A.


> Detect end of paragraphs
> ------------------------
>
>                 Key: PDFBOX-3804
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3804
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.6, 2.0.7, 3.0.0
>            Reporter: Alexandre
>            Priority: Minor
>              Labels: extraction, paragraph
>         Attachments: example.pdf
>


