[jira] [Updated] (PDFBOX-3804) Detect end of paragraphs

Alexandre (JIRA) Mon, 22 May 2017 08:24:31 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alexandre updated PDFBOX-3804:
------------------------------
    Description: 
Hi,

Extracting text by paragraphs is probably the improvement that PDFBox users 
are most looking forward to.


*The current text extraction approach correctly detects the end of lines, but 
it does not correctly detect the end of paragraphs.*

*What is a paragraph?* A paragraph is a piece of text that contains one or 
more sentences. It may start with a tab, but this is not mandatory. A 
paragraph spans one or more lines but contains no carriage return (except the 
one at its very end). A paragraph can end before the very end of a line, but 
some paragraphs end exactly at the end of a line. If a paragraph ends exactly 
at the end of a line, no further lines containing words follow it.

*So the last line of a paragraph ends before reaching the very end of the 
line, unless no further lines containing words follow it.* Do you follow me? 
+An algorithm could use that pattern to detect paragraphs properly.+ 

In my opinion, the algorithm should use the following information:
(*) the width of the block containing the paragraph;
(*) the padding and margins of that block;
(*) the precomputed width of the next word.

The width of a block is the width of the page minus the left and right 
margins if the PDF has a single column, or the width of one column if the PDF 
has several columns (two, for example).
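
To make the block-width rule concrete, here is a minimal Java sketch. It is 
not PDFBox code: the class name, the method name and the gutter parameter (the 
gap between columns) are assumptions for illustration, and it assumes equal 
column widths. In PDFBox the page width could presumably be read from 
PDPage.getMediaBox().getWidth(); margins and column count would still have to 
be detected separately.

```java
// Hypothetical helper, not part of PDFBox.
final class BlockWidthSketch {

    /**
     * Width of one text block, in the same unit as the inputs (e.g. points).
     * With columns == 1 this is the page width minus both margins.
     * With several columns, the usable width is split evenly after removing
     * the gutters between columns (equal-width columns are an assumption).
     */
    static float blockWidth(float pageWidth, float leftMargin, float rightMargin,
                            int columns, float gutter) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        float usable = pageWidth - leftMargin - rightMargin;
        return (usable - gutter * (columns - 1)) / columns;
    }

    public static void main(String[] args) {
        // US Letter page (612 pt wide) with 72 pt margins.
        System.out.println(blockWidth(612f, 72f, 72f, 1, 0f));  // one column
        System.out.println(blockWidth(612f, 72f, 72f, 2, 18f)); // two columns
    }
}
```

For a 612 pt page with 72 pt margins this gives 468 pt for one column, and 
225 pt per column for two columns separated by an 18 pt gutter.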

The algorithm runs over every character. When it reaches the +last character 
of a line+, it precomputes +the first word of the next line+ to obtain its 
width.
If +this word+ would fit on the previous line after the +last character+, the 
algorithm concludes that the paragraph has ended (*case 1*).
If there is no +next word+, this is also the end of the paragraph (*case 
2*).
If the +last character+ is far from the end of the block, we automatically 
conclude that the paragraph has ended (*case 3*).
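
The three cases could be sketched in Java roughly as follows. This is only an 
illustration under stated assumptions, not an implementation against the 
PDFBox API: the Line class, the method names, the spaceWidth parameter (the 
space that would precede the next word) and the farRatio threshold (what 
counts as "far from the end of the block") are all hypothetical, and each line 
is assumed to arrive with its measurements already computed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types or names exist in PDFBox.
final class ParagraphEndSketch {

    /** One extracted line, reduced to the measurements the heuristic needs. */
    static final class Line {
        final String text;
        final float lastCharRightX; // x coordinate where the last character ends
        final float firstWordWidth; // precomputed width of this line's first word

        Line(String text, float lastCharRightX, float firstWordWidth) {
            this.text = text;
            this.lastCharRightX = lastCharRightX;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /** Returns true if 'current' is the last line of a paragraph. */
    static boolean endsParagraph(Line current, Line next,
                                 float blockLeftX, float blockRightX,
                                 float spaceWidth, float farRatio) {
        // Case 2: there is no next word at all.
        if (next == null) {
            return true;
        }
        float remaining = blockRightX - current.lastCharRightX;
        // Case 3: the line stops far from the right edge of the block.
        if (remaining > farRatio * (blockRightX - blockLeftX)) {
            return true;
        }
        // Case 1: the next line's first word (plus a space) would have fit.
        return spaceWidth + next.firstWordWidth <= remaining;
    }

    /** Joins lines with spaces and starts a new paragraph at each detected end. */
    static List<String> toParagraphs(List<Line> lines,
                                     float blockLeftX, float blockRightX,
                                     float spaceWidth, float farRatio) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line line = lines.get(i);
            Line next = i + 1 < lines.size() ? lines.get(i + 1) : null;
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            if (endsParagraph(line, next, blockLeftX, blockRightX, spaceWidth, farRatio)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

Note that case 3 is largely covered by case 1 whenever the next word is 
shorter than the remaining space. Keeping it separate lets a short last line 
end the paragraph even when the next line starts with an unusually long word.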

Cheers,
A.

  was:
Hi,

Extracting text by paragraphs is probably the improvement that PDFBox users 
are most looking forward to.


*The current text extraction approach correctly detects the end of lines, but 
it does not correctly detect the end of paragraphs.*

*What is a paragraph?* A paragraph is a piece of text that contains one or 
more sentences. It may start with a tab, but this is not mandatory. A 
paragraph spans one or more lines but contains no carriage return (except the 
one at its very end). A paragraph can end before the very end of a line, but 
some paragraphs end exactly at the end of a line. If a paragraph ends exactly 
at the end of a line, no further lines containing words follow it.

*So the last line of a paragraph ends before reaching the very end of the 
line, unless no further lines containing words follow it.* Do you follow me? 
+An algorithm could use that pattern to detect paragraphs properly.+ 

In my opinion, the algorithm should never add a carriage return except when 
an end of paragraph is detected.

In my opinion, the algorithm should use the following information:
(*) the width of the block containing the paragraph;
(*) the padding and margins of that block;
(*) the precomputed width of the next word.

The width of a block is the width of the page minus the left and right 
margins if the PDF has a single column, or the width of one column if the PDF 
has several columns (two, for example).

The algorithm runs over every character. When it reaches the +last character 
of a line+, it precomputes +the first word of the next line+ to obtain its 
width.
If +this word+ would fit on the previous line after the +last character+, the 
algorithm concludes that the paragraph has ended (*case 1*).
If there is no +next word+, this is also the end of the paragraph (*case 
2*).
If the +last character+ is far from the end of the block, we automatically 
conclude that the paragraph has ended (*case 3*).

Cheers,
A.


> Detect end of paragraphs
> ------------------------
>
>                 Key: PDFBOX-3804
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3804
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.6, 2.0.7, 3.0.0
>            Reporter: Alexandre
>            Priority: Minor
>              Labels: extraction, paragraph
>         Attachments: example.pdf


