[
https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexandre updated PDFBOX-3804:
------------------------------
Description:
Hi,
Extracting text by paragraph is probably the improvement that PDFBox users are
most looking forward to.
*The current text extraction approach correctly detects the end of lines, but it
does not correctly detect the end of paragraphs.*
*What is a paragraph?* A paragraph is a piece of text that contains one or more
sentences. It can start with an indentation (tab), but this is not mandatory. A
paragraph spans one or more lines and contains no carriage return (except the
one at its very end). A paragraph can end before the very end of a line, but
some paragraphs end exactly at the end of a line. If a paragraph ends at the
very end of a line, no further lines containing words follow it.
*So the last line of a paragraph ends before reaching the very end of the line,
unless no further lines containing words follow it.* Do you follow me?
+An algorithm could use that pattern to detect paragraphs properly.+
In my opinion, the algorithm should use the following information:
(*) the +width of the block+ containing the paragraph;
(*) the precomputed width of the +first word on the next line+.
The +width of a block+ refers to the width of the area that contains the line
holding the character the algorithm is currently evaluating.
The algorithm runs over every character and, when it reaches the +last character
of a line+, it precomputes +the first word of the next line+ to obtain its
width.
If +this word+ would fit on the previous line after the +last character+, then
the algorithm concludes that the paragraph ends there (*case 1*).
If there is no +next word+, then this is also the end of the paragraph (*case
2*).
If the +last character+ is far from the end of the block, we automatically
conclude that the paragraph ends there (*case 3*).
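The three cases could be sketched roughly as follows. This is only an
illustration under my own assumptions, not existing PDFBox code: the class
ParagraphEndSketch, the method classify and the farRatio threshold are
hypothetical names, and all widths are assumed to be already measured in the
same units, with lineEndX counted from the left edge of the block.

```java
import java.util.OptionalDouble;

/** Hypothetical sketch of the three end-of-paragraph cases; not part of PDFBox. */
class ParagraphEndSketch {

    enum Result { CONTINUES, CASE_1_NEXT_WORD_FITS, CASE_2_NO_NEXT_WORD, CASE_3_FAR_FROM_BLOCK_END }

    /**
     * @param blockWidth    width of the block containing the current line
     * @param lineEndX      right edge of the last character, from the block's left edge
     * @param spaceWidth    width of one space in the current font (assumed known)
     * @param nextWordWidth width of the first word of the next line, empty if there is none
     * @param farRatio      assumed threshold: share of the block width that counts as "far"
     */
    static Result classify(double blockWidth, double lineEndX, double spaceWidth,
                           OptionalDouble nextWordWidth, double farRatio) {
        // Case 2: no further word follows, so the paragraph ends here.
        if (!nextWordWidth.isPresent()) {
            return Result.CASE_2_NO_NEXT_WORD;
        }
        double remaining = blockWidth - lineEndX;
        // Case 1: the next word (plus a separating space) would have fitted on this line.
        if (spaceWidth + nextWordWidth.getAsDouble() <= remaining) {
            return Result.CASE_1_NEXT_WORD_FITS;
        }
        // Case 3: the line stops far from the end of the block (e.g. before a very long word).
        if (remaining >= farRatio * blockWidth) {
            return Result.CASE_3_FAR_FROM_BLOCK_END;
        }
        return Result.CONTINUES;
    }

    public static void main(String[] args) {
        // Block of 400 units, line ends at 250, next word is 40 wide.
        System.out.println(classify(400, 250, 3, OptionalDouble.of(40), 0.25));
        // Line nearly fills the block, so the next word could not have fitted.
        System.out.println(classify(400, 390, 3, OptionalDouble.of(40), 0.25));
        // Last line of the text: there is no next word.
        System.out.println(classify(400, 390, 3, OptionalDouble.empty(), 0.25));
    }
}
```

The case 2 test comes first only because case 1 needs a next word to measure.
The value to use for farRatio is an open question and would probably need
tuning against real documents.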
Cheers,
A.
was:
Hi,
Extracting text by paragraph is probably the improvement that PDFBox users are
most looking forward to.
*The current text extraction approach correctly detects the end of lines, but it
does not correctly detect the end of paragraphs.*
*What is a paragraph?* A paragraph is a piece of text that contains one or more
sentences. It can start with an indentation (tab), but this is not mandatory. A
paragraph spans one or more lines and contains no carriage return (except the
one at its very end). A paragraph can end before the very end of a line, but
some paragraphs end exactly at the end of a line. If a paragraph ends at the
very end of a line, no further lines containing words follow it.
*So the last line of a paragraph ends before reaching the very end of the line,
unless no further lines containing words follow it.* Do you follow me?
+An algorithm could use that pattern to detect paragraphs properly.+
In my opinion, the algorithm should use the following information:
(*) the +width of the block+ containing the paragraph;
(*) the precomputed width of the +first word on the next line+.
The +width of a block+ is either the width of the PDF page minus the left and
right margins, if the PDF has a single column, or the width of one column, if
the PDF has, for example, two columns.
The algorithm runs over every character and, when it reaches the +last character
of a line+, it precomputes +the first word of the next line+ to obtain its
width.
If +this word+ would fit on the previous line after the +last character+, then
the algorithm concludes that the paragraph ends there (*case 1*).
If there is no +next word+, then this is also the end of the paragraph (*case
2*).
If the +last character+ is far from the end of the block, we automatically
conclude that the paragraph ends there (*case 3*).
Cheers,
A.
> Detect end of paragraphs
> ------------------------
>
> Key: PDFBOX-3804
> URL: https://issues.apache.org/jira/browse/PDFBOX-3804
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.6, 2.0.7, 3.0.0
> Reporter: Alexandre
> Priority: Minor
> Labels: extraction, paragraph
> Attachments: example.pdf