[
https://issues.apache.org/jira/browse/PDFBOX-3804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexandre updated PDFBOX-3804:
------------------------------
Description:
Hi,
Extracting text by paragraphs is probably the improvement most eagerly awaited
by PDFBox users.
*The current text extraction approach correctly detects ends of lines, but it
does not correctly detect ends of paragraphs.*
*What is a paragraph?* A paragraph is a piece of text that contains one or
several sentences. It can start with a tab, but this is not mandatory. A
paragraph has one or more lines, but no carriage return (except the one at the
very end). A paragraph can end before the very end of a line, but some
paragraphs end exactly at the end of a line. If a paragraph ends at the very
end of a line, no further lines containing words follow it.
*So, the last line of a paragraph ends before reaching the very end of the
line, unless no further lines containing words follow it.* Do you follow me?
+An algorithm could use that pattern to detect paragraphs properly.+
In my opinion, the algorithm should use the following information:
(*) the width of the block containing the paragraph;
(*) the padding and margin of that block;
(*) the precomputed width of the next word.
The width of a block is the width of the page minus the left and right margins
if the PDF has one column, or the width of a column if the PDF has several
columns (two, for example).
The algorithm runs on every character and, when it reaches the +last character
of a line+, it precomputes the width of +the first word of the next line+.
If +this word+ would have fit on the previous line after the +last character+,
then the algorithm concludes an end of paragraph (*case 1*).
If there is no +next word+, then this is also the end of the paragraph (*case
2*).
If the +last character+ is far from the end of the block, we automatically
conclude that the paragraph has ended (*case 3*).
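The three cases above could be sketched roughly as follows. This is only an
illustration, not PDFBox code: the Line class, the ParagraphEndDetector class
and the spaceWidth and farThreshold parameters are all hypothetical names, and
it assumes each line's end position and first-word width have already been
measured in the same units as the block's right edge.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the proposed heuristic; every name here is hypothetical. */
class ParagraphEndDetector {

    /** Layout data for one extracted line (assumed to be measured beforehand). */
    static final class Line {
        final String text;
        final float endX;           // x-coordinate just after the last character
        final float firstWordWidth; // width of the first word of this line

        Line(String text, float endX, float firstWordWidth) {
            this.text = text;
            this.endX = endX;
            this.firstWordWidth = firstWordWidth;
        }
    }

    /**
     * Decides whether 'current' is the last line of a paragraph.
     * blockRightX is the right edge of the block (page width minus margins,
     * or the right edge of the column); next is null when no line follows.
     */
    static boolean isParagraphEnd(Line current, Line next, float blockRightX,
                                  float spaceWidth, float farThreshold) {
        // Case 2: there is no next word at all.
        if (next == null) {
            return true;
        }
        float remaining = blockRightX - current.endX;
        // Case 3: the last character is far from the end of the block.
        if (remaining > farThreshold) {
            return true;
        }
        // Case 1: the first word of the next line (plus a space) would have
        // fit on this line, so the line break was not forced by wrapping.
        return spaceWidth + next.firstWordWidth <= remaining;
    }

    /** Joins lines into paragraphs, breaking only where an end is detected. */
    static List<String> toParagraphs(List<Line> lines, float blockRightX,
                                     float spaceWidth, float farThreshold) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            Line current = lines.get(i);
            Line next = i + 1 < lines.size() ? lines.get(i + 1) : null;
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(current.text);
            if (isParagraphEnd(current, next, blockRightX, spaceWidth, farThreshold)) {
                paragraphs.add(sb.toString());
                sb.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

Case 3 is checked separately from case 1 here so that a threshold can be tuned
independently, although in practice a line that ends far from the block edge
will often satisfy case 1 as well.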
Cheers,
A.
was:
Hi,
Extracting text by paragraphs is probably the improvement most eagerly awaited
by PDFBox users.
*The current text extraction approach correctly detects ends of lines, but it
does not correctly detect ends of paragraphs.*
*What is a paragraph?* A paragraph is a piece of text that contains one or
several sentences. It can start with a tab, but this is not mandatory. A
paragraph has one or more lines, but no carriage return (except the one at the
very end). A paragraph can end before the very end of a line, but some
paragraphs end exactly at the end of a line. If a paragraph ends at the very
end of a line, no further lines containing words follow it.
*So, the last line of a paragraph ends before reaching the very end of the
line, unless no further lines containing words follow it.* Do you follow me?
+An algorithm could use that pattern to detect paragraphs properly.+
In my opinion, the algorithm should never add a carriage return except when an
end of paragraph is detected.
In my opinion, the algorithm should use the following information:
(*) the width of the block containing the paragraph;
(*) the padding and margin of that block;
(*) the precomputed width of the next word.
The width of a block is the width of the page minus the left and right margins
if the PDF has one column, or the width of a column if the PDF has several
columns (two, for example).
The algorithm runs on every character and, when it reaches the +last character
of a line+, it precomputes the width of +the first word of the next line+.
If +this word+ would have fit on the previous line after the +last character+,
then the algorithm concludes an end of paragraph (*case 1*).
If there is no +next word+, then this is also the end of the paragraph (*case
2*).
If the +last character+ is far from the end of the block, we automatically
conclude that the paragraph has ended (*case 3*).
Cheers,
A.
> Detect end of paragraphs
> ------------------------
>
> Key: PDFBOX-3804
> URL: https://issues.apache.org/jira/browse/PDFBOX-3804
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 2.0.6, 2.0.7, 3.0.0
> Reporter: Alexandre
> Priority: Minor
> Labels: extraction, paragraph
> Attachments: example.pdf
>