Yuri Burrows created PDFBOX-1874:
------------------------------------
Summary: PDFTextStripper.isParagraphSeparation(...)
Key: PDFBOX-1874
URL: https://issues.apache.org/jira/browse/PDFBOX-1874
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.3
Environment: Eclipse
Reporter: Yuri Burrows
Priority: Minor
PDFTextStripper.isParagraphSeparation(...) seems to have an issue with how it
finds Y text indentation.
PROBLEM:
I believe the issue is due to precision in the the following logic:
float yGap = Math.abs(position.getTextPosition().getYDirAdj()-
lastPosition.getTextPosition().getYDirAdj());
float xGap = (position.getTextPosition().getXDirAdj()-
lastLineStartPosition.getTextPosition().getXDirAdj());
if(yGap > (getDropThreshold()*maxHeightForLine))
{
result = true;
yGap has a precision to 1000th+, while (getDropThreshold()*maxHeightForLine)
has a precision to 100,000th. Resulting in the following comparison (example):
16.018 > 16.018005
which evaluates to "True". However 16.018 > 16.018 would evaluate to "False".
EFFECT OF THE PROBLEM:
every line in the output is marked as "isParagraphStart = true" and
"writeParagraphEnd() ... = true".
I.E.
|||NEW_LINE|||
|||PARAGRAPH_START|||PDFBox has been designed to represent PDF documents using
familiar object-oriented paradigms. The data|||NEW_LINE|||
contained in a PDF document is a collection of basic object types: arrays,
booleans, dictionaries, numbers,|||NEW_LINE|||
|||PARAGRAPH_END||||||NEW_LINE|||
|||PARAGRAPH_START|||strings and binary streams. PDFBox captures these basic
object types in the org.pdfbox.cos package (the|||NEW_LINE|||
COS Model). While it's possible to create any desired interactions with a PDF
document using only these|||NEW_LINE|||
|||PARAGRAPH_END||||||NEW_LINE|||
In the source PDF these lines appear as such:
"PDFBox has been designed to represent PDF documents using familiar
object-oriented paradigms. The data
contained in a PDF document is a collection of basic object types: arrays,
booleans, dictionaries, numbers,
strings and binary streams. PDFBox captures these basic object types in the
org.pdfbox.cos package (the
COS Model). While it's possible to create any desired interactions with a PDF
document using only these"
MY WORKAROUND:
NOTE: there is a small performance hit with this workaround.
float yGap = Math.abs(position.getTextPosition().getYDirAdj()
- lastPosition.getTextPosition().getYDirAdj());
DecimalFormat df = new DecimalFormat("#.00");
float yGapTruncated = Float.valueOf(df.format(yGap));
float newYVal = Float.valueOf(df.format(getDropThreshold()
* maxHeightForLine));
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)