[
https://issues.apache.org/jira/browse/PDFBOX-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr closed PDFBOX-2584.
-----------------------------------
Resolution: Cannot Reproduce
Closing due to no reaction. You can still comment or reopen if needed.
> Text extraction reports zero character widths
> ----------------------------------------------
>
> Key: PDFBOX-2584
> URL: https://issues.apache.org/jira/browse/PDFBOX-2584
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.8
> Reporter: Pavel Misurkin
> Attachments: stip_2c.pdf
>
>
> We are using the PDFBox API to get the positions of characters within a document.
> We have found a problem with one document: text extraction extracts the text
> correctly but sets every character's width to zero.
> The code is pretty simple:
> {code}
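> import java.io.File;
> import java.io.StringWriter;
> import java.io.Writer;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper; // 1.8.x package
>
> File input = new File("stip_2c.pdf");
> PDDocument document = PDDocument.load(input);
>
> PDFTextStripper extractor = new PDFTextStripper();
> Writer output = new StringWriter();
> extractor.writeText(document, output); // fills charactersByArticle while processing pages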
> {code}
> We then examine the extractor's charactersByArticle member for the
> character widths:
> - In 1.8.4, all character widths were zero.
> - In 1.8.8, all character widths were zero except for whitespace.
> See the new check added in 1.8.8, in file
> pdfbox-1.8.8-src\pdfbox\src\main\java\org\apache\pdfbox\util\PDFStreamEngine.java,
> line 369:
> {code}
> if (spaceWidthText == 0)
> {
>     spaceWidthText = 1.0f; // if could not find font, use a generic value
> }
> {code}
> - In version 2.0.0, the problem still exists.
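
For readers following the 1.8.8 check quoted in the report, here is a minimal standalone sketch of that fallback. It is not PDFBox code: the class, method, and constant names are hypothetical, and only the zero test and the 1.0f default come from the quoted snippet. Whether this check is what makes whitespace widths non-zero in 1.8.8 is the reporter's reading and was not confirmed in this issue.

```java
/** Hypothetical illustration of the quoted check; not part of PDFBox. */
final class SpaceWidthFallbackSketch {

    /** Generic value taken from the quoted 1.8.8 snippet. */
    static final float GENERIC_SPACE_WIDTH = 1.0f;

    /**
     * Returns the space width unchanged unless it is zero, in which case
     * the generic value is substituted, as in the quoted snippet.
     */
    static float effectiveSpaceWidth(float spaceWidthText) {
        if (spaceWidthText == 0) {
            return GENERIC_SPACE_WIDTH;
        }
        return spaceWidthText;
    }
}
```

Under these assumptions, a zero space width passed to effectiveSpaceWidth comes back as the generic value, while any non-zero width is returned as-is. The sketch covers only the space width; it says nothing about how the widths of other glyphs are computed.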