Rens Huizenga created PDFBOX-4293:
-------------------------------------
Summary: PDFBox does not align "columns" properly
Key: PDFBOX-4293
URL: https://issues.apache.org/jira/browse/PDFBOX-4293
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.11
Environment: Windows 7 64
Reporter: Rens Huizenga
Fix For: 2.0.11
Attachments: PDFconversieTekst CONVERTIO.txt, PDFconversieTekst.pdf,
PDFconversieTekst.pdf.txt, PDFconversieTekst.xlsx
I have to convert PDFs to database data. I developed a parser that reads .txt
files. The original data is available in PDFs only, so the .txt files have to
be created by using Tika to convert the PDFs to .txt. After conversion I
see an alignment issue in the .txt data compared to the columns in the
PDF. On the Tika website I read that I need to check whether the problem also
occurs in PDFBox, so I checked for that. PDFBox has the same issue.
These lines of PDF data:
a b c d e
a b c d e
are both presented as
a b c d e
in the text file, causing for example numbers to be presented in the wrong
"column".
Unfortunately I cannot share business documents, but I have created an example
in Excel, saved it as PDF and converted it to .txt. See attachments.
In addition I converted the test set online with Convertio.co. Their result is
as expected, with enough spaces between the words/numbers to recognise the
columns.
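To illustrate why the spacing matters for a downstream parser, here is a
minimal sketch of a column splitter. It is not the reporter's parser and uses
no PDFBox or Tika API. The class name ColumnSplitter and the rule "two or more
consecutive spaces mark a column boundary" are assumptions made for this
example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
class ColumnSplitter {
    // Assumption: a run of two or more spaces separates two columns.
    private static final Pattern GAP = Pattern.compile(" {2,}");

    // Splits one line of extracted text into its non-empty cells.
    static List<String> splitColumns(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : GAP.split(line.trim())) {
            if (!cell.isEmpty()) {
                cells.add(cell);
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Wide gaps preserved: the cells can be told apart.
        System.out.println(splitColumns("a    b    c    d    e"));
        // Gaps collapsed to single spaces: the whole line becomes one cell.
        System.out.println(splitColumns("a b c d e"));
    }
}
```

With the gaps collapsed to single spaces, the splitter can no longer see any
column boundary. Note that this sketch only finds boundaries. It cannot tell
which column a value belongs to when a cell in between is empty, which would
need position information that the plain text no longer carries.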