TextExtraction mixes case of text
---------------------------------
Key: PDFBOX-846
URL: https://issues.apache.org/jira/browse/PDFBOX-846
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.2.1
Environment: Windows server, .NET
Reporter: Mark Looi
When running text extraction on a file such as
http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text
"THAI VEGGIE WRAP" (displayed in all caps) is extracted as:
"ThAI VeGGIe wRAP". However, examining the PDF shows that the text looks like
this: "Thai V eggi e Wrap". The related text on the following lines, such as
"Crisp red cabbage, cucumbers, carrots and lettuce with Thai", is extracted
correctly.
We are using this code to get the text in C#:
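// Download the PDF and wrap the bytes in a Java stream for PDFBox
byte[] pdfData = myWebClient.DownloadData(pdfUrl);
string text = string.Empty;
ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
PDDocument doc = PDDocument.load(stream);
try
{
    // Extract the text of the whole document
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(doc);
}
finally
{
    // Release the document even if extraction throws
    doc.close();
}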