[ https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-846:
--------------------------------------
Attachment: PDFBOX846-Menu_WA_032509.pdf
> TextExtraction mixes case of text
> ---------------------------------
>
> Key: PDFBOX-846
> URL: https://issues.apache.org/jira/browse/PDFBOX-846
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Environment: Windows server, .NET
> Reporter: Mark Looi
> Attachments: PDFBOX846-Menu_WA_032509.pdf,
> PDFBOX846-Menu_WA_032509.txt
>
>
> Using text extraction on a file like this,
> http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text (in all
> caps) "THAI VEGGIE WRAP" is extracted as
> "ThAI VeGGIe wRAP". However, examining the PDF shows that it looks like
> this: "Thai V eggi e Wrap". The related text on the next lines, such as
> "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", is extracted
> just fine.
> We are using this code to get the text in C#:
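> byte[] pdfData = myWebClient.DownloadData(pdfUrl);
> string text = string.Empty;
> ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
> PDDocument doc = PDDocument.load(stream);
> try
> {
>     // Extract the text of all pages with the default stripper settings.
>     PDFTextStripper stripper = new PDFTextStripper();
>     text = stripper.getText(doc);
> }
> finally
> {
>     // Close the document even if extraction throws.
>     doc.close();
> }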
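
To check extracted text for the symptom described above, a small heuristic can flag words in which an uppercase letter follows a lowercase one (as in "ThAI"). The following is only a sketch in plain Java, with no PDFBox dependency. The class and method names (CaseScrambleCheck, isSuspiciousWord, suspiciousWords) are hypothetical and not part of PDFBox, and the rule is an assumption that will also flag legitimate words such as "iPhone".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: flags words whose letter case
// looks scrambled, as in the "ThAI VeGGIe wRAP" output from the report.
final class CaseScrambleCheck {

    // Assumed rule: a word is suspicious if an uppercase letter appears
    // after a lowercase letter within the same word.
    static boolean isSuspiciousWord(String word) {
        boolean seenLower = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (Character.isLowerCase(c)) {
                seenLower = true;
            } else if (Character.isUpperCase(c) && seenLower) {
                return true;
            }
        }
        return false;
    }

    // Splits the text on whitespace and collects every suspicious word,
    // in the order in which it appears.
    static List<String> suspiciousWords(String text) {
        List<String> hits = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (isSuspiciousWord(word)) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Strings taken from the report above.
        System.out.println(suspiciousWords("ThAI VeGGIe wRAP"));
        System.out.println(suspiciousWords("Thai V eggi e Wrap"));
    }
}
```

Running such a check over the stripper output for each page would show whether the scrambling is limited to headings like the one reported or also affects body text.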