[ 
https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-846:
--------------------------------------

    Attachment: PDFBOX846-Menu_WA_032509.pdf

> TextExtraction mixes case of text
> ---------------------------------
>
>                 Key: PDFBOX-846
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-846
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Windows server, .NET
>            Reporter: Mark Looi
>         Attachments: PDFBOX846-Menu_WA_032509.pdf, 
> PDFBOX846-Menu_WA_032509.txt
>
>
> When running text extraction on this file,
> http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text
> "THAI VEGGIE WRAP" (shown in all caps) is extracted as
> "ThAI VeGGIe wRAP". However, examining the PDF shows that the text looks
> like this: "Thai V eggi e Wrap". The related text on the following lines,
> such as "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", is
> extracted correctly.
> We are using this code to get the text in C#:
>     byte[] pdfData = myWebClient.DownloadData(pdfUrl);
>     string text = string.Empty;
>     ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
>     PDDocument doc = PDDocument.load(stream);
>     PDFTextStripper stripper = new PDFTextStripper();
>     text = stripper.getText(doc);
>     doc.close();
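
To locate affected words in the extracted output (for example, the attached .txt file), a small checker along the following lines may help. This is only a sketch, written in Java rather than the C# used above; it does not use PDFBox, and the class name MixedCaseCheck and the method suspiciousWords are hypothetical names chosen for illustration. It assumes that a word is suspicious when it contains a lower-case letter and also an upper-case letter anywhere after its first character.

```java
import java.util.ArrayList;
import java.util.List;

public class MixedCaseCheck {
    // Hypothetical helper: returns the whitespace-separated words that contain
    // a lower-case letter and an upper-case letter after the first character.
    // "Thai" and "THAI" are not flagged; "ThAI" and "wRAP" are.
    public static List<String> suspiciousWords(String text) {
        List<String> flagged = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            boolean hasLower = false;
            boolean hasUpperAfterFirst = false;
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                if (Character.isLowerCase(c)) {
                    hasLower = true;
                }
                if (i > 0 && Character.isUpperCase(c)) {
                    hasUpperAfterFirst = true;
                }
            }
            if (hasLower && hasUpperAfterFirst) {
                flagged.add(word);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // The string reported in this issue.
        System.out.println(suspiciousWords("ThAI VeGGIe wRAP"));
        // A line that was reported as extracted correctly.
        System.out.println(suspiciousWords(
            "Crisp red cabbage, cucumbers, carrots and lettuce with Thai"));
    }
}
```

Run against the two strings quoted in the report, the first call flags all three words and the second flags none. Legitimate mixed-case words (such as camel-case product names) would also be flagged, so the result is a starting point for manual review and not a definitive test.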

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
