TextExtraction mixes case of text
---------------------------------

                 Key: PDFBOX-846
                 URL: https://issues.apache.org/jira/browse/PDFBOX-846
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.2.1
         Environment: Windows server, .NET
            Reporter: Mark Looi


When text extraction is run on a file like this one,
http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text
"THAI VEGGIE WRAP" (displayed in all caps) is extracted as
"ThAI VeGGIe wRAP". However, examining the PDF shows that the text looks like
this: "Thai V eggi e Wrap". The related text on the following lines, such as
"Crisp red cabbage, cucumbers, carrots and lettuce with Thai", is extracted
correctly.

We are using this code to get the text in C#:

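    // Namespaces assumed for a .NET (IKVM-style) build of PDFBox 1.2.1.
    using java.io;
    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;

    // Download the PDF and extract its text.
    byte[] pdfData = myWebClient.DownloadData(pdfUrl);
    string text = string.Empty;

    ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
    PDDocument doc = PDDocument.load(stream);
    PDFTextStripper stripper = new PDFTextStripper();
    text = stripper.getText(doc);
    doc.close();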

