[ 
https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-846:
--------------------------------------

    Attachment: PDFBOX846-Menu_WA_032509.txt

First of all, you should enable sorting by position by calling 
stripper.setSortByPosition(true) before extracting the text. 

I'm attaching the extraction result from the current trunk version (1003396). 
It looks quite good, but there are still some extra spaces.
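
To illustrate what sorting by position means in general, here is a toy sketch 
in plain Java. It is not PDFBox's implementation: the Fragment class, the 
coordinates and the drawing order are all made up for illustration and are 
not taken from the attached PDF. The idea is only that text pieces may appear 
in the content stream in a different order than they appear on the page, and 
that ordering them top-to-bottom, then left-to-right, restores the reading 
order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    // Hypothetical stand-in for a positioned piece of text; not a PDFBox class.
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Concatenates the fragment texts in the order given.
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Returns a sorted copy: top-to-bottom first, then left-to-right.
    // Assumption: y grows downward, so a smaller y is higher on the page.
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    public static void main(String[] args) {
        // Invented drawing order: the leading capitals first, the rest later.
        List<Fragment> streamOrder = new ArrayList<>();
        streamOrder.add(new Fragment(10f, 100f, "T"));
        streamOrder.add(new Fragment(40f, 100f, "W"));
        streamOrder.add(new Fragment(16f, 100f, "HAI "));
        streamOrder.add(new Fragment(48f, 100f, "RAP"));

        System.out.println(join(streamOrder));                 // TWHAI RAP
        System.out.println(join(sortByPosition(streamOrder))); // THAI WRAP
    }
}
```

In the reporter's snippet, the corresponding change would be the single call 
stripper.setSortByPosition(true) placed before stripper.getText(doc).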

> TextExtraction mixes case of text
> ---------------------------------
>
>                 Key: PDFBOX-846
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-846
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Windows server, .NET
>            Reporter: Mark Looi
>         Attachments: PDFBOX846-Menu_WA_032509.txt
>
>
> Using text extraction on a file like this, 
> http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text (in all 
> CAPS) "THAI VEGGIE WRAP" is extracted as 
> "ThAI VeGGIe wRAP". However, examining the PDF shows that it looks like 
> this: "Thai V eggi e Wrap". The related text on the next lines, such as 
> "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", parses just 
> fine.
> We are using this code to get the text in C#:
>     byte[] pdfData = myWebClient.DownloadData(pdfUrl);
>     string text = string.Empty;
>     ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
>     PDDocument doc = PDDocument.load(stream);
>     PDFTextStripper stripper = new PDFTextStripper();
>     text = stripper.getText(doc);
>     doc.close();

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
