[jira] Resolved: (PDFBOX-846) TextExtraction mixes case of text

JIRA Sat, 16 Oct 2010 10:46:45 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler resolved PDFBOX-846.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.3.0

I fixed the calculation of the space width in revision 1023338. The scaling of 
the text matrix and the CTM wasn't taken into account before, which confused the 
algorithm that decides whether a space has to be added or not.
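
As a rough illustration of the idea (this is not the code committed in 
revision 1023338), the sketch below scales a font's space width by the 
horizontal scale of the text matrix and the CTM before comparing it with the 
gap between two glyphs. The class name, method names, the 1/1000 glyph unit 
convention and the tolerance factor are all assumptions made for this example.

```java
// Illustrative sketch only; names and parameters are hypothetical,
// not taken from the PDFBox source.
final class SpaceWidthSketch {

    private SpaceWidthSketch() {
    }

    /**
     * Space width in display units: the font's space width (assumed to be
     * given in 1/1000 text space units) scaled by the font size and by the
     * horizontal scaling of the text matrix and the CTM.
     */
    static double scaledSpaceWidth(double spaceWidthGlyphUnits, double fontSize,
                                   double textMatrixScaleX, double ctmScaleX) {
        double textSpaceWidth = spaceWidthGlyphUnits / 1000.0 * fontSize;
        return textSpaceWidth * textMatrixScaleX * ctmScaleX;
    }

    /**
     * Decides whether the horizontal gap between two glyphs is wide enough
     * to count as a word break. The tolerance is an assumed fraction of the
     * scaled space width.
     */
    static boolean needsSpace(double previousEndX, double nextStartX,
                              double scaledSpaceWidth, double tolerance) {
        double gap = nextStartX - previousEndX;
        return gap > scaledSpaceWidth * tolerance;
    }
}
```

With a space width of 250 glyph units at font size 12, the unscaled width is 
3.0 display units. If the text matrix scales by 0.5, the scaled width is 1.5, 
so a gap of 1.0 counts as a word break at a tolerance of 0.5 only when the 
scaling is applied. Ignoring the scaling leads to wrong spacing decisions of 
this kind.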

> TextExtraction mixes case of text
> ---------------------------------
>
>                 Key: PDFBOX-846
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-846
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>         Environment: Windows server, .NET
>            Reporter: Mark Looi
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.3.0
>
>         Attachments: PDFBOX846-Menu_WA_032509.pdf, 
> PDFBOX846-Menu_WA_032509.txt
>
>
> Using text extraction on a file like this, 
> http://www.organictogo.com/pdf/catering/Menu_WA_032509.pdf, the text (in all 
> CAPS) "THAI VEGGIE WRAP" is extracted as:
> "ThAI VeGGIe wRAP". However, examining the PDF shows that it looks like 
> this: "Thai V eggi e Wrap". The related text on the following lines, such as 
> "Crisp red cabbage, cucumbers, carrots and lettuce with Thai", parses just 
> fine.
> We are using this code to get the text in C#:
>     byte[] pdfData = myWebClient.DownloadData(pdfUrl);
>     string text = string.Empty;
>     ByteArrayInputStream stream = new ByteArrayInputStream(pdfData);
>     PDDocument doc = PDDocument.load(stream);
>     PDFTextStripper stripper = new PDFTextStripper();
>     text = stripper.getText(doc);
>     doc.close();

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
