[jira] Resolved: (PDFBOX-654) Extracting CJK text

JIRA Wed, 10 Mar 2010 10:17:49 -0800

     [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andreas Lehmkühler resolved PDFBOX-654.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1.0

Wow, that's a really great improvement. I've applied the patch in revision 
921494. All text extraction tests still pass. To test the patch, I extracted 
the text from the document attached to PDFBOX-420. I can't actually read the 
result, but I compared the "pictures" in the text file with those displayed in 
Acrobat, and they look great.

Thanks to Atsuo for the contribution.

> Extracting CJK text
> -------------------
>
>                 Key: PDFBOX-654
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-654
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Atsuo Ishimoto
>             Fix For: 1.1.0
>
>         Attachments: identity-h.patch
>
>
> This is an update to PDFBOX-420, filed by Takashi Komatsubara.
> With this patch, if "Identity-H" is used as the font's encoding and the font 
> doesn't supply a TO_UNICODE table, the encoding name is generated from the CID 
> information (Registry and Ordering). The idea is borrowed from pdfminer[1], 
> another PDF library, written in Python. I don't see any test failures with 
> this patch.
> I published this patch last year[2] and got some good feedback from 
> Japanese users[3].
> [1] http://www.unixuser.org/~euske/python/pdfminer/index.html
> [2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja, 
>     https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
> [3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539
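
The fallback described in the issue can be sketched roughly as follows. This is
only an illustration of the idea, not the code in identity-h.patch: the class
name, the method name, and the "Registry-Ordering" naming scheme are
assumptions made for this example, and none of them are PDFBox APIs.

```java
import java.util.Optional;

// Hypothetical sketch; not part of PDFBox and not taken from the patch.
public class CidEncodingNameSketch {

    /**
     * Returns a fallback encoding name built from the CID system information,
     * or an empty Optional when the fallback does not apply.
     *
     * Assumption: the name is formed as "Registry-Ordering", for example
     * "Adobe-Japan1". The real patch may use a different scheme.
     */
    public static Optional<String> fallbackEncodingName(
            String encoding, boolean hasToUnicode,
            String registry, String ordering) {
        // The fallback only applies to Identity-H fonts without a ToUnicode table.
        if (hasToUnicode || !"Identity-H".equals(encoding)) {
            return Optional.empty();
        }
        // Without both Registry and Ordering there is nothing to build a name from.
        if (registry == null || ordering == null) {
            return Optional.empty();
        }
        return Optional.of(registry + "-" + ordering);
    }

    public static void main(String[] args) {
        // A Japanese CID font with no ToUnicode table.
        System.out.println(
            fallbackEncodingName("Identity-H", false, "Adobe", "Japan1"));
    }
}
```

In this sketch the ToUnicode table still takes precedence whenever the font
supplies one, so the generated name is only a last resort for fonts that would
otherwise yield no usable text.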

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
