[jira] Commented: (PDFBOX-654) Extracting CJK text

Atsuo Ishimoto (JIRA) Sat, 13 Mar 2010 02:48:57 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844854#action_12844854 ]


Atsuo Ishimoto commented on PDFBOX-654:
---------------------------------------

Thank you for the file. I cannot read Chinese, but the characters appear
to be extracted correctly to me. Could you be more specific about the
problem you found?

> Extracting CJK text
> -------------------
>
>                 Key: PDFBOX-654
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-654
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Atsuo Ishimoto
>             Fix For: 1.1.0
>
>         Attachments: China.pdf, identity-h.patch
>
>
> This is an update for PDFBOX-420, filed by Takashi Komatsubara.
> In this patch, if "Identity-H" is used as the encoding of a font and the font
> doesn't supply a TO_UNICODE table, then the encoding name is generated from
> the CID information (Registry and Ordering). This idea is borrowed from
> pdfminer[1], another PDF library written in Python. I don't see any test
> failures with this patch.
> I published this patch last year[2] and got some good feedback from
> Japanese users[3].
> [1] http://www.unixuser.org/~euske/python/pdfminer/index.html
> [2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja, 
>     https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
> [3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539
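
As a rough illustration of the fallback described in the quoted issue text, the
sketch below builds an encoding (CMap) name from the Registry and Ordering
values when the font uses "Identity-H" and has no TO_UNICODE table. It is not
the code from identity-h.patch. The class name, method name, and parameters are
hypothetical, and the "-UCS2" suffix is an assumption modeled on the names of
Adobe's CID-to-Unicode CMap resources (for example Adobe-Japan1-UCS2); the
patch itself may construct the name differently.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of PDFBox or the patch.
final class IdentityHFallbackSketch {

    /**
     * Returns a candidate CMap name derived from the CID system info,
     * or an empty Optional when the fallback does not apply.
     */
    static Optional<String> fallbackCMapName(String encoding,
                                             boolean hasToUnicode,
                                             String registry,
                                             String ordering) {
        // The fallback only applies to Identity-H fonts without a ToUnicode table.
        if (!"Identity-H".equals(encoding) || hasToUnicode) {
            return Optional.empty();
        }
        // Without both Registry and Ordering there is nothing to build a name from.
        if (registry == null || registry.isEmpty()
                || ordering == null || ordering.isEmpty()) {
            return Optional.empty();
        }
        // Assumed naming scheme: <Registry>-<Ordering>-UCS2.
        return Optional.of(registry + "-" + ordering + "-UCS2");
    }

    public static void main(String[] args) {
        // A Japanese CID font with Registry "Adobe" and Ordering "Japan1".
        System.out.println(
            fallbackCMapName("Identity-H", false, "Adobe", "Japan1").orElse("(none)"));
    }
}
```

Returning an empty Optional when a TO_UNICODE table is present keeps the
existing extraction path in charge, so the derived name is used only as a last
resort.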

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
