[jira] Created: (PDFBOX-654) Extracting CJK text

Atsuo Ishimoto (JIRA) Tue, 09 Mar 2010 22:37:55 -0800

Extracting CJK text
-------------------

                 Key: PDFBOX-654
                 URL: https://issues.apache.org/jira/browse/PDFBOX-654
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
            Reporter: Atsuo Ishimoto



This is an update for PDFBOX-420 filed by Takashi Komatsubara.

With this patch, if a font uses "Identity-H" as its encoding and does not 
supply a TO_UNICODE table, the encoding name is generated from the font's CID 
information (Registry and Ordering). The idea is borrowed from pdfminer[1], 
another PDF library written in Python. I don't see any test failures with this 
patch.
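
As a rough illustration of the rule described above (not the patch itself), 
the fallback could be sketched in Java as follows. The class and method names 
are hypothetical and are not part of PDFBox. The "-UCS2" suffix is an 
assumption, modeled on the names of Adobe's CID-to-Unicode CMap resources such 
as "Adobe-Japan1-UCS2"; the actual patch may build the name differently.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class CidEncodingName {
    private CidEncodingName() {
    }

    /**
     * Returns a CMap-style encoding name built from the CID system info,
     * or null when the fallback rule does not apply.
     *
     * @param encoding     the font's encoding name, e.g. "Identity-H"
     * @param hasToUnicode whether the font supplies its own to-Unicode table
     * @param registry     CIDSystemInfo Registry, e.g. "Adobe"
     * @param ordering     CIDSystemInfo Ordering, e.g. "Japan1"
     */
    static String fallbackEncodingName(String encoding, boolean hasToUnicode,
                                       String registry, String ordering) {
        // The fallback is only meant for Identity-H fonts without a
        // to-Unicode table; otherwise the existing lookup should be used.
        if (!"Identity-H".equals(encoding) || hasToUnicode) {
            return null;
        }
        // Without Registry and Ordering there is nothing to derive from.
        if (registry == null || ordering == null) {
            return null;
        }
        // Assumed naming scheme: <Registry>-<Ordering>-UCS2.
        return registry + "-" + ordering + "-UCS2";
    }
}
```

For a Japanese font with Registry "Adobe" and Ordering "Japan1", this sketch 
yields "Adobe-Japan1-UCS2", which would then be used to look up the mapping 
from CIDs to Unicode.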

I published this patch last year[2] and got good feedback from Japanese 
users[3].

[1] http://www.unixuser.org/~euske/python/pdfminer/index.html
[2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja
    https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
[3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
