[jira] Issue Comment Edited: (PDFBOX-654) Extracting CJK text

Takashi Komatsubara (JIRA) Fri, 12 Mar 2010 05:44:52 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844496#action_12844496
 ]


Takashi Komatsubara edited comment on PDFBOX-654 at 3/12/10 1:42 PM:
---------------------------------------------------------------------

Hi Andreas san and Atsuo san,

I have tested Atsuo san's patch and confirmed that it passes the Maven 
tests.
I have also successfully extracted Japanese text from many PDF files with it.
At the moment, his patch gives the best-quality text extraction from Japanese 
PDF files.

Unfortunately, when I tested Chinese PDF files with his patch, the result was 
not good. Chinese text seems to be handled by a different type implemented 
within the PDF file.

As one of the Japanese PDFBox developers, I would like you to include Japanese 
PDF files in the Maven tests.



      was (Author: takashi-smi):
    Hi Andreas san and Atsuo san,

I have tested Atsuo san's patch and confirmed that his patch passed the maven 
test.
Also I have successfully extract Japanese text from many pdf files.
Currently, his patch is the highest quality of exporting text from Japanese PDF 
files.

Unfortunately, I have tested with Chinese pdf files with his patch.
The result is not good. Chinese handling seems to be using different type 
implemented within pdf file.

As one of Japanese pdfbox developer, I would like you guys to include Japanese 
pdf files for the maven testing,


  
> Extracting CJK text
> -------------------
>
>                 Key: PDFBOX-654
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-654
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Atsuo Ishimoto
>             Fix For: 1.1.0
>
>         Attachments: identity-h.patch
>
>
> This is an update for PDFBOX-420, filed by Takashi Komatsubara.
> With this patch, if "Identity-H" is used as the encoding of a font and the 
> font doesn't supply a TO_UNICODE table, the encoding name is generated from 
> the CID information (Registry and Ordering). This idea is borrowed from 
> pdfminer[1], another PDF library, written in Python. I don't see any test 
> failures with this patch.
> I published this patch last year[2] and got some good feedback from 
> Japanese users[3].
> [1] http://www.unixuser.org/~euske/python/pdfminer/index.html
> [2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja, 
>     https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
> [3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539
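
The fallback described in the issue text can be sketched roughly as follows.
This is only an illustration and not the code from identity-h.patch: the class
CidEncodingNames, the method fallbackEncodingName and the "Registry-Ordering"
naming scheme (for example "Adobe-Japan1") are assumptions made for this
example, and nothing here calls the PDFBox API.

```java
import java.util.Optional;

// Hypothetical helper, not part of PDFBox: decides whether an encoding name
// should be derived from the CID system information of a font.
final class CidEncodingNames {
    private CidEncodingNames() {
    }

    // Returns a name built from Registry and Ordering when the font uses
    // "Identity-H" and has no ToUnicode table; otherwise returns empty.
    // Joining the two values with a hyphen is an assumption of this sketch.
    static Optional<String> fallbackEncodingName(String encoding,
                                                 boolean hasToUnicode,
                                                 String registry,
                                                 String ordering) {
        if (!"Identity-H".equals(encoding) || hasToUnicode) {
            // Either a different encoding or the font already maps to Unicode.
            return Optional.empty();
        }
        if (registry == null || registry.isEmpty()
                || ordering == null || ordering.isEmpty()) {
            // Without CID system information there is nothing to derive.
            return Optional.empty();
        }
        return Optional.of(registry + "-" + ordering);
    }

    public static void main(String[] args) {
        // A Japanese CID font without a ToUnicode table.
        System.out.println(
            fallbackEncodingName("Identity-H", false, "Adobe", "Japan1"));
        // The same font with a ToUnicode table: no fallback is needed.
        System.out.println(
            fallbackEncodingName("Identity-H", true, "Adobe", "Japan1"));
    }
}
```

In this sketch the derived name would then be used to look up a matching
character map; how that lookup is done is left out because the issue text
does not describe it.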

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
