[jira] Updated: (PDFBOX-328) PDFTextStripper not handling some Japanese

Oliver Sauder (JIRA) Fri, 10 Dec 2010 05:14:29 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Oliver Sauder updated PDFBOX-328:
---------------------------------

    Attachment: PDFTransform_japanese.pdf
                PDFTransform_japanese_out.txt

> PDFTextStripper not handling some Japanese
> ------------------------------------------
>
>                 Key: PDFBOX-328
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-328
>             Project: PDFBox
>          Issue Type: Bug
>            Priority: Minor
>         Attachments: PDFTransform_japanese.pdf, PDFTransform_japanese_out.txt
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552833&aid=1841058
> Originally submitted by sflaumen on 2007-11-29 07:33.
> Using this code sequence: 
>     PDDocument document = PDDocument.load(stream);
>     PDFTextStripper stripper = new PDFTextStripper();
>     String contents = stripper.getText(document);
> some Japanese documents are handled properly. This is shown by viewing the 
> chars in the String "contents".
> However, other Japanese documents produce garbage non-Japanese characters as 
> viewed in the String contents. 
> The ones that are not handled properly in PDFTextStripper display a prompt 
> when opened in Acrobat Reader which says that a Japanese language support 
> pack needs to be installed to view the document properly. The ones that are 
> handled properly display Japanese characters fine when viewed through Acrobat 
> Reader. Installing the language support pack is not a solution since it would 
> only resolve the display in Acrobat Reader. This code needs to run on a Unix 
> server so even if the support pack would provide help on a PC (unlikely) it 
> would have no effect on the task when run in Unix.
> This appears to be an encoding issue; however, unlike similar issues that have 
> been reported, the above code completes successfully. It is just that the 
> results are as described above.
> Attached is an example of a PDF file that is not handled properly by 
> PDFTextStripper and requires a Japanese language pack to view in Acrobat 
> Reader.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552833&aid=1841058&file_id=256615
> JS51ZX3PWT1G.pdf (application/pdf), 84799 bytes
> Not handled properly by PDFTextStripper 
> [comment on SourceForge]
> Originally sent by sflaumen.
> Logged In: YES 
> user_id=1948467
> Originator: YES
> After looking over the code in PDFBox, I would like to suggest that this 
> problem is caused by not having the latest cmap files in the PDFBox cmap 
> folder. Specifically, this folder contains cmap files through the 
> Adobe-Japan1-4 Character Collection. However, Adobe has since added further 
> character collections, namely Adobe-Japan1-5 and Adobe-Japan1-6. See Adobe 
> Technical Note #5078. 
> Also, I downloaded the Japanese font support pack for Acrobat Reader 8.0 
> which did resolve the display issue for viewing this pdf document. You can 
> find the list of cmap files in the Resources folder for Acrobat after the 
> download. However, copying these into the one for PDFBox did not solve the 
> problem. I think it is because the identity cmap files, which are needed to 
> do the conversion, are missing. See the 00_ReadMe.pdf in the PDFBox cmaps folder. 
> Please let me know if I'm on the right track. This technology is new to me. 
> Thanks, Steve
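
The report says the problem shows up when viewing the chars in the String
"contents". A minimal sketch of how that check could be automated on a server
follows. It uses only the Java standard library; the class name
JapaneseTextCheck, its two methods, and the choice of Unicode blocks are
assumptions made for illustration and are not part of PDFBox. In the reported
scenario, the input would be the String returned by stripper.getText(document).

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not part of PDFBox: estimates how much of an
// extracted string falls in Unicode blocks commonly used for Japanese text.
final class JapaneseTextCheck {

    /** True if the code point lies in one of the blocks assumed here to indicate Japanese. */
    static boolean isJapanese(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.HIRAGANA
            || block == UnicodeBlock.KATAKANA
            || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    /** Fraction of non-whitespace code points in those blocks (0.0 for empty input). */
    static double japaneseRatio(String text) {
        int total = 0;
        int japanese = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about the script
            }
            total++;
            if (isJapanese(cp)) {
                japanese++;
            }
        }
        return total == 0 ? 0.0 : (double) japanese / total;
    }

    public static void main(String[] args) {
        // Placeholder input; replace with the text extracted from the PDF.
        String contents = args.length > 0
            ? args[0]
            : "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8";
        System.out.printf("Japanese ratio: %.2f%n", japaneseRatio(contents));
    }
}
```

A ratio near zero for a document known to be Japanese would be consistent with
the garbage output described above, although a low ratio alone does not show
which cmap or encoding is at fault.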

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
