[
https://issues.apache.org/jira/browse/PDFBOX-328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Oliver Sauder updated PDFBOX-328:
---------------------------------
Attachment: PDFTransform_japanese.pdf
PDFTransform_japanese_out.txt
> PDFTextStripper not handling some Japanese
> ------------------------------------------
>
> Key: PDFBOX-328
> URL: https://issues.apache.org/jira/browse/PDFBOX-328
> Project: PDFBox
> Issue Type: Bug
> Priority: Minor
> Attachments: PDFTransform_japanese.pdf, PDFTransform_japanese_out.txt
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552833&aid=1841058
> Originally submitted by sflaumen on 2007-11-29 07:33.
> Using this code sequence:
> PDDocument document = PDDocument.load(stream);
> PDFTextStripper stripper = new PDFTextStripper();
> String contents = stripper.getText(document);
> some Japanese documents are handled properly. This is shown by viewing the
> chars in the String "contents".
> However, other Japanese documents produce garbage non-Japanese characters as
> viewed in the String contents.
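
Because extraction completes without an exception in both cases, one way to tell the two outcomes apart on a server is to measure how much of the returned string is actually Japanese script. The following is a minimal sketch using only the JDK. `JapaneseTextCheck`, `isJapanese`, and `japaneseRatio` are hypothetical names, not part of PDFBox, and the choice of Unicode blocks is an assumption about what counts as "Japanese" here.

```java
// Hypothetical helper, not part of PDFBox: estimates how much of an
// extracted string is made up of characters used in Japanese text.
final class JapaneseTextCheck {

    // True if the code point lies in a Unicode block commonly used for Japanese.
    // The set of blocks is an assumption; extend it if your documents need more.
    static boolean isJapanese(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    // Fraction of non-whitespace code points that are Japanese.
    // Returns 0.0 when the input has no non-whitespace characters.
    static double japaneseRatio(String contents) {
        int total = 0;
        int japanese = 0;
        for (int i = 0; i < contents.length(); ) {
            int cp = contents.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (isJapanese(cp)) {
                japanese++;
            }
        }
        return total == 0 ? 0.0 : (double) japanese / total;
    }
}
```

Calling `JapaneseTextCheck.japaneseRatio(contents)` on the output of `stripper.getText(document)` should give a value near 1.0 for a correctly extracted Japanese document and a value near 0.0 for the garbage case. Any threshold in between is a judgment call for the caller.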
> The ones that are not handled properly in PDFTextStripper display a prompt
> when opened in Acrobat Reader which says that a Japanese language support
> pack needs to be installed to view the document properly. The ones that are
> handled properly display Japanese characters fine when viewed through Acrobat
> Reader. Installing the language support pack is not a solution since it would
> only resolve the display in Acrobat Reader. This code needs to run on a Unix
> server, so even if the support pack helped on a PC (unlikely), it
> would have no effect on the task when run on Unix.
> This appears to be an encoding issue. However, unlike similar issues that have
> been reported, the above code completes successfully. It is just that the
> results are as described above.
> Attached is an example of a PDF file that is not handled properly by
> PDFTextStripper and requires a Japanese language pack to view in Acrobat
> Reader.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552833&aid=1841058&file_id=256615
> JS51ZX3PWT1G.pdf (application/pdf), 84799 bytes
> Not handled properly by PDFTextStripper
> [comment on SourceForge]
> Originally sent by sflaumen.
> Logged In: YES
> user_id=1948467
> Originator: YES
> After looking over the code in PDFBox, I would like to suggest that this
> problem is caused by not having the latest cmap files in the PDFBox cmap
> folder. Specifically, this folder contains cmap files through the
> Adobe-Japan1-4 Character Collection. However, additional character
> collections have been added by Adobe since then. Specifically, they now
> contain collections for Adobe-Japan1-5 and Adobe-Japan1-6. See Adobe
> Technical Note #5078.
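
To check which supplement a given cmap file targets, one option is to read the number from its `CIDSystemInfo` dictionary. The sketch below assumes the files in the cmap folder are standard Adobe CMap resource files, which are plain text and contain a line of the form `/Supplement N def`. `CMapSupplement` and `supplementOf` are hypothetical names, not PDFBox API.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: extracts the supplement number
// from the text of an Adobe CMap resource file.
final class CMapSupplement {

    // Matches e.g. "/Supplement 4 def" inside the CIDSystemInfo dictionary.
    private static final Pattern SUPPLEMENT =
        Pattern.compile("/Supplement\\s+(\\d+)\\s+def");

    // Returns the first supplement number found, or empty if none is present.
    static OptionalInt supplementOf(String cmapText) {
        Matcher m = SUPPLEMENT.matcher(cmapText);
        return m.find()
            ? OptionalInt.of(Integer.parseInt(m.group(1)))
            : OptionalInt.empty();
    }
}
```

Running this over the text of each file in the folder would show whether any of them declares a supplement above 4, which is the gap the comment above suspects.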
> Also, I downloaded the Japanese font support pack for Acrobat Reader 8.0
> which did resolve the display issue for viewing this pdf document. You can
> find the list of cmap files in the Resources folder for Acrobat after the
> download. However, copying these into the PDFBox cmap folder did not solve the
> problem. I think this is because the identity cmap files, which are needed
> to do the conversion, are missing. See the 00_ReadMe.pdf in the PDFBox cmaps folder.
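
A quick way to test the "missing files" theory is to compare the contents of the cmap folder against a list of names you expect to find there. This is a rough sketch with hypothetical names (`CMapFolderCheck`, `listNames`, `missing`). The report does not say which files are meant, so any names passed in, such as `Identity-H`, are placeholders to be replaced with the ones named in 00_ReadMe.pdf.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper, not part of PDFBox: reports which expected cmap
// file names are absent from a folder.
final class CMapFolderCheck {

    // Collects the names of the regular files directly inside cmapDir.
    static Set<String> listNames(Path cmapDir) throws IOException {
        Set<String> names = new HashSet<>();
        try (Stream<Path> entries = Files.list(cmapDir)) {
            entries.filter(Files::isRegularFile)
                   .forEach(p -> names.add(p.getFileName().toString()));
        }
        return names;
    }

    // Returns the expected names that are not present, in their given order.
    static List<String> missing(Set<String> presentNames, List<String> expectedNames) {
        List<String> result = new ArrayList<>();
        for (String name : expectedNames) {
            if (!presentNames.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

For example, `CMapFolderCheck.missing(CMapFolderCheck.listNames(dir), expected)` returns an empty list when every expected file is present in `dir`.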
> Please let me know if I'm on the right track. This technology is new to me.
> Thanks, Steve