Re: Extract text from asian PDF document

Andreas Lehmkühler Mon, 28 Sep 2009 10:10:14 -0700

Hi,

Bernd Engelhardt wrote:
> Hi,
> I am trying to extract some text content from a PDF file. If I use a PDF file 
> with Western content, everything works perfectly. If I try to do the same with a 
> PDF file that contains some Asian characters, I get an exception (see 
> below). As far as I can see, the cmap "UniJIS-UCS2-H" is in the 
> "Resources/cmap" folder. Do I have to load the cmap, or is this map 
> loaded automatically? Does PDFBox support Asian languages? What do I have to do 
> to support such languages? Any hint is welcome. Thanks
I'm afraid there are still some open issues concerning Asian character
mappings. See [1] and [2] for further details.


BR
Andreas Lehmkühler


[1] https://issues.apache.org/jira/browse/PDFBOX-509
[2] https://issues.apache.org/jira/browse/PDFBOX-420
