Hi,
I am trying to extract text content from a PDF file. With a PDF file that
contains only Western text, everything works perfectly. When I try the same
with a PDF file that contains Asian characters, I get an exception (see below).
As far as I can see, the cmap "UniJIS-UCS2-H" is in the "Resources/cmap" folder.
Do I have to load the cmap myself, or is it loaded automatically? Does PDFBox
support Asian languages? What do I have to do to support such languages? Any
hint is welcome. Thanks.
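
One thing that can be checked independently of PDFBox is whether the cmap file
is visible on the classpath at all. The following is only a minimal sketch
using the plain JDK. It assumes the cmap is packaged as a classpath resource
named "Resources/cmap/UniJIS-UCS2-H" (a guess based on the folder name above),
and the class and method names are made up for illustration. A positive result
would only show that the file can be found, not that PDFBox actually loads it.

```java
import java.net.URL;

// Hypothetical helper, not part of PDFBox: checks whether a resource
// can be located through the current class loader.
final class CmapResourceCheck {

    // Returns true if the class loader can find the given resource path.
    static boolean isOnClasspath(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the system class loader if no context loader is set.
            loader = ClassLoader.getSystemClassLoader();
        }
        URL url = loader.getResource(resourcePath);
        return url != null;
    }

    public static void main(String[] args) {
        // Assumed resource path; adjust it to wherever the cmap really lives.
        String path = "Resources/cmap/UniJIS-UCS2-H";
        System.out.println(path + " visible on classpath: " + isOnClasspath(path));
    }
}
```

If this reports false when run with the same classpath as the application, the
problem may be packaging rather than missing language support.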
Regards
Bernd
28.09.2009 13:45:55 org.apache.pdfbox.util.PDFStreamEngine processOperator
WARNUNG: java.io.IOException: Unknown encoding for 'UniJIS-UCS2-H'
java.io.IOException: Unknown encoding for 'UniJIS-UCS2-H'
at org.apache.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:68)
at org.apache.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:566)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:439)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:343)
at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:50)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:70)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:516)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:229)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
at de.softvision.job.Job.getContentFromPDF(Job.java:264)
at de.softvision.job.Job.loadPDF(Job.java:184)
at invoiceclearing.Main.main(Main.java:30)