> On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur
> <[email protected]> wrote:
> > I am using the PDFBox for one of the application. What I am doing is I
> > am extracting the PDF text from the PDF and generating the TOC
> > entries. But I am facing one problem, that is, if the PDF contains
> > these two characters "✠"(✠) and "Ⓔ"(Ⓔ) then the processpage(PDPage,
> > COSStream) gives an IOException "Unknown encoding for
> > 'UniJIS-UCS2-H' ". Can you let us know is there any way as to
> > overcome this problem?
>
> Unfortunately not. Unless someone else has a good answer,
> you'll probably need to look at the relevant source code in
> PDFBox to figure out what to do with this. If you do that,
> we'd be happy to apply any fix you may come up with.

I don't have a better answer than Jukka, but perhaps a hint on where to
look for the solution. As far as I understand, there are several Unicode
mappings defined in Resources/cmap. You need to check whether the two
characters you mentioned above are part of the mapping table
"UniJIS-UCS2-H". If they are not, the question becomes: is the problem in
the mapping file or in the software that produced the document?
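A rough sketch of that check, in case it helps. It assumes the mapping file follows the usual Adobe CMap text layout, with `begincidrange`/`endcidrange` and `begincidchar`/`endcidchar` sections holding hex codes such as `<2720>`. The class name `CMapCoverage`, the method `covers`, and the sample data (including its CID numbers) are invented for illustration and say nothing about what the real "UniJIS-UCS2-H" file contains. The two code points are U+2720 (✠) and U+24BA (Ⓔ).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CMapCoverage {
    // Matches one hex token such as <2720>.
    private static final Pattern HEX = Pattern.compile("<([0-9A-Fa-f]+)>");

    // Invented sample in CMap syntax; NOT the content of the real file.
    public static final String SAMPLE = String.join("\n",
        "1 begincidrange",
        "<0020> <007e> 1",
        "endcidrange",
        "1 begincidchar",
        "<2720> 500",
        "endcidchar");

    /** True if 'code' falls inside any cidrange or equals any cidchar entry. */
    public static boolean covers(String cmapText, int code) {
        boolean inRange = false;
        boolean inChar = false;
        for (String line : cmapText.split("\\R")) {
            String t = line.trim();
            if (t.endsWith("begincidrange")) { inRange = true; continue; }
            if (t.endsWith("begincidchar")) { inChar = true; continue; }
            if (t.startsWith("endcidrange")) { inRange = false; continue; }
            if (t.startsWith("endcidchar")) { inChar = false; continue; }
            if (!inRange && !inChar) continue;

            Matcher m = HEX.matcher(t);
            if (!m.find()) continue;
            int lo = Integer.parseInt(m.group(1), 16);
            int hi = lo; // a cidchar entry covers a single code
            if (inRange && m.find()) {
                hi = Integer.parseInt(m.group(1), 16); // end of the range
            }
            if (code >= lo && code <= hi) return true;
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Pass the path to the mapping file as the first argument;
        // without one, the invented sample above is used.
        String text = args.length > 0
            ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.ISO_8859_1)
            : SAMPLE;
        int[] codes = {0x2720, 0x24BA}; // the two characters from the report
        for (int code : codes) {
            System.out.printf("U+%04X covered: %b%n", code, covers(text, code));
        }
    }
}
```

If a code point is reported as not covered when run against the real file, that would point towards the mapping file; if both are covered, the document-producing software or the way PDFBox loads the encoding would be the next place to look.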
HTH
Andreas
