Re: [iText-questions] extracting text from pdfs with japanese data

Kevin Day Mon, 15 Dec 2008 08:51:10 -0800

I ran these files through com.lowagie.text.pdf.parser.PdfContentReaderTool and I actually see the tokeniser fail on the first, then the font read fail on the second.

Here's the exception from content1.pdf:

Exception in thread "main" ExceptionConverter: java.io.IOException: '>' not expected at file pointer 39040

I suspect the issue with content1.pdf is that the file uses an encoding that isn't built into standard Java. I'm not entirely sure how this sort of thing gets handled, but the PDF file is processed byte-by-byte, so there is no character set transformation going on. I'd like to hear other people's opinions on this.

Exception from tic_dogu2.pdf:

Exception in thread "main" java.lang.NullPointerException

at com.lowagie.text.pdf.PdfReader.getStreamBytes(PdfReader.java:2089)

This one is happening because the font resource cannot be recovered from the file (the font isn't embedded). This means that font metrics and CMap info would have to be recovered from an external file (no idea how to do this - it may be as simple as reading a CMap from an external source). One thing I note is that this file has no ToUnicode entry in any of the font references, which definitely implies that reading a CMap from an external file would be necessary.
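As a rough illustration of what "external" mapping information could look like, here is a minimal sketch that swaps in Java's own EUC-JP charset in place of a real CMap file. It rests on several assumptions that I have not verified against these PDFs: that the font uses the predefined EUC-H CMap (which the PDF specification describes as JIS X 0208 in EUC-JP encoding), that the string handed back by the tokeniser carries one char per raw byte, and that the running JVM ships the EUC-JP charset. The class and method names are hypothetical and are not part of iText.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iText.
class EucJpRedecode {

    /**
     * Treats each char of rawToken as a single byte and decodes the
     * resulting byte sequence as EUC-JP.
     */
    static String redecode(String rawToken) {
        // ISO-8859-1 maps chars 0x00-0xFF back to the identical byte values.
        byte[] bytes = rawToken.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, Charset.forName("EUC-JP"));
    }

    public static void main(String[] args) {
        // Simulate a tokeniser string whose chars are really EUC-JP bytes.
        String original = "\u65E5\u672C\u8A9E";
        byte[] eucBytes = original.getBytes(Charset.forName("EUC-JP"));
        String rawToken = new String(eucBytes, StandardCharsets.ISO_8859_1);

        // Re-decoding should give back the original Japanese text.
        System.out.println(redecode(rawToken).equals(original));
    }
}
```

This only covers the character-code side of the problem. Font metrics would still have to come from somewhere else, and a font using a different predefined CMap would need a different charset or a real CMap parser.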

I believe that this would involve an adjustment to DocumentFont to have it get the ToUnicode map from an external source if it isn't specified in the PDF itself. This may also require adjustment to the CMapAwareDocumentFont class. Probably this means adding a method to DocumentFont called getToUnicodeBytes() that contains the additional logic. Of course if we are doing surgery in that area, we should probably make adjustments to fillMetrics so it uses a CMap object directly (instead of the toUnicode byte array) - in which case the method in DocumentFont should be getCMap() (which would be a lot more object oriented, don't you think? :-) ).
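To make the getCMap() idea a bit more concrete, here is a minimal sketch of the fallback order: use the map from the PDF when there is one, otherwise look one up externally by encoding name. Everything in it is a hypothetical stand-in (CodeMap, FontStub, the registry map); none of it is existing iText API, and the real DocumentFont would of course look different.

```java
import java.util.HashMap;
import java.util.Map;

// All names here are hypothetical; none of this is existing iText API.
class CMapFallbackSketch {

    /** Maps a character code from a content stream to Unicode text. */
    interface CodeMap {
        String lookup(int code);
    }

    /** Stand-in for DocumentFont: an encoding name plus an optional embedded map. */
    static class FontStub {
        private final String encodingName;
        private final CodeMap embeddedToUnicode; // null when the PDF has no ToUnicode entry
        private final Map<String, CodeMap> externalMaps;

        FontStub(String encodingName, CodeMap embeddedToUnicode,
                 Map<String, CodeMap> externalMaps) {
            this.encodingName = encodingName;
            this.embeddedToUnicode = embeddedToUnicode;
            this.externalMaps = externalMaps;
        }

        /** Prefers the map from the PDF, then an external map for the encoding name. */
        CodeMap getCMap() {
            if (embeddedToUnicode != null) {
                return embeddedToUnicode;
            }
            CodeMap external = externalMaps.get(encodingName);
            if (external == null) {
                throw new IllegalStateException(
                        "No CMap available for encoding " + encodingName);
            }
            return external;
        }
    }

    public static void main(String[] args) {
        Map<String, CodeMap> external = new HashMap<>();
        // Toy one-entry table for illustration only, not a real CMap.
        external.put("EUC-H", code -> code == 0xC6FC ? "\u65E5" : "?");

        // No embedded ToUnicode map, so the external one is used.
        FontStub font = new FontStub("EUC-H", null, external);
        System.out.println(font.getCMap().lookup(0xC6FC).equals("\u65E5"));
    }
}
```

The point of returning an object rather than a byte array is that callers such as fillMetrics would not need to know where the mapping came from.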

At this stage, I think we need to get input from other folks so we can figure out how to proceed.

- K

----------------------- Original Message -----------------------

From: "Hoppe, Michael" <michael.ho...@fiz-karlsruhe.de>

To: <itext-questions@lists.sourceforge.net>

Cc:

Date: Mon, 15 Dec 2008 13:45:47 +0100

Subject: [iText-questions] extracting text from pdfs with japanese data

Dear all,

My name is Michael Hoppe, and I work for the eSciDoc project, which is funded by the German Ministry of Education and Research (http://www.escidoc.org). My part in the project is the search and indexing component, where we index metadata and full texts in PDF. For the indexing we need to extract the text from the PDF using iText. I now have problems extracting the text from Japanese PDFs where the font is not embedded. I either get garbled data or an exception that says ‘encoding not supported EUC-H’. Does anyone have an idea how to get the correct text for a Japanese document whose font is not embedded? Two PDFs are attached.

Thanks in advance

M.Hoppe

Code Snippet:

import com.lowagie.text.pdf.PRTokeniser;
import com.lowagie.text.pdf.PdfReader;

try {
    PdfReader reader = new PdfReader(inputFile);
    PRTokeniser token;
    StringBuilder builder = new StringBuilder();
    // Page numbers in PdfReader are 1-based.
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        byte[] pageBytes = reader.getPageContent(i);
        if (pageBytes != null) {
            token = new PRTokeniser(pageBytes);
            while (true) {
                try {
                    if (!token.nextToken()) {
                        break;
                    }
                    // Collect only string tokens; these hold raw character codes, not decoded text.
                    if (token.getTokenType() == PRTokeniser.TK_STRING) {
                        builder.append(token.getStringValue());
                    }
                } catch (Exception e) {
                    System.out.println(e);
                }
            }
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}

Dr. Michael Hoppe
ePublishing & eScience
Development & Applied Research
Phone +49 7247 808-251
Fax +49 7247 808-133
michael.ho...@fiz-karlsruhe.de

FIZ Karlsruhe
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany

www.fiz-karlsruhe.de

-------------------------------------------------------

Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische Information mbH. 
Sitz der Gesellschaft: Eggenstein-Leopoldshafen, Amtsgericht Mannheim HRB 101892. 
Geschäftsführerin: Sabine Brünger-Weilandt. 
Vorsitzender des Aufsichtsrats: MinR Hermann Riehl.

------------------------------------------------------------------------------
SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
The future of the web can't happen without you. Join us at MIX09 to help
pave the way to the Next Web now. Learn more and register at
http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/

_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php

