Speaking of this, any recommendations on using information from the per-page parse to figure out if text might be corrupt...without wrecking PDFBox's API?
https://issues.apache.org/jira/browse/TIKA-2749?focusedCommentId=16807661&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16807661

---------- Forwarded message ---------
From: Giovanni De Stefano (zxxz) <[email protected]>
Date: Tue, Apr 2, 2019 at 4:52 AM
Subject: Re: No Unicode mapping for xx (xx) in font null
To: <[email protected]>
Cc: <[email protected]>

Hello Tim, Peter,

Thank you for your replies. It seems indeed that the only solution is to include Tesseract in my processing pipeline.

I don't know whether it will be useful to future readers, but I noticed that *all* PDFs created with PDF24 exhibit this behavior. I guess this might fall under the "obfuscation" approach some software adopts :-(

Cheers,
Giovanni

On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <[email protected]> wrote:

I agree with Tim's analysis. Many "legacy" fonts (including, unfortunately, some of those used by LaTeX) are not mapped onto Unicode. There are two indications (codepoints and names) which can often be used to create a partial mapping. I spent a *lot* of time doing this manually. For example:

WARN No Unicode mapping for .notdef (89) in font null
WARN No Unicode mapping for 90 (90) in font null

The first field is the name, the second the codepoint. In your example the font (probably) uses codepoints consistently within that particular font, e.g. 89 is consistently the same character and different from 90. The names *may* differentiate characters. Here is my (hand-edited) entry for CMSY (used by LaTeX for symbols):

<codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>

But this will only work for this particular font. If you are dealing only with anglophone alphanumeric text from a single source/font, you can probably work out a table. You are welcome to use mine (mainly from scientific/technical publishing).

Beyond that, OCR/Tesseract may help (I use it a lot). However, maths and non-ISO-LATIN text are problematic.
For example, distinguishing between the many types of dash/minus/underline depends on having a system trained on these. Relative heights and sizes are a major problem.

In general, typesetters and their software are concerned only with the visual display and frequently use illiteracies (e.g. "=" + backspace + "/" for "not-equals"). Anyone having work typeset in PDF should insist that a Unicode font is used. Better still, avoid PDF.

--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
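Tim's question at the top of the thread (using per-page parse information to flag possibly corrupt text) plus Peter's hand-built codepoint table could be combined along these lines. This is only a sketch outside PDFBox's API, not a PDFBox or Tika feature: the class name `LegacyFontMapper`, the single CMSY-style table entry, and the 0.4 bad-glyph threshold are all hypothetical illustrations.

```java
import java.util.HashMap;
import java.util.Map;

public class LegacyFontMapper {
    // Hand-built partial mapping for one legacy font, in the spirit of
    // Peter's hand-edited entry:
    //   <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
    // The codepoint key below is illustrative, not from any shipped mapping.
    private final Map<Integer, String> byCodePoint = new HashMap<>();

    public LegacyFontMapper() {
        byCodePoint.put(0xB1, "\u00B1"); // hypothetical: 0xB1 -> PLUS-MINUS SIGN
    }

    public String map(int codePoint) {
        // Emit U+FFFD for unmapped glyphs so a downstream check can count them.
        return byCodePoint.getOrDefault(codePoint, "\uFFFD");
    }

    // Flag a page whose extracted text contains too many unmapped glyphs;
    // such pages are candidates for the Tesseract/OCR fallback discussed above.
    public static boolean looksCorrupt(String pageText, double maxBadRatio) {
        if (pageText.isEmpty()) return false;
        long bad = pageText.chars().filter(c -> c == 0xFFFD).count();
        return (double) bad / pageText.length() > maxBadRatio;
    }
}
```

A per-font table like this only helps for the specific font it was built against; the ratio check is font-agnostic and just decides whether a page should be rerouted to OCR.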
