Speaking of this, any recommendations on using information from the per-page parse to figure out if text might be corrupt...without wrecking PDFBox's API?
https://issues.apache.org/jira/browse/TIKA-2749?focusedCommentId=16807661&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16807661

---------- Forwarded message ---------
From: Giovanni De Stefano (zxxz) <[email protected]>
Date: Tue, Apr 2, 2019 at 4:52 AM
Subject: Re: No Unicode mapping for xx (xx) in font null
To: <[email protected]>
Cc: <[email protected]>

Hello Tim, Peter,

Thank you for your replies. It seems indeed that the only solution is to include Tesseract in my processing pipeline.

I don't know whether it will be useful to future readers, but I noticed that *all* PDFs created with PDF24 exhibit this behavior. I guess this might fall under the "obfuscation" approach some software adopts :-(

Cheers,
Giovanni

On 2 Apr 2019, 04:48 +0200, Peter Murray-Rust <[email protected]> wrote:

I agree with Tim's analysis. Many "legacy" fonts (including, unfortunately, some of those used by LaTeX) are not mapped onto Unicode. There are two indications (codepoints and names) which can often be used to create a partial mapping. I spent a *lot* of time doing this manually. For example:

WARN No Unicode mapping for .notdef (89) in font null
WARN No Unicode mapping for 90 (90) in font null

The first field is the name, the second the codepoint. In your example the font (probably) uses codepoints consistently within that particular font, e.g. 89 is consistently the same character and different from 90. The names *may* differentiate characters. Here is my (hand-edited) entry for CMSY (used by LaTeX for symbols):

<codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>

But this will only work for this particular font. If you are dealing only with anglophone alphanumeric text from a single source/font, you can probably work out a table. You are welcome to use mine (mainly from scientific/technical publishing).

Beyond that, OCR/Tesseract may help (I use it a lot). However, maths and non-ISO-LATIN text are problematic.
For example, distinguishing between the many types of dash/minus/underline depends on having a system trained on these. Relative heights and sizes are a major problem.

In general, typesetters and their software are concerned only with the visual display and frequently use illiteracies (e.g. "=" + backspace + "/" for "not-equals"). Anyone having work typeset in PDF should insist that a Unicode font is used. Better still, avoid PDF.

--
Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
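Tim's question at the top of the thread (using per-page parse information to flag possibly corrupt text) plus Peter's hand-built codepoint table could be combined along these lines. This is only a sketch outside PDFBox's API, not a PDFBox or Tika feature: the class name `LegacyFontMapper`, the single CMSY-style table entry, and the 0.4 bad-glyph threshold are all hypothetical illustrations.

```java
import java.util.HashMap;
import java.util.Map;

public class LegacyFontMapper {
    // Hand-built partial mapping for one legacy font, in the spirit of
    // Peter's hand-edited entry:
    //   <codePoint unicode="U+00B1" name=".notdef" note="PLUS-MINUS SIGN"/>
    // The codepoint key below is illustrative, not from any shipped mapping.
    private final Map<Integer, String> byCodePoint = new HashMap<>();

    public LegacyFontMapper() {
        byCodePoint.put(0xB1, "\u00B1"); // hypothetical: 0xB1 -> PLUS-MINUS SIGN
    }

    public String map(int codePoint) {
        // Emit U+FFFD for unmapped glyphs so a downstream check can count them.
        return byCodePoint.getOrDefault(codePoint, "\uFFFD");
    }

    // Flag a page whose extracted text contains too many unmapped glyphs;
    // such pages are candidates for the Tesseract/OCR fallback discussed above.
    public static boolean looksCorrupt(String pageText, double maxBadRatio) {
        if (pageText.isEmpty()) return false;
        long bad = pageText.chars().filter(c -> c == 0xFFFD).count();
        return (double) bad / pageText.length() > maxBadRatio;
    }
}
```

A per-font table like this only helps for the specific font it was built against; the ratio check is font-agnostic and just decides whether a page should be rerouted to OCR.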
