Re: Tika - handling custom encoding PDF files

Robert Muir Mon, 21 Jun 2021 19:35:39 -0700

On Mon, Jun 21, 2021 at 3:18 PM Nicholas DiPiazza <
[email protected]> wrote:


> Let's say we have a PDF with a bunch of custom encodings. They would look
> like this in your Font Properties:
>
> [image: image.png]
>
> Notice those with "encoding: custom".
>
> So even though the PDF has normal-looking Hebrew text such as:
>
> [image: image.png]
> When you copy it to clipboard it looks like this:
>
> ©°³ ž ³ž©¤³
>
> That's because the custom encoding does not actually map to UTF-8
> characters.
>
> Has anyone heard of a way to magically process these custom encodings to
> find a reasonable UTF-8 mapping?
>
>
I've done this by opening the font in FontForge, where you can see the glyph
table, and mapping each glyph to the proper Unicode sequence.
In your Hebrew case, you'd need additional processing beyond that, because
PDF glyphs are in visual order but Unicode text needs to be in logical
order. So if you just "map" and don't reorder, you will end up with
backwards text.
It is very annoying if there are a lot of ligatures, complex writing
systems, or both.
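
A rough sketch of the two steps in Python, to make the idea concrete. The
character codes and the GLYPH_TO_UNICODE table are invented for illustration;
a real table has to be built by hand from the actual font. The reordering is
a deliberately naive stand-in for the Unicode bidirectional algorithm: it
assumes each line is right-to-left overall and only keeps runs of digits and
Latin letters in their original order.

```python
import unicodedata

# Hypothetical hand-built table: custom character code -> Unicode string.
# These codes and assignments are made up; build the real table by
# inspecting the font's glyphs (for example in FontForge).
GLYPH_TO_UNICODE = {
    0x41: "\u05e9",  # shin
    0x42: "\u05dc",  # lamed
    0x43: "\u05d5",  # vav
    0x44: "\u05dd",  # final mem
    0x20: " ",
    0x31: "1",
    0x32: "2",
}


def map_glyphs(codes):
    """Map custom codes to Unicode, marking unknown codes with U+FFFD."""
    return "".join(GLYPH_TO_UNICODE.get(c, "\ufffd") for c in codes)


def visual_to_logical(line):
    """Naively convert one visually ordered RTL line to logical order."""
    out = []
    ltr_run = []
    # Reversing the line fixes the RTL text but flips LTR runs
    # (digits, Latin letters), so those runs are flipped back.
    for ch in line[::-1]:
        if unicodedata.bidirectional(ch) in ("L", "EN", "AN"):
            ltr_run.append(ch)
        else:
            out.extend(reversed(ltr_run))
            ltr_run = []
            out.append(ch)
    out.extend(reversed(ltr_run))
    return "".join(out)


# Codes in the order they appear on the page, left to right.
visual_codes = [0x31, 0x32, 0x20, 0x44, 0x43, 0x42, 0x41]
logical_text = visual_to_logical(map_glyphs(visual_codes))
```

This does not handle ligatures (one glyph mapping to several characters works
with the table, but several glyphs forming one character does not), mixed
paragraph directions, or punctuation mirroring, so treat it as a starting
point only.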
