Great, thanks everyone! I appreciate your responses. Yeah, sounds like this is definitely possible if we get really desperate. But very non-trivial.
-Nicholas

On Mon, Jun 21, 2021 at 9:35 PM Robert Muir <[email protected]> wrote:

> On Mon, Jun 21, 2021 at 3:18 PM Nicholas DiPiazza <[email protected]> wrote:
>
> > Let's say we have a PDF with a bunch of custom encodings. They would look
> > like this in your Font Properties:
> >
> > [image: image.png]
> >
> > Notice those with "encoding: custom".
> >
> > So even though the PDF has normal-looking Hebrew text such as:
> >
> > [image: image.png]
> >
> > when you copy it to the clipboard it looks like this:
> >
> > ©°³ ž ³ž©¤³
> >
> > That's because the custom encoding does not actually map to UTF-8
> > characters.
> >
> > Has anyone heard of a way to magically process these custom encodings to
> > find a reasonable UTF-8 mapping?
>
> I've done this by opening the font in FontForge, so you can see the glyph
> table, and mapping each glyph to proper Unicode sequences.
>
> In your Hebrew case, you'd need additional processing beyond that, because
> PDF glyphs will be in visual order but Unicode needs to be in logical
> order. So if you just "map" and don't reorder you will end up with
> backwards text.
>
> It is very annoying if there are a lot of ligatures, complex writing
> systems, or both.
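For the archives: Robert's two-step approach (remap each custom glyph code to its real Unicode character, then reorder the visual-order RTL run into logical order) can be sketched roughly like this. The glyph table below is entirely hypothetical; in practice you would build it by hand after inspecting the embedded font's glyph table in FontForge, and real mixed-direction text would need the full Unicode Bidi Algorithm rather than a naive reversal.

```python
# Hypothetical glyph-to-Unicode table, built by eyeballing the embedded
# font in FontForge. The custom codes and target letters here are made up.
GLYPH_MAP = {
    "\u00a9": "\u05ea",  # custom code -> Hebrew letter tav
    "\u00b0": "\u05d5",  # custom code -> Hebrew letter vav
    "\u00b3": "\u05e8",  # custom code -> Hebrew letter resh
}

def remap(text: str, glyph_map: dict) -> str:
    """Replace each custom-encoded character with its Unicode mapping,
    passing through anything not in the table (e.g. spaces)."""
    return "".join(glyph_map.get(ch, ch) for ch in text)

def visual_to_logical(run: str) -> str:
    """Naively reverse a visual-order run into logical order.

    This only handles the simple all-RTL case; mixed LTR/RTL text
    needs a real bidi implementation.
    """
    return run[::-1]

def extract(text: str, glyph_map: dict) -> str:
    """Remap custom glyph codes, then fix visual-to-logical ordering."""
    return visual_to_logical(remap(text, glyph_map))
```

Ligatures complicate this further because a single glyph may need to expand to a multi-character Unicode sequence, which this one-to-one table handles only if the map values are allowed to be multi-character strings (they are, since `remap` joins arbitrary strings).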
