Fwd: Title extraction question in Tika

Nicholas DiPiazza Wed, 21 Apr 2021 08:45:44 -0700

Hi Tika Users:

Does Tika have any built-in title extraction logic?


I am currently using a simple algorithm that:

1) Checks the metadata for a title and uses it if one is present.
2) If there is no title metadata, falls back to the body text: extracts the
first line of the body text and uses that as the title.

So let's say we have a PDF that has the following body text after parsing
with Tika:

\n\n\n\n\n\n\n\n- 4 -\n\n\nMy document title is here!\n\n...........

That results in

- 4 -

as a title. Not great, right? Ha!

So then I add something like:

3) If the first line has fewer than 5 alphanumeric characters, move on to the
next line, and so on, until a line qualifies as a title.

That works in this case but doesn't work for many other cases.
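
For concreteness, here is a rough sketch of steps 1) to 3) above. It is
plain Python with no Tika dependency; the function name pick_title, the
MIN_ALNUM constant, and the idea of passing the metadata title in as a
plain string are all just illustrative assumptions, not anything Tika
provides.

```python
from typing import Optional

# Threshold from step 3: lines with fewer alphanumeric characters are skipped.
MIN_ALNUM = 5


def pick_title(metadata_title: Optional[str], body_text: str) -> Optional[str]:
    """Hypothetical helper: return a title guess, or None if nothing qualifies."""
    # Step 1: prefer a non-blank metadata title.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    # Steps 2 and 3: take the first body line that has at least
    # MIN_ALNUM alphanumeric characters.
    for line in body_text.splitlines():
        candidate = line.strip()
        if sum(ch.isalnum() for ch in candidate) >= MIN_ALNUM:
            return candidate

    return None
```

On the PDF text above, pick_title(None, body) skips the blank lines and
"- 4 -" and returns "My document title is here!". The same threshold would
also skip a genuinely short title such as "FAQ", which is one of the ways
this kind of rule falls over.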

What are others doing for title extraction? I would imagine there's no
perfect solution here. Just curious what y'all are doing to troubleshoot
this stuff.

-Nicholas
