Hi Tika Users:

Does Tika have any built-in title extraction logic?

I am currently using a simple algorithm that:

1) Checks the metadata for a title and uses it if one is there.
2) If there is no title metadata, falls back to the body text: extracts the
first line of the body text and uses that as the title.

So let's say we have a PDF that has the following body text after parsing
with Tika:

\n\n\n\n\n\n\n\n- 4 -\n\n\nMy document title is here!\n\n...........

That results in

- 4 -

as a title. Not great, right? Ha!

So then I add something like:

3) If the first line has fewer than 5 alphanumeric characters, go to the
next line, and keep going until you find a line that qualifies as a title.
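
For concreteness, steps 1-3 can be sketched in Java roughly as follows. This
is only an illustration: the class name, method names, and the MIN_ALNUM
constant are hypothetical, and the metadata title and body text are passed in
as plain strings, so nothing here calls Tika itself.

```java
class TitleHeuristic {
    // Hypothetical threshold taken from step 3: a line needs at least
    // this many letters or digits to count as a title candidate.
    static final int MIN_ALNUM = 5;

    // Returns the metadata title if present, otherwise the first body
    // line with enough alphanumeric characters, otherwise null.
    static String pickTitle(String metadataTitle, String bodyText) {
        // Step 1: prefer the metadata title when it is non-blank.
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        if (bodyText == null) {
            return null;
        }
        // Steps 2 and 3: walk the body line by line, skipping blank
        // lines and lines such as "- 4 -" that are mostly punctuation.
        for (String line : bodyText.split("\\r?\\n")) {
            String candidate = line.trim();
            if (countAlnum(candidate) >= MIN_ALNUM) {
                return candidate;
            }
        }
        return null;
    }

    // Counts letters and digits, iterating by code point so that
    // non-ASCII titles are handled as well.
    static int countAlnum(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }
}
```

With the sample body text above and no metadata title, pickTitle skips the
blank lines and "- 4 -" (one alphanumeric character) and returns
"My document title is here!".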

That works in this case but doesn't work for many other cases.

What are others doing for title extraction? I would imagine there's no
perfect solution here. Just curious what y'all are doing to handle
this stuff.

-Nicholas
