Re: Title extraction question in Tika

Nicholas DiPiazza Wed, 21 Apr 2021 08:47:24 -0700

(Sorry all, ignore this. It was intended to be sent to the users list.)

On Wed, Apr 21, 2021 at 10:45 AM Nicholas DiPiazza <
[email protected]> wrote:


> Hi Tika Users:
>
> Does Tika have any built-in title extraction logic?
>
> I am currently using a simple algorithm that:
>
> 1) Checks the metadata for a title and uses it if present.
> 2) If there is no title metadata, falls back to the body text: extracts the
> first line of the body text and uses that as the title.
>
> So let's say we have a PDF that has the following body text after parsing
> with Tika:
>
> \n\n\n\n\n\n\n\n- 4 -\n\n\nMy document title is here!\n\n...........
>
> That results in
>
> - 4 -
>
> as a title. Not great, right? Ha!
>
> So then I add something like:
>
> 3) If the first line has fewer than 5 alphanumeric characters, go to the
> next line, and so on until you find a title.
>
> That works in this case but doesn't work for many other cases.
>
> What are others doing for title extraction? I would imagine there's no
> perfect solution here. Just curious what y'all are doing to troubleshoot
> this stuff.
>
> -Nicholas
>
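
For anyone reading along, the three steps in the quoted message can be sketched roughly as below. This is only an illustration of the heuristic as described, not anything built into Tika: the function name `extract_title`, its parameters, and the `min_alnum` threshold argument are all made up for this example, and the metadata title is assumed to have been read by the caller from whatever the parser returned.

```python
from typing import Optional


def extract_title(metadata_title: Optional[str],
                  body_text: str,
                  min_alnum: int = 5) -> Optional[str]:
    """Hypothetical helper sketching the three-step heuristic from the message."""
    # Step 1: prefer a non-empty title from the metadata.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    # Steps 2 and 3: walk the body text line by line and take the first
    # line with at least `min_alnum` alphanumeric characters.
    for line in body_text.splitlines():
        candidate = line.strip()
        alnum_count = sum(ch.isalnum() for ch in candidate)
        if alnum_count >= min_alnum:
            return candidate

    # No line qualified; the caller decides on a fallback (e.g. the file name).
    return None


# The body text from the example in the message.
body = "\n\n\n\n\n\n\n\n- 4 -\n\n\nMy document title is here!\n\n..........."
print(extract_title(None, body))  # prints: My document title is here!
```

As the message says, this handles the "- 4 -" case but not much else: a running header, a date line, or a long page footer would pass the five-character check just as easily as a real title.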
