(Sorry all, ignore this. It was intended to be sent to the users list.)

On Wed, Apr 21, 2021 at 10:45 AM Nicholas DiPiazza < [email protected]> wrote:
> Hi Tika Users:
>
> Does Tika have any built-in title extraction logic?
>
> I am currently using a simple algorithm that:
>
> 1) Checks metadata for a title. Use that if there.
> 2) If no title metadata, then use the body text. Extract the first line of
> the body text and use that as the title.
>
> So let's say we have a PDF that has the following body text after parsing
> with Tika:
>
> \n\n\n\n\n\n\n\n- 4 -\n\n\nMy document title is here!\n\n...........
>
> That results in
>
> - 4 -
>
> as a title. Not great, right? Ha!
>
> So then I add something like:
>
> 3) If the first line has < 5 alphanumeric characters, go to the next line
> until you find a title.
>
> That works in this case but doesn't work for many other cases.
>
> What are others doing for title extraction? I would imagine there's no
> perfect solution here. Just curious what y'all are doing to troubleshoot
> this stuff.
>
> -Nicholas
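
For reference, the three steps in the quoted message could be sketched roughly as below. This is only an illustration in Python, not anything Tika provides: the function name `pick_title`, the plain `metadata` dict with a "title" key, and the `min_alnum` parameter are all assumptions made for the example.

```python
def pick_title(metadata, body_text, min_alnum=5):
    """Pick a title: metadata first, else the first 'wordy' body line.

    All names here are hypothetical; `metadata` is assumed to be a plain
    dict and `body_text` the already-extracted text of the document.
    """
    # Step 1: use the metadata title if it is present and non-blank.
    title = (metadata.get("title") or "").strip()
    if title:
        return title

    # Steps 2 and 3: walk the body line by line and skip any line with
    # fewer than `min_alnum` alphanumeric characters (blank lines,
    # page markers such as "- 4 -", rows of dots, and so on).
    for line in body_text.splitlines():
        candidate = line.strip()
        if sum(ch.isalnum() for ch in candidate) >= min_alnum:
            return candidate

    # Nothing in the body looked like a title.
    return None


body = "\n\n\n\n\n\n\n\n- 4 -\n\n\nMy document title is here!\n\n..........."
print(pick_title({}, body))  # prints "My document title is here!"
```

As the message says, this handles the "- 4 -" case but is still only a heuristic; a running header or a date line with five or more alphanumeric characters would be picked up just as readily.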
