[Nutch-general] Re: Pdf document title in nutch search

Jérôme Charron Mon, 20 Feb 2006 15:36:03 -0800

> It'd be nice if this was changed so that if a PDF has no title then the
> first xx words become the new title.


I agree with that.
Please create a JIRA issue for this point.
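
For the JIRA issue, the requested behaviour could be sketched roughly as
below. This is only an illustration in Python, not Nutch's parser code: the
function name fallback_title, its parameters, and the default of 8 words
(standing in for the "xx" above) are all assumptions made for the example.

```python
def fallback_title(title, body_text, max_words=8):
    """Return the document title, or the first max_words words of its text.

    Hypothetical helper: the name, parameters and default are illustrative.
    """
    # Use the document's own Title property when it is present and non-blank.
    if title and title.strip():
        return title.strip()
    # Otherwise fall back to the leading words of the extracted text.
    words = (body_text or "").split()
    return " ".join(words[:max_words])
```

A PDF with an empty Title property and a body starting with "Search engines
and site optimization for the web" would then get the title "Search engines
and site optimization" when max_words is 5.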


> (but it seems that the Google title process is more advanced than this)

Really?
Take a look at this:
http://www.google.com/search?num=100&hl=fr&safe=off&c2coff=1&as_qdr=all&q=http%3A%2F%2Fwww.trellix.com%2Fproducts%2Fdownloads%2Fsearchengines_siteopt.pdf++++&btnG=Rechercher&lr=
In fact, Google always takes the first characters of the document as the
title.
Google never uses the Title property of the document.
So, when the first characters of the PDF document are shadowed, you get a
"TTTiiitttllleee llliiikkkeee ttthhhaaattt" ... is that really advanced
title processing?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
