[Nutch-general] Re: Pdf document title in nutch search

Håvard W. Kongsgård Tue, 21 Feb 2006 03:26:01 -0800

Take a look at the Google search result for this RAND publication:
http://www.google.com/search?hs=z0n&hl=en&lr=&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&q=Implementing+Security+Improvement+Options+at+Los+Angeles+International+Airport+&btnG=Search

The PDF document (RAND_DB468-1.sum.pdf) has no PDF title, and Google doesn't use the first 2 pages of the document for a title!




Jérôme Charron wrote:

It'd be nice if this was changed so that if a PDF has no title then the
first xx words become the new title.


I agree with that.
Please create a JIRA issue for this point.
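
As a rough illustration of the fallback proposed above (use the first xx words when the PDF has no title), here is a minimal sketch in plain Java. It is not Nutch's PDF parser code: the class name TitleFallback, the method chooseTitle, and the maxWords parameter are hypothetical names made up for this example, and it assumes the metadata title and the extracted text are already available as strings.

```java
// Hypothetical helper, not part of Nutch: picks a display title for a parsed document.
final class TitleFallback {

    private TitleFallback() {
        // static helper only
    }

    /**
     * Returns the metadata title when it is non-blank; otherwise returns the
     * first maxWords whitespace-separated words of the extracted text.
     * Returns an empty string when neither source yields anything.
     */
    static String chooseTitle(String metadataTitle, String extractedText, int maxWords) {
        // Prefer the document's own Title property when one is set.
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        if (extractedText == null || maxWords <= 0) {
            return "";
        }
        String text = extractedText.trim();
        if (text.isEmpty()) {
            return "";
        }
        // Split on any run of whitespace so line breaks in the PDF text
        // do not end up inside the title.
        String[] words = text.split("\\s+");
        int count = Math.min(maxWords, words.length);
        StringBuilder title = new StringBuilder();
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                title.append(' ');
            }
            title.append(words[i]);
        }
        return title.toString();
    }
}
```

A parser could call something like TitleFallback.chooseTitle(title, text, 10) just before storing the title; the word limit of 10 is an arbitrary choice here, since the proposal leaves "xx" open.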

(but it seems that the Google title process is more advanced than this)


Really?
Take a look at this:
http://www.google.com/search?num=100&hl=fr&safe=off&c2coff=1&as_qdr=all&q=http%3A%2F%2Fwww.trellix.com%2Fproducts%2Fdownloads%2Fsearchengines_siteopt.pdf++++&btnG=Rechercher&lr=
In fact, Google always takes the first characters of the document as the
title.
Google never uses the Title property of the document.
So, when there are some shaded characters among the first characters of the PDF
document, you get a "TTTiiitttllleee llliiikkkeee ttthhhaaattt" ... is that
really advanced title processing?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




