Hi,
I found something interesting that, as far as I understand, could improve the Nutch results considerably in the long term.
I heard in a talk that Google takes the _first_ 100 KB of a page. As far as I know, Nutch only downloads pages that are <= 100 KB.
That is a big difference!
As far as I know, from a linguistic point of view most of the information is at the beginning of a text.
As far as I know, navigation links are usually at the top of the page as well.
Changing this would not be easy, though, since most content parsers need the complete 'file' to process it correctly.
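
To make the difference concrete, here is a minimal Java sketch of the two policies. It is not Nutch code: the class name FetchLimitSketch and the methods fetchOrSkip and fetchTruncated are hypothetical names I made up for illustration, and the "page" is just an in-memory stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: hypothetical names, not part of Nutch.
public class FetchLimitSketch {
    public static final int LIMIT = 100 * 1024; // 100 KB

    // Policy B: keep the first `limit` bytes and ignore the rest.
    public static byte[] fetchTruncated(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int remaining = limit;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // end of stream reached before the limit
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    // Policy A: drop the whole page if it is larger than `limit`.
    // Reads one byte past the limit to detect oversized pages; returns null when skipped.
    public static byte[] fetchOrSkip(InputStream in, int limit) throws IOException {
        byte[] head = fetchTruncated(in, limit + 1);
        return head.length > limit ? null : head;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = new byte[150 * 1024]; // stand-in for a 150 KB page
        byte[] a = fetchOrSkip(new ByteArrayInputStream(page), LIMIT);
        byte[] b = fetchTruncated(new ByteArrayInputStream(page), LIMIT);
        System.out.println("skip policy kept:     " + (a == null ? 0 : a.length) + " bytes");
        System.out.println("truncate policy kept: " + b.length + " bytes");
    }
}
```

With the skip policy a 150 KB page contributes nothing to the index, while the truncate policy still keeps its first 100 KB. The open problem stays the same: the truncated bytes may not be a well-formed document for the parser.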
Any comments?
Stefan
---------------------------------------------------------------
enterprise information technology consulting
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org
