On Sat, Jun 12, 2004 at 09:35:54PM +0200, Stefan Groschupf wrote:
> Hi,
>
> i found something interesting that can, from a long-term view, improve
> the nutch results very much, from my understanding.
> I heard in a talk that google takes the _first_ 100kb of a page. As far

Maybe someone can plant a few files of various sizes in his/her site, to
see how google fetches.

> as i know nutch downloads only pages that are <= 100kb.
> That is a big difference!

I believe that, by default, nutch truncates files at 64*1024 bytes for
http:// resources, but this is configurable. That should be similar to
what google is doing?

>
> As far as i know, from a linguistic point of view, most of the
> information is at the beginning of a text.

Is there any evidence to support this? Does it refer to html pages?

> As far as i know, navigation links are at the top of the page as well.
>
> Changing that wouldn't be easy, since most content parsers need the
> complete 'file' to process it correctly.

This is true for many existing (open source) content parsers, but it
should be possible to write ones that can handle incomplete content.
However, I'm not sure it would be worth doing.

> Any comments?

Shouldn't page size be an operational issue? Unless nutch chooses to
become an operating entity.

John
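P.S. For anyone who wants to experiment, the truncation limit mentioned
above should be adjustable via a property override in nutch's
configuration. A minimal sketch, assuming the standard
`http.content.limit` property and its 65536-byte default; verify the
exact name and default against the nutch-default.xml shipped with your
version:

```xml
<!-- nutch-site.xml: override the default fetch truncation limit.
     Assumption: http.content.limit defaults to 65536 bytes (64*1024),
     and a value of -1 disables truncation entirely. Check your
     nutch-default.xml before relying on this. -->
<property>
  <name>http.content.limit</name>
  <!-- ~100kb, matching the limit reportedly used by google -->
  <value>102400</value>
</property>
```

With that in place, fetched pages would be cut at ~100kb instead of
64kb, which would make a side-by-side comparison against google's
behavior more meaningful.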
