Hi,

After taking some time to look into the Nutch source code (v0.7.1), I noticed that the current file format for storing page content may not be very efficient.
If I understand correctly, to retrieve the content of a page with a given docID, say 20, the code checks the "index" file first; since the default indexInterval is 128, the lookup starts from entry 0 and then loops 20 times, reading and comparing the header data on each iteration. IMO, this way of loading the content is not very efficient. (A rough sketch of how I read this code path is at the end of this mail.)

Here are my questions and suggestions:

* Why not store each page in its own file, named after its docID, with each file compressed with gzip? This would create lots of small files, but loading them would be faster.

* If the pages do need to be appended to one file, would it be better to use the same size (6k, for example) for each page? If a page is larger than 6k, the rest of its content would be stored in another file. Searching would then be much faster, since the offset of a page can be computed directly from its docID (see the second sketch at the end of this mail).

Thanks.

Regards,
Tom
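P.S. To make the first point concrete, here is a rough sketch of the seek-then-scan lookup as I read it. The class name and the record layout below (a long docID followed by an int length per record) are just my own illustration, not the actual Nutch 0.7.1 classes or on-disk format:

import java.io.IOException;
import java.io.RandomAccessFile;

class SeekThenScanSketch {
    static final int INDEX_INTERVAL = 128;   // default indexInterval

    // index[i] holds the byte offset in the data file of record i * INDEX_INTERVAL
    long[] index;

    byte[] getContent(long docID, RandomAccessFile data) throws IOException {
        // 1. Jump to the nearest indexed record at or before docID.
        int slot = (int) (docID / INDEX_INTERVAL);
        data.seek(index[slot]);

        // 2. Scan forward record by record, reading and comparing each header,
        //    until docID is found (up to INDEX_INTERVAL - 1 extra reads).
        while (true) {
            long key = data.readLong();      // record header: the docID
            int length = data.readInt();     // record header: content length
            if (key == docID) {
                byte[] content = new byte[length];
                data.readFully(content);
                return content;
            }
            data.skipBytes(length);          // not the record we want, skip its body
        }
    }
}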
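And here is a sketch of the fixed-size-record idea from the second point. The 6k slot size, the 4-byte length prefix, and the overflow handling are just example choices:

import java.io.IOException;
import java.io.RandomAccessFile;

class FixedSizeRecordSketch {
    static final int RECORD_SIZE = 6 * 1024;   // e.g. 6k per page

    byte[] getContent(long docID, RandomAccessFile data) throws IOException {
        data.seek(docID * RECORD_SIZE);        // direct jump, no header scanning
        int length = data.readInt();           // actual content length of this page
        byte[] content = new byte[Math.min(length, RECORD_SIZE - 4)];
        data.readFully(content);
        // Pages larger than the slot would continue in an overflow file; that
        // part is left out of this sketch.
        return content;
    }
}

The trade-off is some wasted space for small pages and an extra read for large ones, but the lookup itself becomes a single seek.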
