Hi,

After taking some time to look into the Nutch source code (v0.7.1), I noticed that the current file format for storing page content may not be very efficient.
If I understand correctly, to retrieve the content of a page with a given docID, say 20, the code checks the "index" file first; since the default indexInterval is 128, the lookup starts from entry 0 and then loops 20 times, reading and comparing the header data on each iteration. IMO, this way of loading the content is not very efficient. (A rough sketch of how I read this code path is at the end of this mail.)

Here are my questions and suggestions:

* Why not store each page in its own file, named after its docID, with each file compressed with gzip? This would create lots of small files, but loading them would be faster.

* If the pages do need to be appended to one file, would it be better to use the same size (6k, for example) for each page? If a page is larger than 6k, the rest of its content would be stored in another file. Searching would then be much faster, since the offset of a page can be computed directly from its docID (see the second sketch at the end of this mail).

Thanks.

Regards,
Tom
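P.S. To make the first point concrete, here is a rough sketch of the seek-then-scan lookup as I read it. The class name and the record layout below (a long docID followed by an int length per record) are just my own illustration, not the actual Nutch 0.7.1 classes or on-disk format:

import java.io.IOException;
import java.io.RandomAccessFile;

class SeekThenScanSketch {
    static final int INDEX_INTERVAL = 128;   // default indexInterval

    // index[i] holds the byte offset in the data file of record i * INDEX_INTERVAL
    long[] index;

    byte[] getContent(long docID, RandomAccessFile data) throws IOException {
        // 1. Jump to the nearest indexed record at or before docID.
        int slot = (int) (docID / INDEX_INTERVAL);
        data.seek(index[slot]);

        // 2. Scan forward record by record, reading and comparing each header,
        //    until docID is found (up to INDEX_INTERVAL - 1 extra reads).
        while (true) {
            long key = data.readLong();      // record header: the docID
            int length = data.readInt();     // record header: content length
            if (key == docID) {
                byte[] content = new byte[length];
                data.readFully(content);
                return content;
            }
            data.skipBytes(length);          // not the record we want, skip its body
        }
    }
}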
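And here is a sketch of the fixed-size-record idea from the second point. The 6k slot size, the 4-byte length prefix, and the overflow handling are just example choices:

import java.io.IOException;
import java.io.RandomAccessFile;

class FixedSizeRecordSketch {
    static final int RECORD_SIZE = 6 * 1024;   // e.g. 6k per page

    byte[] getContent(long docID, RandomAccessFile data) throws IOException {
        data.seek(docID * RECORD_SIZE);        // direct jump, no header scanning
        int length = data.readInt();           // actual content length of this page
        byte[] content = new byte[Math.min(length, RECORD_SIZE - 4)];
        data.readFully(content);
        // Pages larger than the slot would continue in an overflow file; that
        // part is left out of this sketch.
        return content;
    }
}

The trade-off is some wasted space for small pages and an extra read for large ones, but the lookup itself becomes a single seek.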
