Tutorial on incremental crawling

Julien Nioche Sun, 27 Mar 2011 05:35:00 -0700

Gabriele,

I think it is a good idea to have a script like this; however, your proposal
could be improved. It currently works only on a single machine and uses
local commands such as mv and ls, which won't work on a pseudo-distributed or
fully distributed cluster. You should use the 'hadoop fs' commands instead.
If I understand the script correctly, you then merge different crawldbs. Why
do you do that? There should be one crawldb per crawl, so I don't think this
is at all necessary.
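
To illustrate the 'hadoop fs' point, here is a minimal sketch of how the
local filesystem calls could be wrapped. The helper names (fs_ls, fs_mv,
fs_exists) and the HADOOP_CMD variable are hypothetical and not taken from
your script; only the 'hadoop fs -ls', '-mv' and '-test -e' subcommands are
assumed to be available.

```shell
#!/bin/sh
# Hypothetical helpers: route filesystem operations through 'hadoop fs'
# so the same script can run in local, pseudo-distributed or fully
# distributed mode. HADOOP_CMD is an assumed variable; override it
# (e.g. HADOOP_CMD="echo hadoop") to print commands instead of running them.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# List a directory (instead of: ls "$1")
fs_ls() {
    $HADOOP_CMD fs -ls "$1"
}

# Move or rename a path (instead of: mv "$1" "$2")
fs_mv() {
    $HADOOP_CMD fs -mv "$1" "$2"
}

# Succeed if the path exists (instead of: [ -e "$1" ])
fs_exists() {
    $HADOOP_CMD fs -test -e "$1"
}
```

With HADOOP_CMD="echo hadoop", a call such as 'fs_mv crawl/old crawl/new'
only prints the command it would run, which is a cheap way to check the
script before pointing it at a real cluster.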


Having a script would definitely be a plus for beginners and would give more
flexibility than the crawl command.

Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling
