>> I think it is a good idea to have a script like this, however your proposal
>> could be improved. It currently works only on a single machine and uses
>> commands such as mv, ls etc. which won't work on a pseudo or fully
>> distributed cluster. You should use the 'hadoop fs' commands instead.
>
> Okay, let's go for 3 editions:
> 1. one that's abridged and works only with Solr (tersest script)
> 2. unabridged with local fs - for beginners
> 3. hadoop unabridged
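To illustrate the point about 'hadoop fs': the local-fs commands map directly onto `hadoop fs` equivalents. A minimal sketch (the paths are hypothetical, and the `run` dry-run wrapper is mine, not part of any proposed script):

```shell
#!/bin/sh
# Sketch: local-fs commands replaced by their 'hadoop fs' equivalents.
# With DRY_RUN set, the script only prints the commands; unset it to run
# them against whatever file system Hadoop is configured with (local or
# distributed) - the client code does not change either way.
DRY_RUN=1

run() {
  if [ -n "$DRY_RUN" ]; then
    echo "$@"    # show what would be executed
  else
    "$@"         # execute for real
  fi
}

# ls crawl/segments          ->  hadoop fs -ls crawl/segments
run hadoop fs -ls crawl/segments
# mv crawl/crawldb <backup>  ->  hadoop fs -mv crawl/crawldb <backup>
run hadoop fs -mv crawl/crawldb crawl/crawldb.bak
```

This is exactly the transparency argument below: the same script works on a laptop or a cluster, depending only on the Hadoop configuration.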
You don't need to have 2 *and* 3. The hadoop commands will work on the
local fs in a completely transparent way; it all depends on the way hadoop
is configured. It isolates the way data are stored (local or distributed)
from the client code, i.e. Nutch. By adding a separate script using fs,
you'd add more confusion and lead beginners to think that they HAVE to use
fs. As for legacy-Lucene vs. Solr, what about having a parameter to
determine which one should be used and have a single script?

>> If I understand the script correctly, you then merge different crawldbs.
>> Why do you do that? There should be one crawldb per crawl so I don't
>> think this is at all necessary.
>
> So that I get a single dump with info about all the urls crawled. On the
> scale of the web this is probably a bad idea, isn't it?

It would be a bad idea even on a medium scale. That sort of works on a
single machine, but as soon as you'd get a bit of data you'd fill the
space on the disks and the whole thing would take ages. However, the point
still stands that there should be only one crawldb per crawl, and it
contains all the urls you've injected / discovered.

> But then how else could you inspect all the crawled urls at once?

Why do you want to get the info about ALL the urls? There is a readdb
-stats command which gives a summary of the content of the crawldb. If you
need to check a particular URL or domain, just use readdb -url and readdb
-regex (or whatever the name of the param is).

>> Having a script would definitely be a plus for beginners and would give
>> more flexibility than the crawl command.
>
> I ask as the first of beginners. Crawl is not recommended for whole-web
> crawling, I guess because it doesn't work incrementally. Why not add such
> an option to crawl? Shall I file a feature request / patch for that?

IMHO I'd rather have a good and reliable script to replace the Crawl
command.
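The readdb invocations mentioned above would look roughly like this. Treat it as a sketch: the crawldb path is illustrative, and (as noted in the thread) the exact spelling of the regex flag varies across Nutch versions. `NUTCH=echo` makes it a dry run; point NUTCH at bin/nutch to execute for real:

```shell
#!/bin/sh
# Sketch: inspecting the crawldb without dumping every URL.
# NUTCH=echo just prints the commands instead of running them.
NUTCH="${NUTCH:-echo}"

# Summary of the whole crawldb: url counts by status, score stats, etc.
$NUTCH readdb crawl/crawldb -stats

# Details for one specific URL.
$NUTCH readdb crawl/crawldb -url http://example.com/

# Dump only the entries matching a pattern (flag name is a guess here,
# check your Nutch version's readdb usage message).
$NUTCH readdb crawl/crawldb -dump dump-dir -regex '.*example.*'
```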
It does not help people understand the underlying steps; having the script
would make it easier to recover when there is a failure; and there are
other issues with it, e.g. the runaway parsing threads which are kept in
the VM. I am sure the all-in-one Crawl command helped many a user, but the
script would do just as well.

Thanks for your contribution BTW

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
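For reference, the underlying steps such a replacement script would expose are the standard inject / generate / fetch / parse / updatedb cycle. A dry-run sketch (segment naming, depth, and paths are illustrative; the real generate step produces a timestamped segment directory):

```shell
#!/bin/sh
# Sketch of the cycle the all-in-one Crawl command performs, as separate
# steps so a failed round can be rerun individually.
# NUTCH=echo prints the commands; set NUTCH=bin/nutch to run them.
NUTCH="${NUTCH:-echo}"
CRAWLDB=crawl/crawldb
DEPTH=3

# Seed the single crawldb with the start urls.
$NUTCH inject $CRAWLDB urls

i=1
while [ "$i" -le "$DEPTH" ]; do
  SEGMENT=crawl/segments/seg$i   # placeholder; really a timestamped dir
  $NUTCH generate $CRAWLDB crawl/segments
  $NUTCH fetch "$SEGMENT"
  $NUTCH parse "$SEGMENT"
  # Fold the newly discovered urls back into the one crawldb.
  $NUTCH updatedb $CRAWLDB "$SEGMENT"
  i=$((i + 1))
done
```

Because each step is a separate command, a failure mid-crawl leaves the earlier steps' output intact, which is the recoverability argument made above.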

