On Sun, Mar 27, 2011 at 7:03 PM, Julien Nioche <[email protected]> wrote:
>>> I think it is a good idea to have a script like this, however your proposal
>>> could be improved. It currently works only on a single machine and uses
>>> commands such as mv, ls etc... which won't work on a pseudo or fully
>>> distributed cluster. You should use the 'hadoop fs' commands instead.
>>
>> Okay, let's go for 3 editions:
>> 1. abridged, works only with solr (tersest script)
>> 2. unabridged with local fs - for beginners
>> 3. hadoop unabridged
>
> you don't need to have 2 *and* 3. The hadoop commands will work on the
> local fs in a completely transparent way, it all depends on the way hadoop
> is configured. It isolates the way data are stored (local or distrib) from
> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
> more confusion and lead beginners to think that they HAVE to use fs.

I apologize for not having looked into hadoop in detail yet, but I had understood that it would abstract over the single-machine fs. However, to get up and running right after downloading nutch, will the script just work or will I have to configure hadoop first? I assume the latter. From a beginner's perspective I like to reduce the magic (at first), see through the commands, and get up and running asap. Hence 2. I'll be using 3. (There is a rough sketch of the 'hadoop fs' equivalents I have in mind further down in this mail.)

> As for the legacy-lucene vs SOLR what about having a parameter to determine
> which one should be used and have a single script?

Excellent idea. The default is solr for 1 and 3, but if one passes the parameter 'll' it will use the legacy lucene indexer. The default for 2 is ll, since we want to get up and running fast (before knowing what solr is and setting it up). (See the sketch of such a switch further down.)

>>> If I understand the script correctly, you then merge different crawldbs.
>>> Why do you do that? There should be one crawldb per crawl so I don't think
>>> this is at all necessary.
>>
>> So that I get a single dump with info about all the urls crawled. On the
>> scale of the web this is probably a bad idea, isn't it?
>
> it would be a bad idea even on a medium scale. That sort of works on a
> single machine but as soon as you'd get a bit of data you'd fill the space
> on the disks + the whole thing would take ages.
>
> However the point still is that there should be only one crawldb per crawl
> and it contains all the urls you've injected / discovered
>
>> But then how else could you inspect all the crawled urls at once?
>
> Why do you want to get the info about ALL the urls? There is a readdb
> -stats command which gives a summary of the content of the crawldb. If you
> need to check a particular URL or domain, just use readdb -url and readdb
> -regex (or whatever the name of the param is)

At least when debugging/troubleshooting I found it useful to see which urls were fetched and what the responses were (robot_blocked, etc.). I can do that by examining each $it_crawldb in turn, since I don't know in which iteration a given url was fetched (although since the fetching is pretty linear I could also work it out, something like the url's index in seeds/urls divided by $it_size). Also, I don't know whether hadoop stores a single file on a single machine. My expectation/hope was that the underlying fs loads into memory only the portions of text I'm viewing (I've seen that around), and I don't know how it would handle ctrl+F (maybe with some index). And if it runs into disk space issues, it splits the file up across several machines, transparently.
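For the record, this is roughly how I plan to use readdb for that kind of inspection; the crawl/crawldb path and the url are only examples:

  bin/nutch readdb crawl/crawldb -stats                      # summary of the whole crawldb
  bin/nutch readdb crawl/crawldb -url http://example.com/    # status of one particular url
  bin/nutch readdb crawl/crawldb -dump crawldb_dump          # full text dump, only sensible for small crawls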
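Regarding the 'hadoop fs' commands, just to check I've understood the suggestion, the script would replace the plain shell commands with something along these lines (the paths are only placeholders):

  hadoop fs -ls crawl/segments                     # instead of: ls crawl/segments
  hadoop fs -mv crawl/crawldb crawl/crawldb.old    # instead of: mv crawl/crawldb crawl/crawldb.old
  hadoop fs -rmr crawl/crawldb.old                 # instead of: rm -r crawl/crawldb.old

so the same script should work unchanged whether hadoop is configured for the local fs or for a real cluster.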
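And for the solr vs legacy-lucene parameter, I was thinking of something as simple as this at the indexing step of the script; the 'll' flag, the solr url and the paths are just placeholders of mine, not settled names:

  # 'll' selects the legacy lucene indexer, anything else (or no argument) defaults to solr
  INDEXER="${1:-solr}"
  if [ "$INDEXER" = "ll" ]; then
      bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  else
      bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
  fi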
>>> Having a script would definitely be a plus for beginners and would give
>>> more flexibility than the crawl command.
>>
>> I'd be the first of those beginners. Crawl is not recommended for whole-web
>> crawling, I guess because it doesn't work incrementally. Why not add such an
>> option to crawl? Shall I file a feature request / patch for that?
>
> IMHO I'd rather have a good and reliable script to replace the Crawl
> command.
>
> Thanks for your contribution BTW

You're welcome. I like Apache's stuff, and thank you for saving me the trouble of re-implementing a search engine atop much more limited frameworks!

> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

