OK, I've hadoopized the script, though I've only tried it locally. On second thought (laziness convinced me), I decided not to include the indexer parameter.
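For concreteness, here is roughly what "hadoopized" means; the helper names and paths below are made up for illustration, not taken from the actual script. With Hadoop's default configuration the `hadoop fs` commands operate on the local filesystem, so the script behaves the same either way:

```shell
# Hypothetical helpers (not from the real script): use hadoop fs when it
# is on the PATH, otherwise fall back to the plain local commands. With
# the default Hadoop config both branches act on the local filesystem.

fs_mv() {
  if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -mv "$1" "$2"     # transparent: local fs or HDFS
  else
    mv "$1" "$2"                # plain local equivalent
  fi
}

fs_rm() {
  if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -rmr "$1"         # -rmr: recursive delete in 0.20-era Hadoop CLIs
  else
    rm -r "$1"
  fi
}
```

The point Julien makes below is exactly this: the client code doesn't need to know whether the data is local or distributed.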
On Mon, Mar 28, 2011 at 10:50 AM, Gabriele Kahlout <[email protected]> wrote:
>
> On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <[email protected]> wrote:
>> Hi Gabriele
>>
>>>> you don't need to have 2 *and* 3. The hadoop commands will work on the
>>>> local fs in a completely transparent way; it all depends on the way
>>>> hadoop is configured. It isolates the way data are stored (local or
>>>> distributed) from the client code, i.e. Nutch. By adding a separate
>>>> script using fs, you'd add more confusion and lead beginners to think
>>>> that they HAVE to use fs.
>>>
>>> I apologize for not having yet looked into Hadoop in detail, but I had
>>> understood that it would abstract over the single-machine fs.
>>
>> No problem. It would be worth spending a bit of time reading about Hadoop
>> if you want to get a better understanding of Nutch. Tom White's book is
>> an excellent reference, but the wikis and tutorials would be a good start.
>>
>>> However, to get up and running after downloading Nutch, will the script
>>> just work or will I have to configure Hadoop? I assume the latter.
>>
>> Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
>> for getting its inputs, so when you run it as you did, what actually
>> happens is that you are getting the data from the local FS via Hadoop.
>
> I'll look into it and update the script accordingly.
>
>>> From a beginner's perspective I like to reduce the magic (at first),
>>> see through the commands, and get up and running asap.
>>> Hence 2. I'll be using 3.
>>
>> Hadoop already reduces the magic for you :-)
>
> Okay, if so I'll put the equivalent Unix commands (mv/rm) in the comments
> of the hadoop commands and get rid of 2.
>
>>>> As for legacy Lucene vs. Solr, what about having a parameter to
>>>> determine which one should be used, and have a single script?
>>>
>>> Excellent idea.
>>> The default is solr for 1 and 3, but if one passes the parameter 'll'
>>> it will use legacy Lucene. The default for 2 is ll, since we want to
>>> get up and running fast (before knowing what Solr is and setting it up).
>>
>> It would be nice to have a third possible value (i.e. none) for the
>> -indexer parameter (besides solr and lucene). A lot of people use Nutch
>> as a crawling platform but do not do any indexing.
>
> Agreed. Will add that too.
>
>>>> Why do you want to get the info about ALL the urls? There is a readdb
>>>> -stats command which gives a summary of the content of the crawldb. If
>>>> you need to check a particular URL or domain, just use readdb -url and
>>>> readdb -regex (or whatever the name of the param is).
>>>
>>> At least when debugging/troubleshooting I found it useful to see which
>>> urls were fetched and the responses (robot_blocked, etc.).
>>> I can do that by examining each $it_crawldb in turn, since I don't know
>>> when that url was fetched (although since the fetching is pretty linear
>>> I could also find out, with something like index in seeds/urls /
>>> $it_size).
>>
>> Better to do that by looking at the content of the segments using 'nutch
>> readseg -dump' or using 'hadoop fs -libjars nutch.job
>> segment/SEGMENTNUM/crawl_data', for instance. That's probably not
>> something that most people will want to do, so maybe comment it out in
>> your script?
>>
>> Running Hadoop in pseudo-distributed mode and looking at the Hadoop web
>> GUIs (http://localhost:50030) gives you a lot of information about your
>> crawl.
>>
>> It would definitely be better to have a single crawldb in your script.
>
> Agreed; maybe again an option, with none as the default. But keep every
> $it_crawldb instead of deleting and merging them.
> I should be looking into the necessary Hadoop today and start updating
> the script accordingly.
>> Julien
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the
> email does not contain a valid code then the email is not received. A
> valid code starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧
> y ∈ L(-[a-z]+[0-9]X)).
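P.P.S. A rough sketch of the -indexer parameter handling agreed above (values solr/ll/none). The function name and the echoed messages are illustrative only; the real script would invoke the corresponding bin/nutch commands instead of echoing:

```shell
# Sketch only: dispatch on the agreed -indexer parameter. The solr URL
# and nutch invocations in the comments are assumptions, not taken from
# the actual script.
run_indexer() {
  case "$1" in
    solr)
      # e.g. bin/nutch solrindex http://localhost:8983/solr crawldb segments/*
      echo "index with solr" ;;
    lucene|ll)
      # e.g. bin/nutch index ...   (legacy Lucene indexer)
      echo "index with legacy lucene" ;;
    none)
      # crawling-only use case Julien mentions: skip indexing entirely
      echo "skip indexing" ;;
    *)
      echo "unknown indexer: $1" >&2; return 1 ;;
  esac
}
```

The default would be solr (ll for option 2, per the thread), with none available for people who use Nutch purely as a crawling platform.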

