On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <[email protected]> wrote:

> Hi Gabriele
>
>>> you don't need to have 2 *and* 3. The hadoop commands will work on the
>>> local fs in a completely transparent way, it all depends on the way hadoop
>>> is configured. It isolates the way data are stored (local or distrib) from
>>> the client code i.e Nutch. By adding a separate script using fs, you'd add
>>> more confusion and lead beginners to think that they HAVE to use fs.
>>
>> I apologize for not having yet looked into hadoop in detail but I had
>> understood that it would abstract over the single machine fs.
>
> No problems. It would be worth spending a bit of time reading about Hadoop
> if you want to get a better understanding of Nutch. Tom White's book is an
> excellent reference but the wikis and tutorials would be a good start
>
>> However, to get up and running after downloading nutch will the script
>> just work or will I have to configure hadoop? I assume the latter.
>
> Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
> for getting its inputs, so when you run it as you did what actually happens
> is that you are getting the data from the local FS via Hadoop.

I'll look into it and update the script accordingly.

>> From a beginner prospective I like to reduce the magic (at first) and see
>> through the commands, and get up and running asap.
>> Hence 2. I'll be using 3.
>
> Hadoop already reduces the magic for you :-)

Okay, if so I'll put the equivalent unix commands (mv/rm) in the comment of the hadoop cmds and get rid of 2.

>>> As for the legacy-lucene vs SOLR what about having a parameter to
>>> determine which one should be used and have a single script?
>>
>> Excellent idea. The default is solr for 1 and 3, but one passes parameter
>> 'll' it will use the legacy lucene. The default for 2 is ll since we want to
>> get up and running fast (before knowing what solr is and set it up).
>
> It would be nice to have a third possible value (i.e. none) for the
> parameter -indexer (besides solr and lucene). A lot of people use Nutch as a
> crawling platform but do not do any indexing

agreed. Will add that too.

>>> Why do you want to get the info about ALL the urls? There is a readdb
>>> -stats command which gives an summary of the content of the crawldb. If you
>>> need to check a particular URL or domain, just use readdb -url and readdb
>>> -regex (or whatever the name of the param is)
>>
>> At least when debugging/troubleshooting I found it useful to see which
>> urls were fetched and the responses (robot_blocked, etc..).
>> I can do that examining each $it_crawlddb in turn, since i don't know when
>> that url was fetched (although since the fetching is pretty linear I could
>> also find out, sth. like index in seeds/urls / $it_size.
>
> better to do that by looking at the content of the segments using 'nutch
> readseg -dump' or using 'hadoop fs -libjars nutch.job
> segment/SEGMENTNUM/crawl_data' for instance. That's probably not something
> that most people will want to do so maybe comment it out in your script?
>
> running hadoop in peudo distributed mode and looking at the hadoop web guis
> (http://localhost:50030) gives you a lot of information about your
> crawl
>
> It would definitely be better to have a single crawldb in your script.

agreed, maybe again an option and the default is none. But keep every $it_crawldb instead of deleting and merging them.

I should be looking into the necessary Hadoop today and start updating the script accordingly.

> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X". ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

