On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>
>> This is one example crawled segment:
>>
>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>
>> As you see, just one part-NNNN file is generated... in the conf file
>> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
>> suggested in previous emails).
>
> First of all - for a 7 node cluster the mapred.map.tasks should be set at
> least to something around 23 or 31 or even higher, and the number of
> reduce tasks to e.g. 11.
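
If I read that right, it would translate into something like the following
properties (just a sketch using the numbers you suggest; which file they
belong in is discussed further down):

  <property>
    <name>mapred.map.tasks</name>
    <value>31</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>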
I see, now it makes more sense to me than just assigning 2 maps by default as
suggested before. Then, according to
http://wiki.apache.org/hadoop/HowManyMapsAndReduces :

Maps:

Given:
- 64MB DFS blocks
- 500MB RAM per node
- 500MB in the hadoop-env.sh HEAPSIZE variable (otherwise Java heap space
  exceptions occur)

31 maps... we'll see if it works. It would be cool to have a more precise
"formula" to calculate this number in the Nutch case. I assume the "around 23
or 31 or even higher" figure is empirically determined on your side: thanks
for sharing your knowledge!

Reduces:

1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) = 135

Is this number the total number of reduce tasks running across all the
cluster nodes?

> Secondly - you should not put this property in nutch-site.xml, instead it
> should be put in mapred-default.xml or hadoop-site.xml. I lost track of
> which version of Nutch / Hadoop you are using ... if it's Hadoop 0.12.x,
> then you need to be careful about where you put mapred.map.tasks, and it
> has to be placed in mapred-default.xml. If it's a more recent Hadoop
> version then you can put these values in hadoop-site.xml.

My fault! I actually meant hadoop-site.xml... besides, mapred-default.xml is
ignored by Hadoop in my case: I'm using Hadoop 0.17.1, included in the latest
Nutch trunk as of now.

> And finally - what is the distribution of urls in your seed list among
> unique hosts? I.e. how many urls come from a single host? Guessing from
> the path above - if you are trying to do a DMOZ crawl, then the
> distribution should be ok. I've done a DMOZ crawl a month ago, using the
> then current trunk/ and all was working well.

I've made the following Ruby snippet to get an idea of the distribution of
the input URL list (perhaps it is not a paragon of correctness and accuracy,
but I think it more or less shows what we're looking for):

#!/usr/bin/ruby
# invert_urls.rb
require 'pp'

dist = {}
STDIN.readlines.each do |url|
  url = url.strip[7..-1]   # strip "http://"
  url = url.split(".")     # array context for "proper" reverse
  url.reverse!
  dist.merge!({url[0] + '.' + url[1] => ""})  # hash context, discarding duplicate urls (just till 1st level)
end
pp dist

[EMAIL PROTECTED]:~/bin$ wc -l urls.txt
2500001 urls.txt
[EMAIL PROTECTED]:~/bin$ ./invert_urls.rb < urls.txt > unique
[EMAIL PROTECTED]:~/bin$ wc -l unique
1706762 unique   <---- coming from *different* hosts

So there are roughly 790000 (2500001 - 1706762) "repeated" urls... about 32%
of the sample.

[EMAIL PROTECTED]:~/bin$ head urls.txt
http://business-card-flyer-free-post.qoxa.info
http://www.download-art.com
http://catcenter.co.uk
http://761.hidbi.info
http://seemovie.movblogs.com
http://clearwaterart.com
http://www.travel-insurance-guide.org
http://www.pc-notdienst.at
http://projec-txt.cn
http://www.yoraispage.com

[EMAIL PROTECTED]:~/bin$ head unique
{"de.tsv-nellmersbach"=>"",
 "cn.color-that-pokemon"=>"",
 "com.bluestar-studio"=>"",
 "it.vaisardegna"=>"",
 "com.bramalearangersclub"=>"",
 "org.fpc-hou"=>"",
 "com.warhotel"=>"",
 "com.tokayblue"=>"",
 "be.wgreekwaves"=>"",
 "org.fairhopelibrary"=>"",

Comparing with the DMOZ sample:

[EMAIL PROTECTED]:~/bin$ ./invert_urls.rb < random-dmoz-20080806.txt > unique
[EMAIL PROTECTED]:~/bin$ wc -l random-dmoz-20080806.txt
908 random-dmoz-20080806.txt
[EMAIL PROTECTED]:~/bin$ wc -l unique
788 unique

...13% "repeated" urls.

In conclusion, as you predicted (and if the script is not horribly broken),
the non-DMOZ sample is quite homogeneous (there are lots of URLs coming from
auto-generated ad sites, for instance).
Adding the fact that *a lot* of them lead to "Unknown host" exceptions, the
crawl ends up being extremely slow. But that still does not explain why only
a few nodes are actually fetching in the DMOZ-based crawl. So the next thing
to try is to raise mapred.map.tasks as you suggested, which should fix my
issues... I hope so :/

Thanks !

> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
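
P.S. In case anyone wants to reuse the snippet: a slightly more defensive
variant using Ruby's URI library (the file name invert_urls2.rb is just for
illustration, it is not the script used for the numbers above, and it still
keeps only the last two labels of the host, so e.g. *.co.uk domains get
collapsed just like in the original):

#!/usr/bin/ruby
# invert_urls2.rb - illustrative sketch, not the script used above
require 'uri'
require 'pp'

dist = {}
STDIN.each_line do |line|
  line = line.strip
  next if line.empty?
  begin
    host = URI.parse(line).host || line   # fall back to the raw line if there is no scheme
  rescue URI::InvalidURIError
    next                                  # skip lines that do not parse as a URI
  end
  labels = host.split(".")
  key = labels.last(2).reverse.join(".")  # e.g. "www.download-art.com" -> "com.download-art"
  dist[key] = ""                          # the hash discards duplicate hosts
end
pp dist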
