On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>
>> This is one example crawled segment:
>>
>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>
>> As you see, just one part-NNNN file is generated... in the conf file
>> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
>> suggested in previous emails).
>
> First of all - for a 7 node cluster the mapred.map.tasks should be set at
> least to something around 23 or 31 or even higher, and the number of reduce
> tasks to e.g. 11.



I see, that makes more sense to me than just assigning 2 maps by
default, as suggested before... Then, according to:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

Maps:

Given:
64MB DFS blocks
500MB RAM per node
500MB for the hadoop-env.sh HADOOP_HEAPSIZE variable (otherwise "Java
heap space" OutOfMemoryErrors occur)

31 maps... we'll see if that works. It would be nice to have a more
precise "formula" to calculate this number in the Nutch case. I assume
that "23 or 31 or even higher" was determined empirically on your side:
thanks for sharing your knowledge!

Reduces:
1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) = ceil(134.75) = 135

Is this number the total number of reduces running across all the cluster nodes?




> Secondly - you should not put this property in nutch-site.xml, instead it
> should be put in mapred-default.xml or hadoop-site.xml. I lost track of
> which version of Nutch / Hadoop you are using ... if it's Hadoop 0.12.x,
> then you need to be careful about where you put mapred.map.tasks, and it has
> to be placed in mapred-default.xml. If it's a more recent Hadoop version
> then you can put these values in hadoop-site.xml.



My fault! I actually meant hadoop-site.xml... besides,
mapred-default.xml is ignored by Hadoop in my case:

I'm using Hadoop 0.17.1, as included in the latest Nutch trunk as of now.
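
For the record, here is roughly what I plan to put in hadoop-site.xml (just
a sketch with the values discussed above: 31 maps, and the 11 reduces you
suggested rather than the 135 from the wiki formula, which is exactly my
open question):

<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>31</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>
</configuration>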



> And finally - what is the distribution of urls in your seed list among
> unique hosts? I.e. how many urls come from a single host? Guessing from the
> path above - if you are trying to do a DMOZ crawl, then the distribution
> should be ok. I've done a DMOZ crawl a month ago, using the then current
> trunk/ and all was working well.



I've written the following Ruby snippet to get an idea of the
distribution of the input url list (it's hardly a paragon of
correctness and accuracy, but I think it more or less shows what we're
looking for):

#!/usr/bin/ruby
# invert_urls.rb
require 'pp'

dist = {}

STDIN.readlines.each do |url|
  url = url.strip[7..-1]  # strip the leading "http://"
  url = url.split('.')    # split into components so they can be reversed
  url.reverse!

  # keep only "tld.domain", discarding duplicate hosts (hash keys are unique)
  dist[url[0] + '.' + url[1]] = ''
end

pp dist

[EMAIL PROTECTED]:~/bin$ wc -l urls.txt
2500001 urls.txt

[EMAIL PROTECTED]:~/bin$ ./invert_urls.rb < urls.txt > unique

[EMAIL PROTECTED]:~/bin$ wc -l unique
1706762 unique <---- coming from *different* hosts


So there are roughly 790000 "repeated" urls (2500001 - 1706762 = 793239)...
about 32% of the sample

[EMAIL PROTECTED]:~/bin$ head urls.txt
http://business-card-flyer-free-post.qoxa.info
http://www.download-art.com
http://catcenter.co.uk
http://761.hidbi.info
http://seemovie.movblogs.com
http://clearwaterart.com
http://www.travel-insurance-guide.org
http://www.pc-notdienst.at
http://projec-txt.cn
http://www.yoraispage.com

[EMAIL PROTECTED]:~/bin$ head unique
{"de.tsv-nellmersbach"=>"",
 "cn.color-that-pokemon"=>"",
 "com.bluestar-studio"=>"",
 "it.vaisardegna"=>"",
 "com.bramalearangersclub"=>"",
 "org.fpc-hou"=>"",
 "com.warhotel"=>"",
 "com.tokayblue"=>"",
 "be.wgreekwaves"=>"",
 "org.fairhopelibrary"=>"",

Comparing with the DMOZ sample:

[EMAIL PROTECTED]:~/bin$ ./invert_urls.rb < random-dmoz-20080806.txt > unique
[EMAIL PROTECTED]:~/bin$ wc -l random-dmoz-20080806.txt
908 random-dmoz-20080806.txt
[EMAIL PROTECTED]:~/bin$ wc -l unique
788 unique

...13% repeated urls
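
If it helps, here is a small variant of the snippet (an untested sketch along
the same lines, script name made up) that counts how many urls each host
contributes instead of just collapsing duplicates, which should answer your
"how many urls come from a single host" question more directly:

#!/usr/bin/ruby
# count_urls_per_host.rb (hypothetical companion to invert_urls.rb)
require 'pp'

counts = Hash.new(0)  # default count of 0 per host

STDIN.readlines.each do |url|
  host  = url.strip.sub(%r{^https?://}, '')  # strip the scheme
  parts = host.split('.').reverse
  counts[parts[0] + '.' + parts[1]] += 1     # tally urls per "tld.domain"
end

# print the 20 hosts contributing the most urls
pp counts.sort_by { |host, n| -n }.first(20)

That should show at a glance whether a handful of hosts dominate the fetch
lists, which I suspect is part of what slows the fetch down.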

In conclusion, as you predicted (and assuming the script is not horribly
broken), the non-DMOZ sample is quite homogeneous (lots of the urls come
from auto-generated ad sites, for instance)... add to that the fact that
*a lot* of them lead to UnknownHostExceptions, and the crawl ends up being
extremely slow.

But that does not explain why only a few nodes are actually fetching on the
DMOZ-based crawl. So the next thing to try is raising mapred.map.tasks as
you suggested, which should fix my issues... I hope so :/

Thanks!




> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
