Hi,
I am running a few fetch cycles on Nutch 0.8, and I notice that the data
size is much smaller than what I got with version 0.7 (running the same
cycles at about the same time from different machines): about 5G after the
third cycle, starting with about 72000 URLs.
All the processes ended successfully and everything seems to be fine, but I
am afraid that I'm missing something.
Each cycle consists of:
fetch segments/..
updatedb crawldb segments/..
generate crawldb segments
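
To make the order of the steps explicit, here is a rough sketch of one cycle as a shell function. The `run` wrapper, the `DRY_RUN` switch and the `segments/SEGMENT_NAME` placeholder are made up for illustration only (they are not part of Nutch); with `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch of one fetch cycle as described above.
# Assumptions: bin/nutch is run from the Nutch directory, and the segment
# to fetch is passed in by hand (SEGMENT_NAME is a placeholder).
NUTCH="${NUTCH:-bin/nutch}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, do not run them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# $1 = crawl db, $2 = segments dir, $3 = segment fetched in this cycle
nutch_cycle() {
    run "$NUTCH" fetch "$3" &&
    run "$NUTCH" updatedb "$1" "$3" &&
    run "$NUTCH" generate "$1" "$2"
}

nutch_cycle crawldb segments segments/SEGMENT_NAME
```

The `&&` chain stops the cycle as soon as one step fails, so a failed fetch is not followed by updatedb on an incomplete segment.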
The configuration in nutch-site.xml is:
<property>
<name>fs.default.name</name>
<value>machine1:50000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>machine1:50020</value>
</property>
<property>
<name>ndfs.name.dir</name>
<value>/home/nutch_svn/nutch/trunk/ndfs/name</value>
</property>
<property>
<name>ndfs.data.dir</name>
<value>/home/nutch_svn/nutch/trunk/ndfs/data</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/nutch_svn/nutch/trunk/mapred/local</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/nutch_svn/nutch/trunk/mapred/system</value>
</property>
<property>
<name>mapred.temp.dir</name>
<value>/home/nutch_svn/nutch/trunk/mapred/temp</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>12</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>6</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>-1</value>
</property>
Thanks,
-Rafi
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers