Hi,

I am running a few fetch cycles on Nutch 0.8 and I notice that the data size is much smaller than the data size I got with version 0.7 (running the same cycles at about the same time from different machines): about 5G after the third cycle, starting with about 72000 URLs. All the processes ended successfully and everything seems to be fine, but I am afraid that I'm missing something.


Each cycle includes:
fetch segments/..
updatedb crawldb segments/..
generate crawldb segments
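
Written out as a script, one cycle looks roughly like the sketch below. It is only an illustration: the NUTCH, CRAWLDB and SEGMENTS variables and the two function names are made up for this example, and it assumes the segments directory can be listed with a plain ls on the local filesystem, which is not the case when the segments live in the distributed filesystem.

```shell
#!/bin/sh
# Sketch of one crawl cycle. NUTCH, CRAWLDB and SEGMENTS are assumed names,
# not something Nutch itself defines.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawldb}"
SEGMENTS="${SEGMENTS:-segments}"

# Print the newest segment directory. Segment names are timestamps, so the
# lexicographically last entry is the most recent one.
# Assumption: the segments are on the local filesystem.
latest_segment() {
    ls -d "$1"/* 2>/dev/null | sort | tail -n 1
}

# Run fetch, updatedb and generate in the same order as listed above,
# stopping at the first command that fails.
run_cycle() {
    seg=$(latest_segment "$SEGMENTS")
    if [ -z "$seg" ]; then
        echo "no segment found in $SEGMENTS" >&2
        return 1
    fi
    "$NUTCH" fetch "$seg" &&
    "$NUTCH" updatedb "$CRAWLDB" "$seg" &&
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS"
}
```

Calling run_cycle three times in a row would correspond to the three cycles mentioned above.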

The configuration in nutch-site.xml is:
<property>
 <name>fs.default.name</name>
 <value>machine1:50000</value>
</property>

<property>
 <name>mapred.job.tracker</name>
 <value>machine1:50020</value>
</property>

<property>
 <name>ndfs.name.dir</name>
 <value>/home/nutch_svn/nutch/trunk/ndfs/name</value>
</property>

<property>
 <name>ndfs.data.dir</name>
 <value>/home/nutch_svn/nutch/trunk/ndfs/data</value>
</property>

<property>
 <name>mapred.local.dir</name>
 <value>/home/nutch_svn/nutch/trunk/mapred/local</value>
</property>

<property>
 <name>mapred.system.dir</name>
 <value>/home/nutch_svn/nutch/trunk/mapred/system</value>
</property>

<property>
 <name>mapred.temp.dir</name>
 <value>/home/nutch_svn/nutch/trunk/mapred/temp</value>
</property>

<property>
 <name>mapred.map.tasks</name>
 <value>12</value>
</property>

<property>
 <name>mapred.reduce.tasks</name>
 <value>6</value>
</property>

<property>
 <name>generate.max.per.host</name>
 <value>-1</value>
</property>


Thanks,
-Rafi




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
