Hi,

I have a very specific list of URLs to crawl, and I implemented that by
turning off this property:
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

So the newly parsed URLs / outbound links are not added to the crawldb.
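
As a sanity check, this is roughly how I confirm the override is actually
picked up; a minimal sketch assuming the standard NutchConfiguration lookup
(the class name is just something I made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Minimal sketch: print the effective value of the property.
// nutch-default.xml ships with true; my override should make this print false.
public class CheckAdditionsAllowed {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    System.out.println("db.update.additions.allowed = "
        + conf.getBoolean("db.update.additions.allowed", true));
  }
}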

I tried feeding one link to Nutch and it works exactly the way I want: I can
read the raw HTML by deserializing the segment/content/part-000/data file.
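
For reference, this is roughly the reader I use for that (a sketch; the
classes come from the Nutch 1.7 / Hadoop 1.x jars on my classpath, and the
path is only an example from my layout; bin/nutch readseg -dump gives a
similar text dump):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: read the fetched URL (key) and raw page bytes (value) back out of
// a segment's content data file.
public class DumpSegmentContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // e.g. result/segments/<timestamp>/content/part-00000/data
    Path data = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      System.out.println(url);
      System.out.println(new String(content.getContent())); // typically the raw HTML
    }
    reader.close();
  }
}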

However, when I feed 520 URLs to Nutch, the result confuses me.
It created 3 separate folders, each with the same structure as the folder I
just mentioned. When I check the data files in each folder:
folder 1 contains: 400 URLs and their HTML
folder 2 contains: 487 URLs and their HTML
folder 3 contains: 520 URLs and their HTML

Together they add up to about 1400 entries. There are many duplicates among
them, and there are 900 distinct URLs in total, which is still more than the
520 URLs I fed to Nutch.

Here is the research I have done so far:
I have read the source code of Injector and am now working through Fetcher.
The Fetcher mentions that "the number of Queues is based on the number of
hosts...", and I am wondering whether that has anything to do with how those
three folders come about.
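
For what it's worth, this toy sketch is how I currently picture that comment,
i.e. fetch items grouped into one queue per host (just my reading of the idea,
not the actual Fetcher/FetchItemQueues code):

import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy illustration of "one queue per host": group URLs by host name.
// Not Nutch code; as far as I can tell the real Fetcher can also key queues
// by domain or IP depending on fetcher.queue.mode.
public class PerHostQueues {
  public static void main(String[] args) throws Exception {
    String[] urls = { "http://a.example/page1", "http://a.example/page2",
        "http://b.example/page1" };
    Map<String, Queue<String>> queues = new HashMap<String, Queue<String>>();
    for (String u : urls) {
      String host = new URL(u).getHost();
      Queue<String> q = queues.get(host);
      if (q == null) {
        q = new ArrayDeque<String>();
        queues.put(host, q);
      }
      q.add(u);
    }
    System.out.println(queues.size() + " queues for hosts " + queues.keySet());
  }
}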

Can anyone help me understand how those three folders come into existence
and why the URL counts are so odd?

Any hint is appreciated, or please point me to the right class so I can do
some homework myself.
------------------
Extra Info:
I am using Nutch 1.7 on an AWS EC2 Ubuntu instance.
the command to run the crawl (the sketch below shows how I list the segments
it produces):
nohup bin/nutch crawl urls -dir result -depth 3 -topN 10000 &
-------------------
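
And this is roughly how I list the segment folders that command produces
(a sketch; "result" is just the value I passed to -dir above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: print the timestamped segment directories under result/segments;
// these are the three folders I described above.
public class ListSegments {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path("result/segments"))) {
      System.out.println(status.getPath().getName());
    }
  }
}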

