You ran 3 rounds of the Nutch crawl ("-depth 3"), and those 3 folders are the
3 segments, one created for each round of the crawl.
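
If you want to double-check that, you can list the segments and their
per-segment counts. A rough sketch, assuming the default layout created by
"-dir result" (i.e. the segments live under result/segments):

  # List the segment directories -- there should be one per crawl round.
  ls result/segments/

  # Print per-segment stats (generated / fetched / parsed URL counts).
  bin/nutch readseg -list -dir result/segments/
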
About the 520 URLs, I don't see any obvious reason for that happening. You
should look at a few of the new URLs that were added, find out what their
parent URLs were, and then run a small crawl using those parents as seeds.
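
If I remember the 1.x SegmentReader options correctly, something along these
lines should work; the segment name, directories and example URLs below are
just placeholders:

  # Dump one segment, keeping only the parse data (which lists each page's outlinks).
  # Replace SEGMENT with one of the timestamped directories under result/segments.
  bin/nutch readseg -dump result/segments/SEGMENT seg_dump \
      -nocontent -nofetch -nogenerate -noparse -noparsetext

  # An unexpected URL should show up as an "outlink: toUrl: ..." line inside
  # the record of its parent page; the parent's own URL is on that record's
  # "URL::" line a little further up in the dump.
  grep -n "toUrl: http://example.com/unexpected-page" seg_dump/dump

  # Then put the parents you find into a small seed list and run a shallow crawl:
  mkdir -p parent_urls
  echo "http://example.com/parent-page" > parent_urls/seed.txt
  bin/nutch crawl parent_urls -dir parent_test -depth 1 -topN 100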

Thanks,
Tejas


On Tue, Dec 24, 2013 at 8:06 AM, Bin Wang <[email protected]> wrote:

> Hi,
>
> I have a very specific list of URLs to crawl and I implemented it by
> turning off this property:
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>false</value>
>   <description>If true, updatedb will add newly discovered URLs, if false
>   only already existing URLs in the CrawlDb will be updated and no new
>   URLs will be added.
>   </description>
> </property>
>
> So it will not add the parsed new URLs / outbound links into the crawldb.
>
> I tried to feed one link to Nutch, and it works exactly the way I want: I
> can read the raw HTML by deserializing the segment/content/part-000/data
> file.
>
> However, when I feed 520 URLs to Nutch, the result is confusing me.
> It created 3 separate folders and each one has the same structure as the
> folder I just mentioned. When I check the data files in each folder...
> folder 1 contains:
> 400 URLs and their HTML
> folder 2 contains:
> 487 URLs ..
> folder 3 contains:
> 520 URLs ..
>
> And they add up to about 1400! There are many duplicates when you add them
> up, and there are 900 distinct URLs in total, which is still more than the
> number of URLs that I fed to Nutch.
>
> Here is the research that I have done:
> I have read the source code for Injector and am working through Fetcher.
> The Fetcher mentions that "the number of Queues is based on the number of
> hosts..." and I am wondering whether that has anything to do with how those
> three folders came about.
>
> Can anyone help me understand how those three folders come into existence
> and why the URL counts are so strange?
>
> Any hint is appreciated, or point me to the right class so I can do some
> homework myself.
> ------------------
> Extra Info:
> I am using Nutch 1.7 on an AWS EC2 Ubuntu instance.
> The command used to run the crawl:
> nohup bin/nutch crawl urls -dir result -depth 3 -topN 10000 &
> -------------------
>
> /usr/bin
