[ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-584:
-----------------------------------

    Attachment: generator.patch

This patch addresses the problem; your test case runs correctly with it applied.
Please test.

> urls missing from fetchlist
> ---------------------------
>
>                 Key: NUTCH-584
>                 URL: https://issues.apache.org/jira/browse/NUTCH-584
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.9.0, 1.0.0
>         Environment: FreeBSD 7.0, JDK 1.5.0, Nu
>            Reporter: Ruslan Ermilov
>         Attachments: generator.patch
>
>
> When generating an initial set of ~100k URLs for fetching, I noticed that
> some URLs are missing from the fetchlist.
> The test case below has only 2 URLs. I used the FreeGenerator tool instead
> of the standard inject/generate, which saves me time when experimenting.
> The result is the same in clustered and local mode.
> Somehow only one of the two URLs ends up in the fetchlist:
> $ rm -rf segments
> $ cat urls/x
> http://tkd.ru/
> http://t-f.ru/
> $ nutch org.apache.nutch.tools.FreeGenerator urls segments
> $ nutch readseg -dump segments/* xxx -nocontent -noparse -noparsedata -noparsetext -nofetch
> SegmentReader: dump segment: segments/20071128195720
> SegmentReader: done
> $ cat xxx/dump
> Recno:: 0
> URL:: http://tkd.ru/
> CrawlDatum::
> Version: 5
> Status: 0 (unknown)
> Fetch time: Wed Nov 28 19:57:20 GMT 2007
> Modified time: Thu Jan 01 00:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 0.0 days
> Score: 1.0
> Signature: null
> Metadata: null
> $ 
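
With a larger seed list (such as the ~100k URLs mentioned in the report), spotting the missing entries by eye is impractical. The following is a minimal sketch of one way to compare a seed file against a readseg dump. It assumes the seed file holds one URL per line and that the dump marks each record with a `URL:: ` line, as in the output quoted above. The function names (`read_seed_urls`, `read_dumped_urls`, `missing_urls`) are hypothetical helpers and not part of Nutch.

```python
def read_seed_urls(path):
    """Return the non-empty lines of a seed file (assumed: one URL per line)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def read_dumped_urls(path):
    """Return the set of URLs found on 'URL:: ' lines of a readseg dump."""
    prefix = "URL:: "
    with open(path, encoding="utf-8") as f:
        return {line[len(prefix):].strip() for line in f if line.startswith(prefix)}


def missing_urls(seed_path, dump_path):
    """List seed URLs, in seed-file order, that do not appear in the dump."""
    dumped = read_dumped_urls(dump_path)
    return [url for url in read_seed_urls(seed_path) if url not in dumped]


if __name__ == "__main__":
    # Paths taken from the test case quoted above.
    for url in missing_urls("urls/x", "xxx/dump"):
        print(url)
```

For the two-URL test case in the report, this should list http://t-f.ru/ as the URL absent from the fetchlist.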

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.