[ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki updated NUTCH-584:
------------------------------------
Attachment: generator.patch
This patch addresses the problem; your test case executes fine with it applied.
Please test.
> urls missing from fetchlist
> ---------------------------
>
> Key: NUTCH-584
> URL: https://issues.apache.org/jira/browse/NUTCH-584
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 0.9.0, 1.0.0
> Environment: FreeBSD 7.0, JDK 1.5.0, Nu
> Reporter: Ruslan Ermilov
> Attachments: generator.patch
>
>
> When generating an initial set of ~100k URLs for fetching, I noticed that
> some URLs are missing from the fetchlist.
> The test case below has only 2 URLs. I used the FreeGenerator tool instead
> of the standard inject/generate cycle, which saves me time when
> experimenting. The result is the same in clustered and local mode.
> Somehow only one of the two URLs ends up in the fetchlist:
> $ rm -rf segments
> $ cat urls/x
> http://tkd.ru/
> http://t-f.ru/
> $ nutch org.apache.nutch.tools.FreeGenerator urls segments
> $ nutch readseg -dump segments/* xxx -nocontent -noparse -noparsedata -noparsetext -nofetch
> SegmentReader: dump segment: segments/20071128195720
> SegmentReader: done
> $ cat xxx/dump
> Recno:: 0
> URL:: http://tkd.ru/
> CrawlDatum::
> Version: 5
> Status: 0 (unknown)
> Fetch time: Wed Nov 28 19:57:20 GMT 2007
> Modified time: Thu Jan 01 00:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 0.0 days
> Score: 1.0
> Signature: null
> Metadata: null
> $
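
With a larger seed list (such as the ~100k URLs mentioned above), comparing the seed file against the dump by eye is impractical. The following is a minimal sketch of how the check could be automated. It is not part of Nutch; the function names are hypothetical, and it assumes only the dump layout shown in the test case, where each record carries a line starting with "URL::".

```python
def missing_urls(seed_text, dump_text):
    """Return seed URLs that do not appear in a readseg dump.

    Assumes one URL per line in the seed text and that every dumped
    record has a line of the form 'URL:: <url>', as in the transcript.
    """
    seeds = [line.strip() for line in seed_text.splitlines() if line.strip()]
    dumped = set()
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith("URL::"):
            dumped.add(line[len("URL::"):].strip())
    # Preserve seed order so the report is easy to compare with the input.
    return [url for url in seeds if url not in dumped]


def missing_from_files(seed_path, dump_path):
    """Convenience wrapper: read both files and compare them."""
    with open(seed_path, encoding="utf-8") as f:
        seed_text = f.read()
    with open(dump_path, encoding="utf-8") as f:
        dump_text = f.read()
    return missing_urls(seed_text, dump_text)
```

For the test case above, calling missing_from_files("urls/x", "xxx/dump") would be expected to report http://t-f.ru/ before the patch and nothing once both URLs reach the fetchlist.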
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.