[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-1252:
--------------------------------------

    Assignee: Sebastian Nagel

> SegmentReader -get shows wrong data
> -----------------------------------
>
>                 Key: NUTCH-1252
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1252
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.5
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 1.6
>
>         Attachments: NUTCH-1252.patch, NUTCH-1252-v2.patch
>
>
> The -get command/option of the SegmentReader may show wrong data for the given URL.
> To reproduce:
> {code}
> % mkdir -p test_readseg/urls
> % echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds
> % nutch inject test_readseg/crawldb test_readseg/urls
> Injector: starting at 2012-01-18 09:32:25
> Injector: crawlDb: test_readseg/crawldb
> Injector: urlDir: test_readseg/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
> % nutch generate test_readseg/crawldb test_readseg/segments/
> Generator: starting at 2012-01-18 09:32:30
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: test_readseg/segments/20120118093232
> Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
> % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' -nocontent -noparse -nofetch -noparsedata -noparsetext
> SegmentReader: get 'http://nutch.apache.org/'
> Crawl Generate::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Wed Jan 18 09:32:26 CET 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 10.0
> Signature: null
> Metadata: _ngt_: 1326875550401test: AbcTest
> {code}
> The metadata and the score indicate that the CrawlDatum shown is the wrong one: it is the one associated with http://abc.test/, not the one for http://nutch.apache.org/.
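
To make the mismatch easier to spot, here is a minimal sketch (not part of the original report) that recreates the seed file from the reproduction steps without running Nutch and prints the key=value pairs injected for each URL. The helper name seed_meta and the scratch directory are assumptions made for illustration only.

```shell
# Recreate the seed file from the report in a scratch directory (no Nutch needed).
dir=$(mktemp -d)
printf 'http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0\n' > "$dir/seeds"

# seed_meta is a hypothetical helper: it prints the tab-separated
# key=value fields that follow the given URL in the seed file.
seed_meta() {
  awk -F '\t' -v url="$1" '$1 == url { for (i = 2; i <= NF; i++) print $i }' "$dir/seeds"
}

seed_meta 'http://nutch.apache.org/'   # prints: test=ApacheNutch
seed_meta 'http://abc.test/'           # prints: test=AbcTest, then nutch.score=10.0
```

Going by the seed file, the entry for http://nutch.apache.org/ should carry test=ApacheNutch and no injected score of 10.0, so the "Score: 10.0" and "test: AbcTest" values in the readseg output above belong to the other seed.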