[ https://issues.apache.org/jira/browse/NUTCH-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17637810#comment-17637810 ]
Sebastian Nagel commented on NUTCH-2975:
----------------------------------------

The Generator log message only says that no records were selected. Could you
- provide the full logs, i.e. the Hadoop task logs
- look into the CrawlDb after SitemapProcessor was run and verify that it's definitely of size 0
- share the sitemap ... or at least some snippets.
Otherwise it's hard to figure out the reason.

Just one assumption about what could be the reason: SitemapProcessor writes the modification time, priority and fetch interval from the sitemap to the CrawlDb. This is only possible if you can trust these values. A modification time in the future, or a score of 0.0 (not larger than the default of the property generate.min.score) could make the URLs from the sitemap ineligible for crawling.

> Generate 0 partition when used with sitemap
> -------------------------------------------
>
> Key: NUTCH-2975
> URL: https://issues.apache.org/jira/browse/NUTCH-2975
> Project: Nutch
> Issue Type: Bug
> Components: generator, sitemap
> Affects Versions: 1.16
> Environment: Hadoop 3.1.1
> Reporter: Lucas Pauchard
> Priority: Major
>
> *Issue*
> We have been facing a strange issue since we updated our Proxmox from 7.2-4 to
> 7.2-11, which hosts the VMs/containers used for our Hadoop cluster.
> When we use the sitemap component to add URLs, the generator process
> doesn't work. It generates 0 partitions.
> But if we call the generator process a second time, this time the generator
> actually creates a partition segment.
> It happens only when we use the sitemap process. If we use only the Injector
> process, this issue doesn't happen.
> I checked the logs and the generator just seems to find no records in the
> crawldb. It is as if the crawldb wasn't available or the files were locked.
> *Here is the command used :*
> Sitemap :
> {code}
> hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb
> crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm
> -force
> {code}
> It returns as expected :
> {panel:title=Sitemap output}
> 2022-11-23 10:37:22,194 INFO util.SitemapProcessor: SitemapProcessor: Total
> records rejected by filters: 0
> 2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total
> sitemaps from HostDb: 0
> 2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total
> sitemaps from seed urls: 1
> 2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total
> failed sitemap fetches: 0
> 2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total
> new sitemap entries added: 151
> {panel}
> Generator :
> {code}
> hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb
> crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm
> -force
> {code}
> 1st time it returns :
> {code}
> 2022-11-23 11:25:15,202 WARN crawl.Generator: Generator: 0 records selected
> for fetching, exiting ...
> {code}
> 2nd time it returns :
> {code}
> 2022-11-23 11:27:43,007 INFO crawl.Generator: Generator: Partitioning
> selected urls for politeness.
> 2022-11-23 11:27:44,009 INFO crawl.Generator: Generator: segment:
> crawl_000_111/segment/20221123112744
> ...
> 2022-11-23 11:28:34,061 INFO crawl.Generator: Generator: finished at
> 2022-11-23 11:28:34, elapsed: 00:01:53
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)