Lucas Pauchard created NUTCH-2975:
-------------------------------------

             Summary: Generate 0 partition when used with sitemap
                 Key: NUTCH-2975
                 URL: https://issues.apache.org/jira/browse/NUTCH-2975
             Project: Nutch
          Issue Type: Bug
          Components: generator, sitemap
    Affects Versions: 1.16
         Environment: Hadoop 3.1.1
            Reporter: Lucas Pauchard


*Issue*

We have been facing a strange issue since we updated Proxmox, which hosts the VMs/containers used for our Hadoop cluster, from 7.2-4 to 7.2-11.

When we use the sitemap component to add URLs, the generator does not work on its first run: it generates 0 partitions. If we run the generator a second time, it does create a segment.

This happens only when we use the sitemap process. If we use only the Injector, the issue does not occur.

I checked the logs, and the generator simply seems to find no records in the crawldb. It is as if the crawldb were not available or its files were locked.

*Here are the commands used:*

Sitemap:
{code}
hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force
{code}

It returns the expected output:
{panel:title=Sitemap output}
2022-11-23 10:37:22,194 INFO util.SitemapProcessor: SitemapProcessor: Total records rejected by filters: 0
2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from HostDb: 0
2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from seed urls: 1
2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total failed sitemap fetches: 0
2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total new sitemap entries added: 151
{panel}

Generator:
{code}
hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force
{code}

The first run returns:
{code}
2022-11-23 11:25:15,202 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
{code}

The second run returns:
{code}
2022-11-23 11:27:43,007 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
2022-11-23 11:27:44,009 INFO crawl.Generator: Generator: segment: crawl_000_111/segment/20221123112744
...
2022-11-23 11:28:34,061 INFO crawl.Generator: Generator: finished at 2022-11-23 11:28:34, elapsed: 00:01:53
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)