Lucas Pauchard created NUTCH-2975:
-------------------------------------

             Summary: Generate 0 partition when used with sitemap
                 Key: NUTCH-2975
                 URL: https://issues.apache.org/jira/browse/NUTCH-2975
             Project: Nutch
          Issue Type: Bug
          Components: generator, sitemap
    Affects Versions: 1.16
         Environment: Hadoop 3.1.1

            Reporter: Lucas Pauchard


*Issue*
We have been facing a strange issue since we updated Proxmox, which hosts the 
VMs/containers used for our Hadoop cluster, from 7.2-4 to 7.2-11.

When we use the sitemap component to add URLs, the generator process 
doesn't work: it generates 0 partitions.

But if we run the generator process a second time, it actually creates a 
segment.

It happens only when we use the sitemap process. If we use only the Injector 
process, the issue doesn't occur.
I checked the logs, and the generator just seems to find no records in the 
crawldb. It is as if the crawldb weren't available or its files were locked.

*Here are the commands used:*
Sitemap:

{code}
hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force
{code}

It returns the expected output:
{panel:title=Sitemap output}
2022-11-23 10:37:22,194 INFO util.SitemapProcessor: SitemapProcessor: Total records rejected by filters: 0
2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from HostDb: 0
2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total sitemaps from seed urls: 1
2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total failed sitemap fetches: 0
2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total new sitemap entries added: 151
{panel}

Generator:
{code}
hadoop jar <job> org.apache.nutch.crawl.Generator crawl_000_111/crawldb crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm -force
{code}

The first time, it returns:
{code}
2022-11-23 11:25:15,202 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
{code}
The second time, it returns:
{code}
2022-11-23 11:27:43,007 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
2022-11-23 11:27:44,009 INFO crawl.Generator: Generator: segment: crawl_000_111/segment/20221123112744
...
2022-11-23 11:28:34,061 INFO crawl.Generator: Generator: finished at 2022-11-23 11:28:34, elapsed: 00:01:53
{code}
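
To tell the two outcomes apart when scripting a reproduction, one option is to match the warning line from the first run. This is only a minimal sketch: the function name generator_selected_nothing is hypothetical (it is not part of Nutch), and it assumes the Generator log is piped to it on stdin with each log record on a single line.

```shell
# Hypothetical helper, not part of Nutch.
# Reads a Generator log on stdin and succeeds (exit status 0) only if it
# contains the "0 records selected" warning quoted in this report.
generator_selected_nothing() {
  grep -q 'Generator: 0 records selected for fetching'
}
```

It could be used, for example, by piping a saved log of the Generator run into the function and checking its exit status before deciding whether to run the generator again.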



