[jira] [Commented] (NUTCH-2975) Generate 0 partition when used with sitemap

2022-11-24 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638291#comment-17638291
 ] 

Sebastian Nagel commented on NUTCH-2975:


Thanks for the notice, [~lucasp]. I've opened NUTCH-2976 to verify sitemap 
values. Similar issues might happen if the values found in sitemaps are trusted 
blindly.
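
As a rough illustration of what "verifying sitemap values" could mean, here is a minimal Python sketch. It is not the change made in NUTCH-2976 and not Nutch code: the function name `sanitize_sitemap_entry` and the fallback choices are assumptions. The only outside fact it relies on is that the sitemap protocol defines priority as a value from 0.0 to 1.0 with a default of 0.5.

```python
from datetime import datetime, timezone


def sanitize_sitemap_entry(lastmod, priority, now):
    """Return (lastmod, priority) with implausible values replaced.

    Hypothetical helper, not part of Nutch:
    - a last-modified time later than `now` is capped at `now`
    - a missing priority, or one outside [0.0, 1.0], falls back to 0.5
    """
    if lastmod is not None and lastmod > now:
        lastmod = now
    if priority is None or not (0.0 <= priority <= 1.0):
        priority = 0.5
    return lastmod, priority


# Example: a sitemap entry claiming a modification date in 2030 and priority 7.0
now = datetime(2022, 11, 24, 12, 0, tzinfo=timezone.utc)
future = datetime(2030, 1, 1, tzinfo=timezone.utc)
lastmod, priority = sanitize_sitemap_entry(future, 7.0, now)
```

Whether to cap, drop, or merely log such values is a design choice; the sketch caps them so that an entry is never lost because of a bad field.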

> Generate 0 partition when used with sitemap
> ---
>
> Key: NUTCH-2975
> URL: https://issues.apache.org/jira/browse/NUTCH-2975
> Project: Nutch
> Issue Type: Bug
> Components: generator, sitemap
> Affects Versions: 1.16
> Environment: Hadoop 3.1.1
> Reporter: Lucas Pauchard
> Priority: Major
>
> *Issue*
> We have been facing a strange issue since we updated our Proxmox from 7.2-4 to 
> 7.2-11, which hosts the VMs/containers used for our Hadoop cluster.
> When we use the sitemap component to add URLs, the generator process 
> doesn't work. It generates 0 partitions.
> But if we run the generator process a second time, it actually creates a 
> partition segment.
> It happens only when we use the sitemap process. If we use only the Injector 
> process, this issue doesn't happen.
> I checked the logs and the generator just seems to find no records in the 
> crawldb. It is as if the crawldb weren't available or the files were locked.
> *Here are the commands used:*
> Sitemap:
> {code}
> hadoop jar  org.apache.nutch.crawl.Generator crawl_000_111/crawldb 
> crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm 
> -force
> {code}
> It returns the expected output:
> {panel:title=Sitemap output}
> 2022-11-23 10:37:22,194 INFO util.SitemapProcessor: SitemapProcessor: Total 
> records rejected by filters: 0
> 2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total 
> sitemaps from HostDb: 0
> 2022-11-23 10:37:22,195 INFO util.SitemapProcessor: SitemapProcessor: Total 
> sitemaps from seed urls: 1
> 2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total 
> failed sitemap fetches: 0
> 2022-11-23 10:37:22,196 INFO util.SitemapProcessor: SitemapProcessor: Total 
> new sitemap entries added: 151
> {panel}
> Generator:
> {code}
> hadoop jar  org.apache.nutch.crawl.Generator crawl_000_111/crawldb 
> crawl_000_111/segment -numFetchers 2 -mapper=1 -reducer=1 -noFilter -noNorm 
> -force
> {code}
> The first time, it returns:
> {code}
> 2022-11-23 11:25:15,202 WARN crawl.Generator: Generator: 0 records selected 
> for fetching, exiting ...
> {code}
> The second time, it returns:
> {code}
> 2022-11-23 11:27:43,007 INFO crawl.Generator: Generator: Partitioning 
> selected urls for politeness.
> 2022-11-23 11:27:44,009 INFO crawl.Generator: Generator: segment: 
> crawl_000_111/segment/20221123112744
> ...
> 2022-11-23 11:28:34,061 INFO crawl.Generator: Generator: finished at 
> 2022-11-23 11:28:34, elapsed: 00:01:53
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2975) Generate 0 partition when used with sitemap

2022-11-24 Thread Lucas Pauchard (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638210#comment-17638210
 ] 

Lucas Pauchard commented on NUTCH-2975:
---

Hello Nagel, 

Thank you for the hint about the time. I found that the clocks on some of our 
VMs were running ahead.

Once I resynchronized the time on these VMs, the issue didn't happen again.

I will therefore close this issue.

Thanks a lot.



[jira] [Commented] (NUTCH-2975) Generate 0 partition when used with sitemap

2022-11-23 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637810#comment-17637810
 ] 

Sebastian Nagel commented on NUTCH-2975:


The Generator log message only says that no records were selected. Could you
- provide the full logs, i.e. the Hadoop task logs
- look into the CrawlDb after SitemapProcessor was run and verify that it's 
definitely of size 0
- share the sitemap

... or at least some snippets. Otherwise it's hard to figure out the reason.

One guess at the cause: SitemapProcessor writes the modification time, priority 
and fetch interval from the sitemap to the CrawlDb. This only works if these 
values can be trusted. A modification time in the future, or a score of 0.0 
(not larger than the default of the property generate.min.score), could make 
the URLs from the sitemap ineligible for crawling.
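
To make the time-related part of this guess concrete, here is a minimal Python sketch. It is a simplified model, not the Generator's actual code: `Record` and `select_due` are hypothetical names, and the only rule modelled is that an entry whose fetch time lies in the future is not yet due. The score threshold is left out on purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Record:
    """Hypothetical stand-in for a CrawlDb entry; not a Nutch class."""
    url: str
    fetch_time: datetime  # earliest time the URL is due for fetching


def select_due(records, now):
    """Return the records whose fetch time is not later than `now`."""
    return [r for r in records if r.fetch_time <= now]


# A record stamped two minutes in the future is skipped now, selected later.
now = datetime(2022, 11, 23, 11, 25, 15)
records = [Record("https://example.org/a", now + timedelta(minutes=2))]

print(len(select_due(records, now)))                         # 0
print(len(select_due(records, now + timedelta(minutes=3))))  # 1
```

Under this model, a run started a few minutes later selects the same records without any change to the CrawlDb, which would match a first run that selects nothing and a second run that succeeds.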
