[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070485#comment-16070485 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Hi Lewis!

It appears to be working fine and bug-free now that the sitemap input no longer overwrites the fetch interval and modified time of existing CrawlDb entries. Overwriting is off because:
* it is messy in Nutch;
* websites almost always set bad values, such as large (100k-URL) websites signaling that everything should be refetched daily.

We have it deployed but not yet activated; activation is planned for early next week.

The patch is based on the mess in this thread's latest comments and on the most recent scraps I found on GitHub. It should contain the most recent contributions you guys added.

> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch,
> NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
> NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch,
> NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0
> licensed and appears to have been used successfully to parse sitemaps as per
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1]
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070468#comment-16070468 ]

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

Fantastic, [~markus17]! Is this working well for you? I am going to try this out.

Out of curiosity, is this based on the GitHub PR or on the various patches attached to this issue? I ask because I've seen quite a lot of variability in the implementations.
[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070323#comment-16070323 ]

Markus Jelsma edited comment on NUTCH-1465 at 6/30/17 3:58 PM:
---------------------------------------------------------------

Ah, removing the NULL check in the reducer solves the problem. The existing entries are no longer overwritten.

The problem was visible with readdb -stats, which showed a number of records with status INJECTED.

was (Author: markus17):
Ah, removing the NULL check in the reducer solves the problem. The existing entries are no longer overwritten.
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1465:
---------------------------------
    Attachment: NUTCH-1465.patch

Ah, removing the NULL check in the reducer solves the problem. The existing entries are no longer overwritten.
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070314#comment-16070314 ]

Markus Jelsma commented on NUTCH-1465:
--------------------------------------

Something odd happens when an entry is listed twice in a sitemap.xml: it then takes on db_status INJECTED and completely overwrites the existing CrawlDatum.
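The behavior one would want for a URL that is listed twice can be sketched as a small merge step. This is only an illustration under stated assumptions, not the code in the attached patch: `SitemapMergeSketch`, `Entry`, `Status` and `merge` are hypothetical names, and `Entry` is a simplified stand-in for a CrawlDb record rather than Nutch's `CrawlDatum`. The sketch assumes that values carrying status INJECTED came from the sitemap and that anything else is the existing CrawlDb entry.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of a reduce step; not Nutch's actual classes.
final class SitemapMergeSketch {

    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    // Simplified stand-in for a CrawlDb record (assumption, not CrawlDatum).
    static final class Entry {
        final Status status;
        final int fetchIntervalSeconds;
        final long modifiedTime;

        Entry(Status status, int fetchIntervalSeconds, long modifiedTime) {
            this.status = status;
            this.fetchIntervalSeconds = fetchIntervalSeconds;
            this.modifiedTime = modifiedTime;
        }
    }

    /**
     * Picks the record to emit for one URL. Returns the existing entry
     * untouched when there is one; otherwise turns the first sitemap
     * listing into a new unfetched entry. Returns null for empty input.
     */
    static Entry merge(List<Entry> valuesForUrl) {
        Entry existing = null;
        Entry fromSitemap = null;
        for (Entry e : valuesForUrl) {
            if (e.status == Status.INJECTED) {
                if (fromSitemap == null) {
                    fromSitemap = e; // later duplicate listings are ignored
                }
            } else {
                existing = e;
            }
        }
        if (existing != null) {
            return existing; // never overwrite what the CrawlDb already knows
        }
        if (fromSitemap == null) {
            return null;
        }
        // A URL seen only in the sitemap must not leave the step as INJECTED.
        return new Entry(Status.DB_UNFETCHED,
                fromSitemap.fetchIntervalSeconds, fromSitemap.modifiedTime);
    }

    public static void main(String[] args) {
        Entry existing = new Entry(Status.DB_FETCHED, 2592000, 1000L);
        Entry listing = new Entry(Status.INJECTED, 86400, 2000L);
        Entry out = merge(Arrays.asList(listing, existing, listing));
        System.out.println(out.status + " interval=" + out.fetchIntervalSeconds);
    }
}
```

With this policy no record leaves the merge with status INJECTED, so a stats report over the result would be a quick way to spot a regression.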
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1465:
---------------------------------
    Attachment: NUTCH-1465.patch

Updated patch:
* corrected the implementation for not overwriting existing entries
* CrawlDb is now emitted via MapFileOutputFormat instead of SequenceFileOutputFormat
[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1465:
---------------------------------
    Attachment: NUTCH-1465.patch

Updated patch for trunk:
* added curly braces to some if statements; that kind of formatting always screws me at some point;
* added support for redirects: in hostdb mode a URL is built for URL filtering, but the actual protocol can be https instead, in which case the site redirects;
* added support for defaulting to /sitemap.xml, because some robots.txt files do not properly point to the sitemap;
* added support for NOT OVERWRITING existing CrawlDatum information and made it the default option; letting an external sitemap overwrite the fetch interval is a very bad idea.
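The /sitemap.xml fallback described above can be illustrated with a short, self-contained sketch. It is not taken from the patch: `SitemapLocationSketch` and `sitemapUrls` are hypothetical names, and the parsing is deliberately naive. It relies only on the `Sitemap:` directive that the sitemap protocol defines for robots.txt, and it assumes the robots.txt content has already been fetched as a string.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a sketch of the fallback idea, not the patch's code.
final class SitemapLocationSketch {

    private static final String DIRECTIVE = "sitemap:";

    /**
     * Returns the absolute sitemap URLs announced in robots.txt, or the
     * conventional /sitemap.xml on the same host when none is announced.
     */
    static List<URI> sitemapUrls(URI hostRoot, String robotsTxt) {
        List<URI> found = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively.
            if (trimmed.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String value = trimmed.substring(DIRECTIVE.length()).trim();
                try {
                    URI candidate = URI.create(value);
                    if (candidate.isAbsolute()) {
                        found.add(candidate);
                    }
                } catch (IllegalArgumentException ignored) {
                    // Malformed value: skip it and keep scanning.
                }
            }
        }
        if (found.isEmpty()) {
            // robots.txt does not point to a sitemap, so guess the default.
            found.add(hostRoot.resolve("/sitemap.xml"));
        }
        return found;
    }

    public static void main(String[] args) {
        URI root = URI.create("https://example.org/");
        System.out.println(sitemapUrls(root, "User-agent: *\nDisallow: /tmp\n"));
        System.out.println(sitemapUrls(root, "Sitemap: https://example.org/maps/index.xml\n"));
    }
}
```

A real fetcher would additionally have to follow redirects (for example from http to https), which this sketch leaves out.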
[jira] [Updated] (NUTCH-2396) Cannot stop or abort fetch job via REST API
[ https://issues.apache.org/jira/browse/NUTCH-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey updated NUTCH-2396:
--------------------------
    Description:
Case 1:
1) Run a fetch job via the REST API.
2) Send a stop job request.
3) The request finished with code 200 and returned the string 'false'.
4) The job state changed to "STOPPING", but the job stopped only after it had finished *all* its work.

Case 2:
1) Run a fetch job via the REST API.
2) Send an abort job request.
3) The request finished with code 200 and returned the string 'false'.
4) The job state changed to "KILLED", but the job continued working and stopped only after it had finished *all* its work.

  was:
Case 1:
1) Run a fetch job via the REST API.
2) Send a stop job request.
3) The request finished with code 200 and returned the string 'false'.
4) The job state changed to "STOPPING", but the job stopped only after it had finished *all* its work.

Case 2:
1) Run a fetch job via the REST API.
2) Send an abort job request.
3) The request finished with code 200 and returned the string 'false'.
4) The job state changed to "KILLED", but continued working and stopped only after it had finished *all* its work.

> Cannot stop or abort fetch job via REST API
> -------------------------------------------
>
> Key: NUTCH-2396
> URL: https://issues.apache.org/jira/browse/NUTCH-2396
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.13
> Reporter: Sergey
>
> Case 1:
> 1) Run a fetch job via the REST API.
> 2) Send a stop job request.
> 3) The request finished with code 200 and returned the string 'false'.
> 4) The job state changed to "STOPPING", but the job stopped only after it had
> finished *all* its work.
> Case 2:
> 1) Run a fetch job via the REST API.
> 2) Send an abort job request.
> 3) The request finished with code 200 and returned the string 'false'.
> 4) The job state changed to "KILLED", but the job continued working and
> stopped only after it had finished *all* its work.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
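The two cases hinge on the fact that an HTTP 200 response does not by itself mean the stop or abort took effect; the boolean in the body has to be checked as well. A client-side sketch of that check follows. Everything in it is an assumption made for illustration: the class `JobControlSketch`, the base URL `http://localhost:8081`, and the paths `/job/{id}/stop` and `/job/{id}/abort` are not stated in this report and should be verified against the REST service of the Nutch version in use. The job ID is assumed to be URL-safe, and no request is actually sent.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical client-side helper; endpoint paths are assumptions.
final class JobControlSketch {

    enum Action { STOP, ABORT }

    /** Builds the assumed control URL, e.g. {base}/job/{id}/stop. */
    static URI actionUri(String baseUrl, String jobId, Action action) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        String verb = action.name().toLowerCase(Locale.ROOT);
        return URI.create(base + "/job/" + jobId + "/" + verb);
    }

    /**
     * Treats the request as accepted only when the status is 200 and the
     * body is the string "true"; a 200 with "false" counts as a failure.
     */
    static boolean accepted(int httpStatus, String body) {
        return httpStatus == 200 && body != null
                && "true".equalsIgnoreCase(body.trim());
    }

    public static void main(String[] args) {
        // "example-job-id" is a placeholder, not a real Nutch job ID.
        System.out.println(actionUri("http://localhost:8081", "example-job-id", Action.STOP));
        System.out.println(accepted(200, "false"));
    }
}
```

Even when `accepted` returns true, a caller would still need to poll the job state, since the report shows the state changing while the job keeps running.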
[jira] [Updated] (NUTCH-2396) Cannot stop or abort fetch job via REST API
[ https://issues.apache.org/jira/browse/NUTCH-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey updated NUTCH-2396:
--------------------------
    Summary: Cannot stop or abort fetch job via REST API  (was: Cannot stop or abort FETCH job via REST API)
[jira] [Updated] (NUTCH-2396) Cannot stop or abort FETCH job via REST API
[ https://issues.apache.org/jira/browse/NUTCH-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey updated NUTCH-2396:
--------------------------
    Summary: Cannot stop or abort FETCH job via REST API  (was: Cannot stop or abort fetcher job via REST API)
[jira] [Created] (NUTCH-2396) Cannot stop or abort fetcher job via REST API
Sergey created NUTCH-2396:
--------------------------

Summary: Cannot stop or abort fetcher job via REST API
Key: NUTCH-2396
URL: https://issues.apache.org/jira/browse/NUTCH-2396
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.13
Reporter: Sergey

Case 1:
1) Run a fetch job via the REST API.
2) Send a stop job request.
3) The request finished with code 200 and returned the string 'false'.
4) The job state changed to "STOPPING", but the job stopped only after it had finished *all* its work.

Case 2:
1) Run a fetch job via the REST API.
2) Send an abort job request.
3) The request finished with code 200 and returned the string 'false'.
4) The job state changed to "KILLED", but continued working and stopped only after it had finished *all* its work.