[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070485#comment-16070485
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Hi Lewis!

It appears to be working fine and bug-free now that the sitemap input no 
longer overwrites the fetch interval and modified time of existing CrawlDb 
entries. Overwriting is disabled because:
* that is messy in Nutch
* websites almost always set bad values, such as large, 100k-page websites 
signaling that everything should be refetched daily
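
The rule described above can be sketched as follows. This is only an illustrative, standalone model in Python, not the patch itself and not Nutch's Java code: the `Entry` class, its field names, the `merge` function and the `overwrite` flag are all assumptions made up for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Simplified stand-in for a CrawlDb record (hypothetical, not CrawlDatum)."""
    status: str
    fetch_interval: int  # seconds between refetches
    modified_time: int   # epoch milliseconds, 0 if unknown


def merge(existing: Optional[Entry], from_sitemap: Entry,
          overwrite: bool = False) -> Entry:
    """Combine a sitemap hint with what the CrawlDb already knows."""
    if existing is None:
        # URL is new to the CrawlDb: accept the sitemap entry as it is.
        return from_sitemap
    if not overwrite:
        # Default: the site's interval and modified time are ignored.
        return existing
    # Opt-in: copy only the two hinted fields, keep the existing status.
    return replace(existing,
                   fetch_interval=from_sitemap.fetch_interval,
                   modified_time=from_sitemap.modified_time)
```

With this shape, a site announcing a daily refetch for every page changes nothing for URLs that are already known, unless overwriting is switched on explicitly.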

We have it deployed but not yet activated; activating it is the plan for early next week.

The patch is based on the mess in this thread's latest comments and the most 
recent scraps I found on GitHub. It should include the most recent 
contributions you guys added.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, 
> NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, 
> NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070468#comment-16070468
 ] 

Lewis John McGibbney commented on NUTCH-1465:
-

Fantastic, [~markus17]! Is this working well for you? I am going to try this out. 
Out of curiosity, is this based on the GitHub PR or on the various patches 
attached to this issue? I ask because I've seen quite a lot of variability in 
the implementations.



[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070323#comment-16070323
 ] 

Markus Jelsma edited comment on NUTCH-1465 at 6/30/17 3:58 PM:
---

Ah, removing the NULL check in the reducer solves the problem. The existing 
entries are no longer overwritten. The problem was visible with readdb -stats, 
which showed a number of records with status INJECTED.


was (Author: markus17):
Ah, removing the NULL check in the reducer solves the problem. The existing 
entries are no longer overwritten



[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1465:
-
Attachment: NUTCH-1465.patch

Ah, removing the NULL check in the reducer solves the problem. The existing 
entries are no longer overwritten



[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16070314#comment-16070314
 ] 

Markus Jelsma commented on NUTCH-1465:
--

Something odd happens when a sitemap.xml entry is listed twice. The entry then 
takes on the db_status INJECTED and completely overwrites the existing CrawlDatum.
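
One way to make a reduce step insensitive to duplicate sitemap entries is sketched below. This is a standalone Python model under stated assumptions, not the code from the patch: the `Record` type, the `"crawldb"` and `"sitemap"` source labels, and both function names are invented for illustration, and the status strings are plain example values.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    source: str  # "crawldb" for existing entries, "sitemap" for new input
    status: str


def reduce_url(values: Iterable[Record]) -> Record:
    """Pick the record to keep for one URL (at least one value is expected)."""
    values = list(values)
    for record in values:
        if record.source == "crawldb":
            # An existing entry always wins, however many times the
            # same URL was listed in the sitemap.
            return record
    # Only sitemap records were seen: keep a single one of them.
    return values[0]


def status_counts(grouped: Dict[str, List[Record]]) -> Counter:
    """Tally the statuses that would end up in the output, one per URL."""
    return Counter(reduce_url(values).status for values in grouped.values())
```

A tally like `status_counts` gives a quick check: a URL that already has an existing entry should never contribute to the injected count, whether it appears once or twice in the sitemap.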



[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1465:
-
Attachment: NUTCH-1465.patch

Updated patch:

* corrected implementation for not overwriting existing entries
* CrawlDB is now emitted via MapFileOutputFormat instead of SequenceFileOutputFormat



[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2017-06-30 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1465:
-
Attachment: NUTCH-1465.patch

Updated patch for trunk:

* added curly braces to some if statements; that kind of formatting always 
trips me up at some point
* added support for redirects: in hostdb mode a URL is built for URL 
filtering, but the actual protocol can be https instead, hence the redirect
* added support for defaulting to /sitemap.xml, since some robots.txt files do 
not properly point to the sitemap
* added support for NOT OVERWRITING existing CrawlDatum information and made it 
the default option; letting an external sitemap overwrite the interval is a 
very bad idea
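
The /sitemap.xml default and the redirect handling from the list above can be sketched roughly like this. It is a minimal Python illustration, not the patch's Java code: `sitemap_urls`, `follow_redirects`, the `fetch` callable and the redirect limit of 3 are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple
from urllib.parse import urljoin, urlunsplit

# fetch(url) -> (HTTP status, Location header or None, body); supplied by the caller.
Fetch = Callable[[str], Tuple[int, Optional[str], bytes]]


def sitemap_urls(host: str, robots_sitemaps: Sequence[str],
                 scheme: str = "http") -> List[str]:
    """Use the sitemaps named in robots.txt, else fall back to /sitemap.xml."""
    if robots_sitemaps:
        return list(robots_sitemaps)
    return [urlunsplit((scheme, host, "/sitemap.xml", "", ""))]


def follow_redirects(url: str, fetch: Fetch,
                     max_redirects: int = 3) -> Tuple[str, int, bytes]:
    """Fetch url, following e.g. an http to https redirect a few times."""
    for _ in range(max_redirects + 1):
        status, location, body = fetch(url)
        if status in (301, 302, 303, 307, 308) and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, status, body
    raise RuntimeError("too many redirects")
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client, so the redirect logic can be exercised with a stub.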



[jira] [Updated] (NUTCH-2396) Cannot stop or abort fetch job via REST API

2017-06-30 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated NUTCH-2396:
--
Description: 
Case 1:
1) Run fetch job via REST API.
2) Send stop job request.
3) Request finished with code 200 and returned the string 'false'.
4) Job state changed to "STOPPING", but the job stopped only after finishing 
*all* its work.

Case 2:
1) Run fetch job via REST API.
2) Send abort job request.
3) Request finished with code 200 and returned the string 'false'.
4) Job state changed to "KILLED", but the job continued working and stopped 
after finishing *all* its work.

  was:
Case 1:
1) Run fetch job via REST API.
2) Send stop job request.
3) Request finished with code 200 and returned the string 'false'.
4) Job state changed to "STOPPING", but the job stopped only after finishing 
*all* its work.

Case 2:
1) Run fetch job via REST API.
2) Send abort job request.
3) Request finished with code 200 and returned the string 'false'.
4) Job state changed to "KILLED", but continued working and stopped after 
finishing *all* its work.


> Cannot stop or abort fetch job via REST API
> ---
>
> Key: NUTCH-2396
> URL: https://issues.apache.org/jira/browse/NUTCH-2396
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.13
>Reporter: Sergey
>
> Case 1:
> 1) Run fetch job via REST API.
> 2) Send stop job request.
> 3) Request finished with code 200 and returned the string 'false'.
> 4) Job state changed to "STOPPING", but the job stopped only after finishing 
> *all* its work.
> Case 2:
> 1) Run fetch job via REST API.
> 2) Send abort job request.
> 3) Request finished with code 200 and returned the string 'false'.
> 4) Job state changed to "KILLED", but the job continued working and stopped 
> after finishing *all* its work.





[jira] [Updated] (NUTCH-2396) Cannot stop or abort fetch job via REST API

2017-06-30 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated NUTCH-2396:
--
Summary: Cannot stop or abort fetch job via REST API  (was: Cannot stop or 
abort FETCH job via REST API)



[jira] [Updated] (NUTCH-2396) Cannot stop or abort FETCH job via REST API

2017-06-30 Thread Sergey (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated NUTCH-2396:
--
Summary: Cannot stop or abort FETCH job via REST API  (was: Cannot stop or 
abort fetcher job via REST API)



[jira] [Created] (NUTCH-2396) Cannot stop or abort fetcher job via REST API

2017-06-30 Thread Sergey (JIRA)
Sergey created NUTCH-2396:
-

 Summary: Cannot stop or abort fetcher job via REST API
 Key: NUTCH-2396
 URL: https://issues.apache.org/jira/browse/NUTCH-2396
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.13
Reporter: Sergey


Case 1:
1) Run fetch job via REST API.
2) Send stop job request.
3) Request finished with code 200 and returned the string 'false'.
4) Job state changed to "STOPPING", but the job stopped only after finishing 
*all* its work.

Case 2:
1) Run fetch job via REST API.
2) Send abort job request.
3) Request finished with code 200 and returned the string 'false'.
4) Job state changed to "KILLED", but continued working and stopped after 
finishing *all* its work.


