[jira] [Created] (NUTCH-2326) Implement InvertLinks job in webui package

2016-10-17 Thread Sujen Shah (JIRA)
Sujen Shah created NUTCH-2326:
-

 Summary: Implement InvertLinks job in webui package
 Key: NUTCH-2326
 URL: https://issues.apache.org/jira/browse/NUTCH-2326
 Project: Nutch
  Issue Type: Task
  Components: REST_api, web gui
Affects Versions: 1.13
Reporter: Sujen Shah
Assignee: Sujen Shah






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-10-17 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1314:

Fix Version/s: (was: 2.5)
   2.4

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Assignee: Lewis John McGibbney
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused the stalling of tasks. 
> The regex plugins (normalizing/filtering) processed single urls for hours, if 
> not hanging indefinitely.
> My suggestion is to limit the outlink url target length as soon as possible. 
> It is a configurable limit; the default is 3000. This should be long enough 
> for most uses, but strict enough to make sure regex plugins do not choke on 
> urls that are too long. Please see attached patch for 
> the Nutchgora implementation.
> I'd like to hear what you think about this.
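
The length limit described above can be sketched as follows. This is a minimal illustration in Python, not the attached Nutch patch (Nutch itself is Java): the function name `limit_outlinks`, the constant name, and the choice to treat the limit as inclusive are assumptions; only the default of 3000 comes from the issue text.

```python
# Hypothetical sketch: drop outlink URLs that exceed a configurable length
# before they reach any regex-based normalizer or filter.
DEFAULT_MAX_OUTLINK_LENGTH = 3000  # default limit stated in the issue


def limit_outlinks(outlinks, max_length=DEFAULT_MAX_OUTLINK_LENGTH):
    """Return the outlink URLs whose length is at most max_length, in order.

    Treating the limit as inclusive is an assumption made for this sketch.
    """
    return [url for url in outlinks if len(url) <= max_length]


if __name__ == "__main__":
    urls = ["http://example.org/page", "http://example.org/" + "a" * 5000]
    # Only the short URL survives the length check.
    print(limit_outlinks(urls))
```

Applying the check before the regex plugins run is the point of the proposal: the cost of a length comparison is constant per URL, whereas a regex may take very long on a pathological input.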





[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-10-17 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584402#comment-15584402
 ] 

Lewis John McGibbney commented on NUTCH-1314:
-

Yes [~wastl-nagel] that was why it was still open. Do you want to port?

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Assignee: Lewis John McGibbney
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused the stalling of tasks. 
> The regex plugins (normalizing/filtering) processed single urls for hours, if 
> not hanging indefinitely.
> My suggestion is to limit the outlink url target length as soon as possible. 
> It is a configurable limit; the default is 3000. This should be long enough 
> for most uses, but strict enough to make sure regex plugins do not choke on 
> urls that are too long. Please see attached patch for 
> the Nutchgora implementation.
> I'd like to hear what you think about this.





[jira] [Created] (NUTCH-2327) Seeds injected in REST workflow must be ingested into HDFS

2016-10-17 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2327:
---

 Summary: Seeds injected in REST workflow must be ingested into HDFS
 Key: NUTCH-2327
 URL: https://issues.apache.org/jira/browse/NUTCH-2327
 Project: Nutch
  Issue Type: Improvement
  Components: injector, REST_api
Affects Versions: 1.12
Reporter: Lewis John McGibbney
 Fix For: 1.13


Right now when one uses the REST POST /seed/create API, a directory is created 
within /var/some/path/here, which is fine if you are working locally with the 
Nutch server, e.g. on one machine. It is however not suitable for using the 
REST API in distributed deployments, where seeds need to be present within HDFS. 
More documentation on this topic is available at 
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI#Seed_List_creation
There are also various mailing list threads regarding use of the REST API, and 
the injector url issue described above needs to be addressed.

[~sujenshah] CC for context.

http://www.mail-archive.com/user%40nutch.apache.org/msg14922.html
http://www.mail-archive.com/user%40nutch.apache.org/msg14921.html
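
To make the local-versus-HDFS distinction concrete, here is a minimal Python sketch. It is not Nutch code: the helper names `write_seed_file` and `is_local_path`, the file name `seed.txt`, and the example HDFS URI are all assumptions made for illustration. It writes a seed list to a local directory and then shows why such a path is only usable on the machine that created it.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def write_seed_file(seed_urls, directory):
    """Write one URL per line to <directory>/seed.txt and return the path.

    The file name seed.txt is a hypothetical choice for this sketch.
    """
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    seed_file = target / "seed.txt"
    seed_file.write_text("\n".join(seed_urls) + "\n", encoding="utf-8")
    return seed_file


def is_local_path(location):
    """Return True when the location has no scheme or a file: scheme."""
    return urlparse(location).scheme in ("", "file")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_seed_file(["http://nutch.apache.org/"], tmp)
        # A plain local path: visible only on this machine.
        print(path, is_local_path(str(path)))
    # A hypothetical HDFS location: visible to every node in the cluster.
    print(is_local_path("hdfs://namenode:8020/user/nutch/seeds"))
```

In a distributed deployment the seed file would additionally have to be copied into HDFS (for example with `hdfs dfs -put`) before an inject job running on other nodes could read it.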





[jira] [Commented] (NUTCH-2327) Seeds injected in REST workflow must be ingested into HDFS

2016-10-17 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584663#comment-15584663
 ] 

Sujen Shah commented on NUTCH-2327:
---

Thanks [~lewismc]. I have already started working on this issue and am testing 
it locally in pseudo-distributed mode. I had a few issues but should 
get through them soon.

> Seeds injected in REST workflow must be ingested into HDFS
> --
>
> Key: NUTCH-2327
> URL: https://issues.apache.org/jira/browse/NUTCH-2327
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector, REST_api
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
> Fix For: 1.13
>
>
> Right now when one uses the REST POST /seed/create API, a directory is 
> created within /var/some/path/here, which is fine if you are working locally 
> with the Nutch server, e.g. on one machine. It is however not suitable for 
> using the REST API in distributed deployments, where seeds need to be present 
> within HDFS. More documentation on this topic is available at 
> https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI#Seed_List_creation
> There are also various mailing list threads regarding use of the REST API, 
> and the injector url issue described above needs to be addressed.
> [~sujenshah] CC for context.
> http://www.mail-archive.com/user%40nutch.apache.org/msg14922.html
> http://www.mail-archive.com/user%40nutch.apache.org/msg14921.html


