[jira] [Created] (NUTCH-2326) Implement InvertLinks job in webui package
Sujen Shah created NUTCH-2326:
---

Summary: Implement InvertLinks job in webui package
Key: NUTCH-2326
URL: https://issues.apache.org/jira/browse/NUTCH-2326
Project: Nutch
Issue Type: Task
Components: REST_api, web gui
Affects Versions: 1.13
Reporter: Sujen Shah
Assignee: Sujen Shah

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1314:
Fix Version/s: (was: 2.5)
               2.4

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Assignee: Lewis John McGibbney
> Fix For: 2.4
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch,
> NUTCH-1314-v3.patch, NUTCH-1314-v4.patch, NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken
> sites resulted in ridiculously long URLs that caused tasks to stall.
> The regex plugins (normalizing/filtering) processed single URLs for hours,
> if not hanging indefinitely.
> My suggestion is to limit the outlink URL target length as soon as possible.
> It is a configurable limit; the default is 3000. This should be long enough
> for most uses, but strict enough to make sure regex plugins do not choke on
> URLs that are too long. Please see the attached patch for the Nutchgora
> implementation.
> I'd like to hear what you think about this.
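For illustration only, the length limit proposed in NUTCH-1314 can be sketched as a plain filter over outlink targets that runs before any regex normalizing or filtering. This is not the attached patch: the class name, method name, and constant below are hypothetical, and only the default value of 3000 comes from the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OutlinkLengthLimit {

    // Hypothetical constant; 3000 is the default proposed in the issue text.
    static final int DEFAULT_MAX_LENGTH = 3000;

    /**
     * Returns the outlink targets whose length does not exceed maxLength.
     * Longer targets are dropped before they can reach regex-based plugins.
     */
    static List<String> limit(List<String> targets, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : targets) {
            if (url.length() <= maxLength) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build an artificially long URL of the kind a broken site might emit.
        StringBuilder longUrl = new StringBuilder("http://example.com/");
        for (int i = 0; i < 5000; i++) {
            longUrl.append('a');
        }
        List<String> targets =
            Arrays.asList("http://example.com/ok", longUrl.toString());
        // Only the short URL survives the default limit.
        System.out.println(limit(targets, DEFAULT_MAX_LENGTH));
    }
}
```

The check is a simple length comparison, so it stays cheap however long the offending URL is, which is the point of applying it ahead of the regex plugins.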
[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584402#comment-15584402 ]

Lewis John McGibbney commented on NUTCH-1314:
---

Yes [~wastl-nagel], that was why it was still open. Do you want to port?
[jira] [Created] (NUTCH-2327) Seeds injected in REST workflow must be ingested into HDFS
Lewis John McGibbney created NUTCH-2327:
---

Summary: Seeds injected in REST workflow must be ingested into HDFS
Key: NUTCH-2327
URL: https://issues.apache.org/jira/browse/NUTCH-2327
Project: Nutch
Issue Type: Improvement
Components: injector, REST_api
Affects Versions: 1.12
Reporter: Lewis John McGibbney
Fix For: 1.13

Right now when one uses the REST POST /seed/create API, a directory is created within /var/some/path/here, which is great if you are working locally with the Nutch server, e.g. on one machine. It is however not suitable for using the REST API in distributed deployments, where seeds need to be present within HDFS. More documentation on this topic is available at https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI#Seed_List_creation

There are also various mailing list threads regarding use of the REST API, and the injector URL issue described above needs to be addressed.
[~sujenshah] CC for context.
http://www.mail-archive.com/user%40nutch.apache.org/msg14922.html
http://www.mail-archive.com/user%40nutch.apache.org/msg14921.html
[jira] [Commented] (NUTCH-2327) Seeds injected in REST workflow must be ingested into HDFS
[ https://issues.apache.org/jira/browse/NUTCH-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584663#comment-15584663 ]

Sujen Shah commented on NUTCH-2327:
---

Thanks [~lewismc]. I have already started working on this issue and am testing it locally in pseudo-distributed mode. I had a few issues, but should get through them soon.