[jira] [Updated] (NUTCH-1098) better url-normalizer basic
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: (was: patch-with-utf8-encoding.diff)

better url-normalizer basic
---------------------------

    Key: NUTCH-1098
    URL: https://issues.apache.org/jira/browse/NUTCH-1098
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.3
    Environment: Any
    Reporter: Radim Kolar
    Assignee: Markus Jelsma
    Labels: encoding, url
    Fix For: 1.5
    Original Estimate: 4h
    Remaining Estimate: 4h

Basic URL normalizer lacks 2 important features:
- Encode spaces in URLs as %20 to unbreak httpclient and possibly other clients that do not expect a space inside a URL.
- Ability to decode percent-encoding such as %33 in URLs. This is important for avoiding duplicates.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1194:
-------------------------------

    Comment: was deleted

(was: Locking should be done in the setup/cleanup task. Currently, if you kill the process submitting the generate job to Hadoop, the crawl database will stay locked. It needs to be reworked: instead of running jobs one by one, submit them all at once and make them depend on each other. After the jobs are placed in the Hadoop queue you can kill the client without causing any bad effects.)

CrawlDB lock should be released earlier
---------------------------------------

    Key: NUTCH-1194
    URL: https://issues.apache.org/jira/browse/NUTCH-1194
    Project: Nutch
    Issue Type: Improvement
    Components: generator
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Priority: Minor
    Fix For: 1.5

The lock on the CrawlDB is released when everything is finished. But when generating many segments, the lock remains in place while it's not necessary anymore. If GENERATE_UPDATE_DB is false we can release the lock immediately after the selector has finished.
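The early-release idea in NUTCH-1194 can be sketched as follows. This is a minimal illustration and not the Nutch Generator code: the class `EarlyLockReleaseSketch`, the `updateDb` flag (standing in for GENERATE_UPDATE_DB), the plain lock file, and the three `Runnable` phases are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class EarlyLockReleaseSketch {

    // Hypothetical lock: an empty marker file that must not already exist.
    static void lock(Path lockFile) {
        try {
            Files.createFile(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot lock " + lockFile, e);
        }
    }

    static void unlock(Path lockFile) {
        try {
            Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot unlock " + lockFile, e);
        }
    }

    /**
     * Runs the three phases in order. When updateDb is false, nothing after
     * the selector touches the CrawlDB, so the lock is dropped right away.
     */
    static void generate(Path lockFile, boolean updateDb,
                         Runnable selector, Runnable partitioner, Runnable updater) {
        lock(lockFile);
        boolean held = true;
        try {
            selector.run();              // reads the CrawlDB
            if (!updateDb) {
                unlock(lockFile);        // early release
                held = false;
            }
            partitioner.run();           // writes segments only
            if (updateDb) {
                updater.run();           // writes back to the CrawlDB
            }
        } finally {
            if (held) {
                unlock(lockFile);        // late release, also on failure
            }
        }
    }
}
```

The `held` flag keeps the `finally` block from deleting a lock that another process may have taken after the early release.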
[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1070:
-------------------------------

    Attachment: (was: nutch.bat)

Run nutch under native windows (no cygwin)
------------------------------------------

    Key: NUTCH-1070
    URL: https://issues.apache.org/jira/browse/NUTCH-1070
    Project: Nutch
    Issue Type: New Feature
    Affects Versions: 1.3
    Environment: Windows XP Home
    Reporter: Radim Kolar
    Priority: Minor
    Labels: windows

It's possible to run Nutch on Windows without Cygwin.
1. The startup script needs to be ported from SH to BAT.
2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it work. Luckily only chmod, bash and df need to be emulated.
[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1070:
-------------------------------

    Attachment: (was: bash.c)

Run nutch under native windows (no cygwin)
------------------------------------------

    Key: NUTCH-1070
    URL: https://issues.apache.org/jira/browse/NUTCH-1070
    Project: Nutch
    Issue Type: New Feature
    Affects Versions: 1.3
    Environment: Windows XP Home
    Reporter: Radim Kolar
    Priority: Minor
    Labels: windows

It's possible to run Nutch on Windows without Cygwin.
1. The startup script needs to be ported from SH to BAT.
2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it work. Luckily only chmod, bash and df need to be emulated.
[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1070:
-------------------------------

    Attachment: (was: chmod.c)

Run nutch under native windows (no cygwin)
------------------------------------------

    Key: NUTCH-1070
    URL: https://issues.apache.org/jira/browse/NUTCH-1070
    Project: Nutch
    Issue Type: New Feature
    Affects Versions: 1.3
    Environment: Windows XP Home
    Reporter: Radim Kolar
    Priority: Minor
    Labels: windows

It's possible to run Nutch on Windows without Cygwin.
1. The startup script needs to be ported from SH to BAT.
2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it work. Luckily only chmod, bash and df need to be emulated.
[jira] [Updated] (NUTCH-1098) better url-normalizer basic
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: (was: patch-urlnormalizer.diff)

better url-normalizer basic
---------------------------

    Key: NUTCH-1098
    URL: https://issues.apache.org/jira/browse/NUTCH-1098
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.3
    Environment: Any
    Reporter: Radim Kolar
    Assignee: Markus Jelsma
    Labels: encoding, url
    Fix For: 1.5
    Original Estimate: 4h
    Remaining Estimate: 4h

Basic URL normalizer lacks 2 important features:
- Encode spaces in URLs as %20 to unbreak httpclient and possibly other clients that do not expect a space inside a URL.
- Ability to decode percent-encoding such as %33 in URLs. This is important for avoiding duplicates.
[jira] [Updated] (NUTCH-1098) better url-normalizer basic
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: patch-with-utf8-encoding.diff

Added support for encoding the string as UTF-8 and then percent-escaping it.

better url-normalizer basic
---------------------------

    Key: NUTCH-1098
    URL: https://issues.apache.org/jira/browse/NUTCH-1098
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.3
    Environment: Any
    Reporter: Radim Kolar
    Assignee: Markus Jelsma
    Labels: encoding, url
    Fix For: 1.5
    Attachments: patch-with-utf8-encoding.diff
    Original Estimate: 4h
    Remaining Estimate: 4h

Basic URL normalizer lacks 2 important features:
- Encode spaces in URLs as %20 to unbreak httpclient and possibly other clients that do not expect a space inside a URL.
- Ability to decode percent-encoding such as %33 in URLs. This is important for avoiding duplicates.
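For readers without access to the attached patch, the two requested behaviours plus the UTF-8 step might look roughly like the sketch below. It is not the contents of patch-with-utf8-encoding.diff: the class `UrlEscapeSketch` and the method `normalizeEscapes` are hypothetical names, and the choice to decode only RFC 3986 unreserved characters (letters, digits, `-`, `.`, `_`, `~`) is an assumption made so that reserved characters such as `%2F` keep their meaning.

```java
import java.nio.charset.StandardCharsets;

final class UrlEscapeSketch {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Unreserved characters per RFC 3986 never need to be escaped.
    private static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Returns 0..15 for an ASCII hex digit, or -1 otherwise.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Hypothetical helper: normalizes escaping in the path part of a URL. */
    static String normalizeEscapes(String path) {
        StringBuilder out = new StringBuilder(path.length());
        int i = 0;
        while (i < path.length()) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()) {
                int hi = hexValue(path.charAt(i + 1));
                int lo = hexValue(path.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        out.append((char) value);   // e.g. %33 becomes 3
                    } else {
                        // Keep the escape, with upper-case hex digits.
                        out.append('%').append(HEX[hi]).append(HEX[lo]);
                    }
                    i += 3;
                    continue;
                }
            }
            int cp = path.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp > 0x20 && cp < 0x7F) {
                out.append((char) cp);              // printable ASCII stays as-is
            } else {
                // Space, control and non-ASCII characters: UTF-8 bytes, then %XX.
                byte[] bytes = path.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
                }
            }
            i += len;
        }
        return out.toString();
    }
}
```

With this sketch, `normalizeEscapes("/a b")` yields `/a%20b` and `normalizeEscapes("/page%33.html")` yields `/page3.html`, so two spellings of the same URL collapse to one. A stray `%` that is not followed by two hex digits is passed through unchanged here. A real normalizer would have to decide how to treat it.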
[jira] [Updated] (NUTCH-1098) better url-normalizer basic
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Radim Kolar updated NUTCH-1098:
-------------------------------

    Attachment: (was: patch-urlnormalizer.diff)

better url-normalizer basic
---------------------------

    Key: NUTCH-1098
    URL: https://issues.apache.org/jira/browse/NUTCH-1098
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.3
    Environment: Any
    Reporter: Radim Kolar
    Assignee: Markus Jelsma
    Labels: encoding, url
    Fix For: 1.5
    Attachments: patch-urlnormalizer.diff
    Original Estimate: 4h
    Remaining Estimate: 4h

Basic URL normalizer lacks 2 important features:
- Encode spaces in URLs as %20 to unbreak httpclient and possibly other clients that do not expect a space inside a URL.
- Ability to decode percent-encoding such as %33 in URLs. This is important for avoiding duplicates.