[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1098:
---

Attachment: (was: patch-with-utf8-encoding.diff)

 better url-normalizer basic
 ---

 Key: NUTCH-1098
 URL: https://issues.apache.org/jira/browse/NUTCH-1098
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.3
 Environment: Any
Reporter: Radim Kolar
Assignee: Markus Jelsma
  Labels: encoding, url
 Fix For: 1.5

   Original Estimate: 4h
  Remaining Estimate: 4h

 The basic URL normalizer lacks two important features:
 1. Encoding spaces in URLs as %20, so that httpclient and other components 
 that do not expect a space inside a URL do not break.
 2. Decoding percent-escapes such as %33 in URLs. This is important for 
 avoiding duplicates.
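
 The two features described above could be sketched roughly as follows. This 
 is not the attached patch and not Nutch's actual normalizer code: the class 
 name UrlEscapeSketch and the method normalize are hypothetical, and the 
 choice to decode only RFC 3986 "unreserved" characters is an assumption made 
 so that decoding never changes the meaning of a URL. Non-ASCII characters 
 are encoded as UTF-8 before being percent-escaped.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the Nutch BasicURLNormalizer. */
public class UrlEscapeSketch {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // RFC 3986 "unreserved" characters never need percent-escaping.
    private static boolean isUnreserved(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Value of an ASCII hex digit, or -1 if the character is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Appends one byte as an uppercase %XX escape.
    private static void appendEscaped(StringBuilder out, int b) {
        out.append('%').append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
    }

    public static String normalize(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        int i = 0;
        while (i < url.length()) {
            char c = url.charAt(i);
            if (c == '%' && i + 2 < url.length()
                    && hexValue(url.charAt(i + 1)) >= 0
                    && hexValue(url.charAt(i + 2)) >= 0) {
                int value = hexValue(url.charAt(i + 1)) * 16
                          + hexValue(url.charAt(i + 2));
                if (isUnreserved(value)) {
                    out.append((char) value);   // e.g. %33 becomes 3
                } else {
                    appendEscaped(out, value);  // keep escape, uppercase hex
                }
                i += 3;
            } else if (c == ' ' || c > 0x7F) {
                // Space and non-ASCII: UTF-8 encode, then escape each byte.
                int cp = url.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    appendEscaped(out, b);
                }
                i += Character.charCount(cp);
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

 With this sketch, "http://example.com/a b" and "http://example.com/a%20b" 
 both normalize to the same string, as do "/%33" and "/3", which is the 
 duplicate-avoidance effect the issue asks for.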

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier

2011-11-03 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1194:
---

Comment: was deleted

(was: Locking should be done in a setup/cleanup task. Currently, if you kill 
the process submitting the generate job to Hadoop, the crawl database stays 
locked. This needs to be reworked: instead of running jobs one by one, submit 
them all at once and make them depend on each other. Once the jobs are placed 
in the Hadoop queue, you can kill the client without any ill effects.)

 CrawlDB lock should be released earlier
 ---

 Key: NUTCH-1194
 URL: https://issues.apache.org/jira/browse/NUTCH-1194
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5


 The lock on the CrawlDB is released only when everything is finished. When 
 generating many segments, however, the lock remains in place after it is no 
 longer necessary. If GENERATE_UPDATE_DB is false, we can release the lock 
 immediately after the selector has finished.
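
 The proposed ordering could be sketched as below. This is an illustration 
 only, not the Nutch Generator code: the class EarlyLockReleaseSketch, the 
 Phase interface, and the three phase parameters are hypothetical stand-ins, 
 and a plain lock file created with java.nio stands in for the CrawlDB lock. 
 The updateDb flag plays the role of GENERATE_UPDATE_DB.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of releasing a lock file as early as possible. */
public class EarlyLockReleaseSketch {

    /** Stand-in for one stage of the generate run. */
    public interface Phase {
        void run() throws IOException;
    }

    public static void generate(Path lockFile, boolean updateDb,
                                Phase selector, Phase partition, Phase update)
            throws IOException {
        // Fails with FileAlreadyExistsException if the lock is already held.
        Files.createFile(lockFile);
        boolean held = true;
        try {
            selector.run();            // reads the CrawlDB, needs the lock

            if (!updateDb) {
                // Nothing will touch the CrawlDB again: release right away.
                Files.delete(lockFile);
                held = false;
            }

            partition.run();           // writes segments only

            if (updateDb) {
                update.run();          // writes the CrawlDB, still locked
            }
        } finally {
            if (held) {
                Files.deleteIfExists(lockFile);
            }
        }
    }
}
```

 The finally block keeps the old behavior for the updateDb case and for 
 failures, so the lock is never left behind by this code path.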





[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1070:
---

Attachment: (was: nutch.bat)

 Run nutch under native windows (no cygwin)
 --

 Key: NUTCH-1070
 URL: https://issues.apache.org/jira/browse/NUTCH-1070
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.3
 Environment: Windows XP Home
Reporter: Radim Kolar
Priority: Minor
  Labels: windows

 It's possible to run Nutch on Windows without Cygwin. 
 1. The startup script needs to be ported from SH to BAT.
 2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it 
 work. Luckily, only chmod, bash and df need to be emulated.





[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1070:
---

Attachment: (was: bash.c)

 Run nutch under native windows (no cygwin)
 --

 Key: NUTCH-1070
 URL: https://issues.apache.org/jira/browse/NUTCH-1070
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.3
 Environment: Windows XP Home
Reporter: Radim Kolar
Priority: Minor
  Labels: windows

 It's possible to run Nutch on Windows without Cygwin. 
 1. The startup script needs to be ported from SH to BAT.
 2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it 
 work. Luckily, only chmod, bash and df need to be emulated.





[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1070:
---

Attachment: (was: chmod.c)

 Run nutch under native windows (no cygwin)
 --

 Key: NUTCH-1070
 URL: https://issues.apache.org/jira/browse/NUTCH-1070
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.3
 Environment: Windows XP Home
Reporter: Radim Kolar
Priority: Minor
  Labels: windows

 It's possible to run Nutch on Windows without Cygwin. 
 1. The startup script needs to be ported from SH to BAT.
 2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it 
 work. Luckily, only chmod, bash and df need to be emulated.





[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1098:
---

Attachment: (was: patch-urlnormalizer.diff)

 better url-normalizer basic
 ---

 Key: NUTCH-1098
 URL: https://issues.apache.org/jira/browse/NUTCH-1098
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.3
 Environment: Any
Reporter: Radim Kolar
Assignee: Markus Jelsma
  Labels: encoding, url
 Fix For: 1.5

   Original Estimate: 4h
  Remaining Estimate: 4h

 The basic URL normalizer lacks two important features:
 1. Encoding spaces in URLs as %20, so that httpclient and other components 
 that do not expect a space inside a URL do not break.
 2. Decoding percent-escapes such as %33 in URLs. This is important for 
 avoiding duplicates.





[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1098:
---

Attachment: patch-with-utf8-encoding.diff

Added support for encoding the string as UTF-8 and then percent-escaping it.

 better url-normalizer basic
 ---

 Key: NUTCH-1098
 URL: https://issues.apache.org/jira/browse/NUTCH-1098
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.3
 Environment: Any
Reporter: Radim Kolar
Assignee: Markus Jelsma
  Labels: encoding, url
 Fix For: 1.5

 Attachments: patch-with-utf8-encoding.diff

   Original Estimate: 4h
  Remaining Estimate: 4h

 The basic URL normalizer lacks two important features:
 1. Encoding spaces in URLs as %20, so that httpclient and other components 
 that do not expect a space inside a URL do not break.
 2. Decoding percent-escapes such as %33 in URLs. This is important for 
 avoiding duplicates.





[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-10-24 Thread Radim Kolar (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radim Kolar updated NUTCH-1098:
---

Attachment: (was: patch-urlnormalizer.diff)

 better url-normalizer basic
 ---

 Key: NUTCH-1098
 URL: https://issues.apache.org/jira/browse/NUTCH-1098
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.3
 Environment: Any
Reporter: Radim Kolar
Assignee: Markus Jelsma
  Labels: encoding, url
 Fix For: 1.5

 Attachments: patch-urlnormalizer.diff

   Original Estimate: 4h
  Remaining Estimate: 4h

 The basic URL normalizer lacks two important features:
 1. Encoding spaces in URLs as %20, so that httpclient and other components 
 that do not expect a space inside a URL do not break.
 2. Decoding percent-escapes such as %33 in URLs. This is important for 
 avoiding duplicates.
