[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-08-02 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517232
 ] 

Doğacan Güney commented on NUTCH-522:
-

> I tried with protocol-http and protocol-httpclient, and I got the same error
> when the url contained a space.
> I'm afraid it didn't change anything.

Actually, it is good news :). This means we can update the url pattern to 
exclude urls with spaces in them.

> I think you're right about the order, the normalizer should come first.

Btw, this is already what we do in ParseOutputFormat. Urls are normalized in 
Outlink's constructor, then validated and filtered in ParseOutputFormat. 

So, I am going to reverse validator/normalizer order in your patch and commit 
it soon.

Use URLValidator in the Injector

        Key: NUTCH-522
        URL: https://issues.apache.org/jira/browse/NUTCH-522
    Project: Nutch
 Issue Type: Improvement
 Components: injector
   Reporter: Emmanuel Joke
   Assignee: Emmanuel Joke
   Priority: Minor
    Fix For: 1.0.0
Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515991
 ] 

Doğacan Güney commented on NUTCH-522:
-

Btw, I thought about the validation stuff a bit, and IMHO it is better to run 
normalizers before UrlValidator (so the new order is normalize, validate, 
filter). It is possible that someone writes a normalizer that replaces spaces 
with %20 (so the url becomes valid). If we have such a normalizer, we should 
run it before validation so that the url will pass validation (and IMO, it 
should pass validation, since nutch can fetch a url with %20s).

I think your patch looks good, but I will wait a while to hopefully get some 
comments on putting normalizers before validator.
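
A minimal sketch of the normalize, validate, filter order proposed above. It does not use the Nutch API: `NORMALIZE`, `IS_VALID`, `ACCEPT` and `inject` are hypothetical stand-ins, and the validity rule (http scheme, no space) is an assumption chosen only to show why a space-encoding normalizer has to run first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class InjectOrderSketch {
    // Hypothetical normalizer: percent-encodes spaces so the url becomes valid.
    static final UnaryOperator<String> NORMALIZE = url -> url.replace(" ", "%20");

    // Hypothetical validator: accepts only http urls that contain no space.
    static final Predicate<String> IS_VALID =
        url -> url.startsWith("http://") && url.indexOf(' ') < 0;

    // Hypothetical filter: a policy decision, here "skip gif images".
    static final Predicate<String> ACCEPT = url -> !url.endsWith(".gif");

    // Applies the three steps in the order normalize, validate, filter.
    static List<String> inject(List<String> seeds) {
        List<String> injected = new ArrayList<>();
        for (String seed : seeds) {
            String url = NORMALIZE.apply(seed);
            if (!IS_VALID.test(url)) {
                continue; // invalid even after normalization
            }
            if (!ACCEPT.test(url)) {
                continue; // valid, but excluded by the filter
            }
            injected.add(url);
        }
        return injected;
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(
            "http://example.com/a b",
            "http://example.com/logo.gif");
        System.out.println(inject(seeds)); // prints [http://example.com/a%20b]
    }
}
```

With the opposite order (validate first), the first seed would be dropped before the normalizer had a chance to repair it.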



[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515992
 ] 

Doğacan Güney commented on NUTCH-522:
-

I forgot to ask: are you using protocol-http or protocol-httpclient? It is 
possible that httpclient does some sort of normalization before requesting a 
url, so (maybe) it can fetch a url like:

http://autos.yahoo.com/carfinder/?bodystyle=CPEfuel=Gasexpanded=bodystyle; 
expanded=fuel

or maybe it can't :) .



[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516138
 ] 

Emmanuel Joke commented on NUTCH-522:
-

I tried with protocol-http and protocol-httpclient, and I got the same error 
when the url contained a space.
I'm afraid it didn't change anything.

I think you're right about the order, the normalizer should come first.



[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-20 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514144
 ] 

Doğacan Güney commented on NUTCH-522:
-

> Oops, my mistake. Please find an updated patch.

This patch looks good.

> For instance: http://lucene.apache.org/jira/browse.jsp?itemid=500 &sort=up
> A space between 500 and & has been accepted.
> Is it normal?
>
> I really want to exclude those kinds of URLs.

UrlValidator is meant to eliminate anything nutch can't fetch. So, if the 
fetcher fails while trying to fetch that url, then UrlValidator should have 
eliminated it, and it is a bug.

[...snip...]
> It includes an option to disallow FRAGMENTS. Why don't we have this version
> in nutch ?

Because urlfilters can already do that, so I didn't want to duplicate 
functionality. UrlValidator eliminates invalid urls, then urlnormalizers and 
urlfilters decide what to do with the remaining ones. You can remove fragments 
or skip urls with fragments.
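
A rough sketch of the two fragment policies mentioned above, kept separate from validation. `stripFragment` and `skipFragments` are hypothetical helpers, not Nutch classes; returning null to mean "skip this url" is an assumption made for the illustration.

```java
public class FragmentSketch {
    // Normalizer-style policy: keep the url but drop everything from the first '#'.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Filter-style policy: reject any url that carries a fragment.
    // Returning null means "skip this url" (an assumption of this sketch).
    static String skipFragments(String url) {
        return url.indexOf('#') >= 0 ? null : url;
    }

    public static void main(String[] args) {
        String url = "http://example.com/page#top";
        System.out.println(stripFragment(url)); // prints http://example.com/page
        System.out.println(skipFragments(url)); // prints null
    }
}
```

Either policy can be chosen per deployment, which is why a fragment option in the validator itself would duplicate functionality.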

Use URLValidator in the Injector

        Key: NUTCH-522
        URL: https://issues.apache.org/jira/browse/NUTCH-522
    Project: Nutch
 Issue Type: Improvement
 Components: injector
   Reporter: Emmanuel Joke
   Priority: Minor
    Fix For: 1.0.0
Attachments: NUTCH-522.patch, NUTCH-522_v2.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector



[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-20 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514153
 ] 

Emmanuel Joke commented on NUTCH-522:
-

Actually I tried to fetch the url 
http://autos.yahoo.com/carfinder/?bodystyle=CPE fuel=Gas expanded=bodystyle 
expanded=fuel and it didn't work within Nutch.

But if you remove the space: 
http://autos.yahoo.com/carfinder/?bodystyle=CPEfuel=Gasexpanded=bodystyleexpanded=fuel,
 it does work perfectly.

So, I guess we have to add a new check regarding spaces in URLs. Any ideas?



[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-20 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514157
 ] 

Doğacan Güney commented on NUTCH-522:
-

> So, I guess we have to add a new check regarding spaces in URLs. Any ideas?

OK, it is a bug then. 

I would suggest that you add a main method to UrlValidator ( like this one: 
http://www.ceng.metu.edu.tr/~e1345172/validator_main.patch ), then debug 
UrlValidator to check why it accepts that url. Also, if commons-validator's 
UrlValidator rejects that url, you can debug the original UrlValidator to see 
where it invalidates it.

My guess is that it may be related to LEGAL_ASCII_PATTERN. I couldn't get the 
original validator's LEGAL_ASCII_PATTERN to work with java.util.regex, so I 
wrote a new pattern, but I thought the new pattern was stricter than the old 
one.
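
A small debugging harness in the spirit of the suggested main method. It is only a sketch: the two patterns are assumptions written for illustration and are not the LEGAL_ASCII_PATTERN from Nutch or commons-validator. It shows one way a "legal ASCII" check can let a space through, since the space (0x20) is itself an ASCII character.

```java
import java.util.regex.Pattern;

public class AsciiPatternSketch {
    // Assumed pattern 1: any ASCII character, which includes the space.
    static final Pattern ANY_ASCII = Pattern.compile("^\\p{ASCII}+$");

    // Assumed pattern 2: printable ASCII from '!' (0x21) to '~' (0x7E), so no space.
    static final Pattern NO_SPACE_ASCII = Pattern.compile("^[\\x21-\\x7E]+$");

    static boolean matches(Pattern pattern, String url) {
        return pattern.matcher(url).matches();
    }

    // Pass urls as arguments, or run without arguments to check a sample url.
    public static void main(String[] args) {
        String[] urls = args.length > 0
            ? args
            : new String[] {"http://example.com/browse.jsp?itemid=500 sort=up"};
        for (String url : urls) {
            System.out.println(url
                + "\tany-ascii=" + matches(ANY_ASCII, url)
                + "\tno-space-ascii=" + matches(NO_SPACE_ASCII, url));
        }
    }
}
```

Running the real validator over the failing url in the same way, one pattern at a time, should narrow down which check accepts the space.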



[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-19 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513895
 ] 

Doğacan Güney commented on NUTCH-522:
-

I like the idea, but your patch seems to have a bug. Now the injector injects 
a url only if it is *not* valid.

Injector.java:75:  if (!validator.isValid(url)) {

I think you should put a return there instead of moving the normalizing and 
filtering code into that branch.
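
A sketch of the suggested control flow, using hypothetical stand-ins (`isValid`, `normalizeAndFilter`, `process`) and not the actual Injector code. The invalid case returns early, and the normalizing and filtering code stays outside the branch.

```java
public class EarlyReturnSketch {
    // Hypothetical validity check: http scheme and no space.
    static boolean isValid(String url) {
        return url != null && url.startsWith("http://") && url.indexOf(' ') < 0;
    }

    // Hypothetical stand-in for the existing normalizing and filtering code;
    // returns null when the url is filtered out.
    static String normalizeAndFilter(String url) {
        return url.endsWith(".gif") ? null : url;
    }

    // Returns the url to inject, or null when it should be skipped.
    static String process(String url) {
        if (!isValid(url)) {
            return null; // skip invalid urls here, nothing else goes in this branch
        }
        // Valid urls continue through normalizing and filtering as before.
        return normalizeAndFilter(url);
    }

    public static void main(String[] args) {
        System.out.println(process("http://example.com/a b")); // prints null
        System.out.println(process("http://example.com/"));    // prints http://example.com/
    }
}
```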

Use URLValidator in the Injector

        Key: NUTCH-522
        URL: https://issues.apache.org/jira/browse/NUTCH-522
    Project: Nutch
 Issue Type: Improvement
 Components: injector
   Reporter: Emmanuel Joke
   Priority: Minor
    Fix For: 1.0.0
Attachments: NUTCH-522.patch

Same as NUTCH-505, we should use the UrlValidator to check url in the Injector
