Wrong regular expression hangs the reduce process forever
---------------------------------------------------------

         Key: NUTCH-233
         URL: http://issues.apache.org/jira/browse/NUTCH-233
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
    Priority: Blocker
     Fix For: 0.8-dev


It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt 
is not compatible with java.util.regex, which is what the regex URL filter 
actually uses. 
The expression was probably not updated when the regular expression package 
was changed.
The problem was that, while reducing the fetch map output, the reducer hung 
forever because the output format applied the URL filter to a URL that caused 
the hang.
060315 230823 task_r_3n4zga     at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga     at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga     at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the 
fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
However, maybe people can review it and suggest improvements, since the two 
expressions are not equivalent. Both the old and the new regex match:
"abcd/foo/bar/foo/bar/foo/"
But the old regex would also match:
"abcd/foo/bar/xyz/foo/bar/foo/"
which the new regex will not match.
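
For anyone reviewing the change, the difference between the two expressions can be checked with a small standalone program. This is only a sketch: the class name UrlLoopFilterDemo and the helper hits() are made up for illustration, and it assumes a rule is tested with Matcher.find(), which may not be exactly how the URL filter applies it. The two sample strings are short, so the old expression returns quickly here; the sketch does not reproduce the hang.

```java
import java.util.regex.Pattern;

public class UrlLoopFilterDemo {
    // Old rule, as quoted in this report.
    static final Pattern OLD_RULE =
        Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Replacement rule proposed in this report. [^/]+ matches exactly one
    // path segment, so the repeated segments must be two levels apart.
    static final Pattern NEW_RULE =
        Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Hypothetical helper. Assumption: a rule "hits" a URL when find() succeeds.
    static boolean hits(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abcd/foo/bar/foo/bar/foo/",
            "abcd/foo/bar/xyz/foo/bar/foo/"
        };
        for (String url : samples) {
            System.out.println(url
                + "  old=" + hits(OLD_RULE, url)
                + "  new=" + hits(NEW_RULE, url));
        }
    }
}
```

The first sample is matched by both rules; the second is matched only by the old rule, because "bar/xyz" spans two path segments and [^/]+ cannot cross a slash.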

