Wrong regular expression hangs reduce process forever
------------------------------------------------------
Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev
It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt
is not compatible with java.util.regex, which is what the regex URL filter
actually uses.
Maybe updating it was overlooked when the regular expression package was
changed.
The problem is that while reducing the fetch map output the reducer hangs
forever, because the output format applies the URL filter to a URL that causes
the hang.
060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:
I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the
fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
However, people may want to review it and suggest improvements, since the old
regex would match:
"abcd/foo/bar/foo/bar/foo/"
and so will the new one. But the old regex would also match:
"abcd/foo/bar/xyz/foo/bar/foo/"
which the new regex will not match.
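For anyone reviewing the change, here is a minimal sketch that checks both
expressions against the two example strings above using java.util.regex. The
class name LoopFilterCheck and its helper method are made up for illustration
and are not part of Nutch, and the use of find() rather than matches() is an
assumption about how the filter applies its patterns. Both inputs are short, so
the old expression finishes quickly here; this does not reproduce the hang.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for comparing the two expressions; not Nutch code.
public class LoopFilterCheck {
    // Old expression from regex-urlfilter.txt (lazy quantifiers, may span several path segments).
    static final String OLD_REGEX = ".*(/.+?)/.*?\\1/.*?\\1/";
    // New expression from this issue (each part is limited to a single path segment).
    static final String NEW_REGEX = ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/";

    // Returns true if the expression is found anywhere in the URL (assumed find() semantics).
    static boolean matches(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "abcd/foo/bar/foo/bar/foo/",
            "abcd/foo/bar/xyz/foo/bar/foo/"
        };
        for (String url : urls) {
            System.out.println(url
                + " old=" + matches(OLD_REGEX, url)
                + " new=" + matches(NEW_REGEX, url));
        }
    }
}
```

The second string is the case where the two expressions differ, so it is the one
to look at when deciding whether the narrower new expression is acceptable.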