[Nutch-dev] [jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ]

Sean Dean commented on NUTCH-233:

Could I suggest that this change, from

    .*(/.+?)/.*?\1/.*?\1/

to

    .*(/[^/]+)/[^/]+\1/[^/]+\1/

be committed to at least trunk for the time being?

I recently created a segment with exactly 1M URLs. I ran the fetch, and it did indeed stall in the reduce part of the operation because of the regex filter. This was verified with a thread dump (kill -3 pid) on FreeBSD. I then made the suggested change in the config file and re-fetched the exact same segment. It completed without issue.

I'm aware we might be losing some filtering functionality with this new expression, but is that not better than knowing there is always a chance your whole-web crawl fetch will fail because of this?

wrong regular expression hang reduce process for ever

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.9.0

It looks like the expression

    .*(/.+?)/.*?\1/.*?\1/

in regex-urlfilter.txt wasn't compatible with java.util.regex, which is what the regex URL filter actually uses. Maybe it was overlooked when the regular expression package was changed.

The problem was that while reducing a fetch map output, the reducer hangs forever, because the output format applies the URL filter to a URL that causes the hang.

060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to

    .*(/[^/]+)/[^/]+\1/[^/]+\1/

and now the fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)

However, maybe people can review it and suggest improvements, since the old regex would match:

    abcd/foo/bar/foo/bar/foo/

and so will the new one. But the old regex would also match:

    abcd/foo/bar/xyz/foo/bar/foo/

which the new regex will not match.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira

___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
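The difference between the two expressions can be checked directly against the two example URLs from the issue description. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes the filter looks for the pattern anywhere in the URL (find() rather than matches()), which is not confirmed in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
final class LoopFilterRegexDemo {
    // Both expressions are copied from the issue description.
    static final Pattern OLD_RULE = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    static final Pattern NEW_RULE = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: the rule applies if the pattern is found anywhere in the URL.
    static boolean hits(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "abcd/foo/bar/foo/bar/foo/",     // repeated segment, one segment in between
            "abcd/foo/bar/xyz/foo/bar/foo/"  // repeated segment, two segments in between once
        };
        for (String url : urls) {
            System.out.printf("%s old=%b new=%b%n",
                    url, hits(OLD_RULE, url), hits(NEW_RULE, url));
        }
    }
}
```

The new expression only allows exactly one path segment between the repeats, which is why the second URL is no longer caught. These inputs are short, so both patterns return quickly here; the sketch says nothing about run time on long URLs.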
[Nutch-dev] [jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ]

Stefan Groschupf commented on NUTCH-233:

Hi Otis,
yes, for a serious whole-web crawl I need to change this regex first. It only hangs with some random URLs, for example ones that come from link farms the crawler runs into.

wrong regular expression hang reduce process for ever

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.9.0
[Nutch-dev] [jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ]

Stefan Groschupf commented on NUTCH-233:

I think this should be fixed in 0.8 too, since everybody who does a real whole-web crawl with over 100 million pages will run into this problem. The problematic URLs are, for example, URLs generated by spam bots.

wrong regular expression hang reduce process for ever

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.9-dev
[Nutch-dev] [jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370685 ]

Jerome Charron commented on NUTCH-233:

Stefan,
I have created a small unit test for urlfilter-regexp and I don't notice any incompatibility in java.util.regex with this regexp. Could you please provide the URLs that cause the problem so that I can add them to my unit tests?

Thanks
Jérôme

wrong regular expression hang reduce process for ever

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev
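A unit test that feeds suspect URLs to the filter expression needs some time budget, otherwise a URL that triggers the hang would stall the test run as well. One possible approach, sketched below, wraps the URL in a CharSequence that checks a deadline on every character access. This is not the test mentioned in the comment above and not Nutch code: all class and method names are hypothetical, and the budget value is an arbitrary choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Nutch or of the unit test discussed above.
final class DeadlineRegexDemo {

    // Thrown by the wrapper once the time budget is used up.
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) { super(message); }
    }

    // CharSequence view that checks a deadline each time the matcher reads a char.
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex time budget exceeded");
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    // Returns "match", "no match" or "timeout" for one URL and one rule.
    static String classify(Pattern rule, String url, long budgetMillis) {
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        try {
            boolean found = rule.matcher(new DeadlineCharSequence(url, deadline)).find();
            return found ? "match" : "no match";
        } catch (RegexTimeoutException e) {
            return "timeout";
        }
    }

    public static void main(String[] args) {
        // Expression copied from the issue description.
        Pattern oldRule = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
        System.out.println(classify(oldRule, "abcd/foo/bar/foo/bar/foo/", 1000));
    }
}
```

With a helper like this, a URL that runs past the budget is reported as "timeout" instead of blocking the caller, so problem URLs can be collected and added to a test suite once they are known.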