[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ]

Sean Dean commented on NUTCH-233:
---------------------------------

Could I suggest that this change, from ".*(/.+?)/.*?\1/.*?\1/" to 
".*(/[^/]+)/[^/]+\1/[^/]+\1/", be committed to at least trunk for the time being?

I recently created a segment with exactly 1M URLs and ran the fetch, and it did 
indeed stall in the reduce part of the operation because of the regex filter. 
This was verified with a thread dump (kill -3 <pid>) on FreeBSD.

I then made the suggested change in the config file and re-fetched the exact 
same segment. It completed without issue.
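
For anyone who wants a rough feel for why the two expressions behave so 
differently, here is a minimal sketch that counts how many characters 
java.util.regex reads while searching one URL with each rule. The 
CountingSequence wrapper, the class names and the sample URL are made up for 
illustration and are not part of Nutch; I am also assuming the filter searches 
with find(). It only shows relative cost on a short input and does not 
reproduce the actual hang.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): wraps a string and counts charAt calls.
final class CountingSequence implements CharSequence {
    private final String text;
    long reads = 0;

    CountingSequence(String text) { this.text = text; }

    @Override public int length() { return text.length(); }
    @Override public char charAt(int index) { reads++; return text.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) { return text.substring(start, end); }
    @Override public String toString() { return text; }
}

final class BacktrackingCost {
    // The two expressions quoted in this issue, escaped for Java string literals.
    static final String OLD_RULE = ".*(/.+?)/.*?\\1/.*?\\1/";
    static final String NEW_RULE = ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/";

    // Returns how many characters the matcher read while searching the URL for the rule.
    // Assumption: the rule is applied as an unanchored search via find().
    static long readsFor(String rule, String url) {
        CountingSequence input = new CountingSequence(url);
        Pattern.compile(rule).matcher(input).find();
        return input.reads;
    }

    public static void main(String[] args) {
        // Made-up URL with no repeated path segment, so neither rule finds a match
        // and both have to exhaust their backtracking options.
        String url = "http://example.com/a1/b2/c3/d4/e5/f6/g7/h8/i9/j10/k11/l12/";
        System.out.println("old rule reads: " + readsFor(OLD_RULE, url));
        System.out.println("new rule reads: " + readsFor(NEW_RULE, url));
    }
}
```

The old rule lets "." run across "/" in three places, so the number of ways to 
split the URL grows quickly with its length; the new rule confines each piece 
to a single path segment.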

I'm aware we might be losing some filtering functionality with this new 
expression, but is that not better than knowing there is always a chance your 
whole-web crawl fetch will fail because of this?
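
To make the lost functionality concrete, here is a small sketch that runs both 
expressions over the two sample strings from the issue description quoted 
below. The class and method names are mine, and the use of find() is an 
assumption about how the filter applies its rules.

```java
import java.util.regex.Pattern;

final class RuleComparison {
    // The two expressions quoted in this issue, escaped for Java string literals.
    static final Pattern OLD_RULE = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    static final Pattern NEW_RULE = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: a rule "hits" when it is found anywhere in the URL (unanchored search).
    static boolean hits(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        // Sample strings taken from the issue description.
        String[] samples = {
            "abcd/foo/bar/foo/bar/foo/",
            "abcd/foo/bar/xyz/foo/bar/foo/"
        };
        for (String url : samples) {
            System.out.println(url + " old=" + hits(OLD_RULE, url) + " new=" + hits(NEW_RULE, url));
        }
    }
}
```

The new expression only catches a segment that repeats with exactly one other 
segment in between each time, so loops with a longer or irregular period would 
get through.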

> wrong regular expression hang reduce process for ever
> -----------------------------------------------------
>
>                 Key: NUTCH-233
>                 URL: http://issues.apache.org/jira/browse/NUTCH-233
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.9.0
>
>
> It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt 
> isn't compatible with java.util.regex, which is what the regex URL filter 
> actually uses. 
> Maybe updating it was missed when the regular expression package was changed.
> The problem was that while reducing the fetch map output, the reducer hung 
> forever, because the output format was applying the URL filter to a URL that 
> causes the hang.
> 060315 230823 task_r_3n4zga     at java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga     at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga     at java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the 
> fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.)
> However, maybe people can review it and suggest improvements, since the old 
> regex would match:
> "abcd/foo/bar/foo/bar/foo/" and the new one matches it too. But the old regex 
> would also match:
> "abcd/foo/bar/xyz/foo/bar/foo/" which the new regex will not match.


        
