[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ]
Sean Dean commented on NUTCH-233:
Could I suggest that this change, from .*(/.+?)/.*?\1/.*?\1/ to
.*(/[^/]+)/[^/]+\1/[^/]+\1/ be committed to at least trunk for
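To make the proposed change concrete, here is a minimal sketch that runs both expressions from the comment above against two sample URLs with java.util.regex. The class name, the helper method, the example.com URLs, and the use of find() are assumptions made for illustration only; they are not taken from the Nutch source. The old expression lets .+? and .*? span any number of path segments, which is the likely source of the heavy backtracking reported in this issue, while the proposed one restricts each piece to a single segment with [^/]+.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class RepeatedSegmentDemo {
    // Expression currently in use, as quoted in the comment.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    // Proposed replacement, as quoted in the comment.
    static final Pattern PROPOSED = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: the filter looks for the pattern anywhere in the URL.
    static boolean hits(Pattern p, String url) {
        return p.matcher(url).find();
    }

    public static void main(String[] args) {
        String looping = "http://example.com/a/b/a/b/a/b/"; // "/a" occurs three times
        String normal = "http://example.com/a/b/c/d/";      // no repeated segment

        System.out.println("old, looping:      " + hits(OLD, looping));
        System.out.println("proposed, looping: " + hits(PROPOSED, looping));
        System.out.println("old, normal:       " + hits(OLD, normal));
        System.out.println("proposed, normal:  " + hits(PROPOSED, normal));
    }
}
```

Both expressions flag the looping URL and leave the normal one alone. The sketch does not reproduce the hang itself, and the two expressions are not equivalent for every input, since the proposed one requires exactly one path segment between repetitions.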
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ]
Stefan Groschupf commented on NUTCH-233:
Hi Otis,
Yes, for a serious whole-web crawl I need to change this regex first.
It only hangs with some random URLs
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ]
Stefan Groschupf commented on NUTCH-233:
I think this should be fixed in 0.8 too, since everybody who does a real whole-web
crawl with over 100 million pages
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370685 ]
Jerome Charron commented on NUTCH-233:
Stefan,
I have created a small unit test for urlfilter-regexp and I didn't notice any
incompatibility in java.util.regex with