Jérôme Charron wrote:
I forgot to add: it is also slightly incompatible with Perl regex (and
consequently with Java regex). I don't remember the details, they are
somewhere in the docs, but the incompatibility is caused by some rarely
used operators not being implemented... so I guess we could live with it.
I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in Nutch).
Of course, the regexps can easily be rewritten. But how will we deal with
this if we switch to dk.brics.automaton?
* Change the syntax used in Nutch?
* Convert Perl syntax to dk.brics.automaton?
* Simulate the ^ and $ operators by prepending and appending special start
and end markers to the input string.
E.g.:
String START = "__START__";
String END = "__END__";
inputString = START + inputString + END;
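To make the marker idea a bit more concrete, here is a rough sketch of the
pattern side of it. It is only an illustration under several assumptions:
the class AnchorMarkers and its methods rewrite() and matches() are made-up
names, only a single leading ^ and a single unescaped trailing $ are handled,
and java.util.regex stands in for dk.brics.automaton so the snippet runs
without the extra jar. Pattern.matches() is used because it matches the
whole string, which is my understanding of how the automaton's run() behaves.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or dk.brics.automaton.
class AnchorMarkers {
    // Marker strings from the message above; assumed never to occur in a real URL.
    static final String START = "__START__";
    static final String END = "__END__";

    // Turns a leading ^ and an unescaped trailing $ into literal markers and
    // wraps the result in .* so that a whole-string matcher can be used.
    // Naive: anchors inside alternations or groups are not handled.
    static String rewrite(String perlPattern) {
        String body = perlPattern;
        if (body.startsWith("^")) {
            body = START + body.substring(1);
        }
        if (body.endsWith("$") && !body.endsWith("\\$")) {
            body = body.substring(0, body.length() - 1) + END;
        }
        return ".*(" + body + ").*";
    }

    // Wraps the input in the same markers and requires a whole-string match.
    static boolean matches(String perlPattern, String input) {
        String wrapped = START + input + END;
        return Pattern.matches(rewrite(perlPattern), wrapped);
    }
}
```

With this, a filter rule such as ^http:// would only accept URLs that really
start with http://, and \.(gif|jpg)$ would only accept URLs that really end
with one of those suffixes, without the engine ever seeing ^ or $.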
I will run some benchmarks based on regex-urlfilter to evaluate whether
integrating dk.brics.automaton into Nutch is worthwhile
(given the changes needed in the regexp syntax).
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com