Jérôme Charron wrote:
I forgot to add: it is also slightly incompatible with Perl regex (and
consequently with Java regex). I don't remember the details, they are
somewhere in the docs, but the incompatibility is caused by some rarely
used operators not being implemented... so I guess we could live with it.

I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in Nutch).
Of course, the regexps can easily be rewritten. But how will we deal with
this if we switch to dk.brics.automaton?
* Change the syntax used in Nutch?
* Convert perl syntax to dk.brics.automaton?

* Simulate the ^ and $ operators by prepending and appending special start and end markers to the input string.

E.g.
   String START = "__START__";
   String END = "__END__";
   inputString = START + inputString + END;
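
A slightly fuller sketch of the same idea follows. It is only an illustration: it uses java.util.regex with whole-string matching as a stand-in for an engine that has no ^ and $, and it does not call the dk.brics.automaton API at all. The class and method names (AnchorMarkers, rewrite, matches) are made up for this example. It only handles a single leading ^ and a single trailing unescaped $; anchors inside groups or alternations, and a pattern ending in an escaped backslash followed by $, are not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper: simulates ^ and $ with literal markers on an engine
// that only answers "does the whole string match?".
class AnchorMarkers {
    // Marker strings taken from the snippet above; they must never occur in real input.
    static final String START = "__START__";
    static final String END = "__END__";

    // Rewrites a leading ^ into the START marker and a trailing unescaped $
    // into the END marker. A side without an anchor gets ".*" instead, because
    // the whole wrapped string has to be consumed by the match.
    static String rewrite(String regex) {
        String head = ".*";
        String tail = ".*";
        String body = regex;
        if (body.startsWith("^")) {
            head = Pattern.quote(START);
            body = body.substring(1);
        }
        if (body.endsWith("$") && !body.endsWith("\\$")) {
            tail = Pattern.quote(END);
            body = body.substring(0, body.length() - 1);
        }
        // (?: ) is java.util.regex syntax; another engine may need plain parentheses.
        return head + "(?:" + body + ")" + tail;
    }

    // Wraps the input in the markers and tests the rewritten pattern against
    // the whole wrapped string.
    static boolean matches(String regex, String input) {
        String wrapped = START + input + END;
        return Pattern.compile(rewrite(regex)).matcher(wrapped).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("^http://", "http://example.com/"));
        System.out.println(matches("\\.gif$", "http://example.com/a.gif?x=1"));
    }
}
```

One caveat of this approach: an unanchored pattern can also match text inside the markers themselves, so the markers should be strings that no URL rule would ever look for.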

I will run some benchmarks based on regexp-urlfilter to evaluate whether
integrating dk.brics.automaton into Nutch is worthwhile (given the
changes needed in the regexp syntax).

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
