Great stuff, Rodrigo! Welcome. Your comments are right on the mark. While Lucene has a great architecture for building flexible text-processing systems, the supplied tokenizers and analyzers aren't perfect. Fortunately, it's easy to add new ones.
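For instance, a new filter is just a TokenFilter subclass that rewrites
each token as it streams by. Here's a minimal sketch (the class name and
folding table are mine, and it assumes the Token/TokenFilter API of
current Lucene releases):

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    /** Folds a few accented characters to their unaccented equivalents.
     *  Illustrative only -- a real filter would use a fuller table. */
    public final class SimpleAccentFilter extends TokenFilter {

      public SimpleAccentFilter(TokenStream input) {
        super(input);
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t == null)
          return null;                      // end of stream
        String text = t.termText();
        StringBuffer buf = new StringBuffer(text.length());
        for (int i = 0; i < text.length(); i++) {
          char c = text.charAt(i);
          switch (c) {
            case '\u00e9':                  // e-acute
            case '\u00e8':                  // e-grave
            case '\u00ea':                  // e-circumflex
              buf.append('e'); break;
            case '\u00e0':                  // a-grave
            case '\u00e2':                  // a-circumflex
              buf.append('a'); break;
            default:
              buf.append(c);
          }
        }
        // keep the original offsets and type; replace only the text
        return new Token(buf.toString(),
                         t.startOffset(), t.endOffset(), t.type());
      }
    }

Chain it into an Analyzer's tokenStream() like any of the built-in
filters, and both indexing and querying will see the normalized terms.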
> Well, in fact my main point is the following: having one filter per
> language is wrong. Second point is: having the filter algorithm
> hard-coded in a programming language is wrong as well. There should
> be a simple way of specifying a filter in a simple, dedicated
> language. In this way, the snowball project is really interesting, as
> it solves the issue. In my mind, there should be mainly a normalizer
> engine, with many configuration files, easy to modify to implement or
> adapt a filter. This is an important issue, as the accuracy of the
> search engine is directly linked to the normalization strategy.

I'm all for domain-specific languages, but you have to be careful about
making the filter language too easy to change: if the filter changes
after an archive has been created and its documents indexed, searches
will stop working. For example, if accents were stripped at index time
but a revised filter leaves them intact, a query containing an accented
word will no longer match anything. So any such filtering language
should produce code (or data) that becomes part of the program, rather
than a configuration file shipped alongside it. In other words, it
should be treated as source code, not configuration data. A sketch of
what I mean follows at the end of this message.

> Before going on the process of submitting it to the lucene project,
> I'd like to hear your comments on the approach. Of high concern is
> the language used to describe the normalization process, as I am not
> fully satisfied with it, but hey, it's hard to find something really
> simple yet just expressive enough.

Great idea! We'd love to have something like this; it's exactly the
sort of contribution we're looking for. I'm willing to help write a
parser for it if the language gets complicated.
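To make the "source code, not configuration" point concrete, here is a
hypothetical sketch; the rule syntax and class name below are invented
for illustration, not anything that exists today. A rule file such as:

    # french.rules -- invented syntax
    map "\u00e9" -> "e"        # e-acute
    map "\u00e8" -> "e"        # e-grave
    strip-suffix "ment"

would be run through a small generator at build time, producing an
ordinary Java class that gets compiled into the release:

    // Generated from french.rules at build time.  Do not edit by hand;
    // changing the rules means a new release and a rebuilt index.
    public final class FrenchRules {
      // one-to-one character folding table
      public static final char[] FROM = { '\u00e9', '\u00e8' };
      public static final char[] TO   = { 'e',      'e'      };
      // suffixes removed during normalization
      public static final String[] SUFFIXES = { "ment" };
    }

Because the tables are baked into the program, the filter applied at
search time is guaranteed to be the same one used at index time; the
rules can only change when a new version ships and the index is
rebuilt, which is exactly the discipline we want.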
