Hello everyone,
In the course of my work I needed LuCas to do a few things that were not
possible, or at least not easy, out of the box. I checked out the latest
LuCas version and adapted it to suit my needs.
In this mail I'd like to describe the changes I have made. In case you feel
these changes should make their way back into the repository, I would be glad
to share them. On the other hand, some of the things I did may already have
been possible before (I could simply have missed that); in such cases I would
of course refactor my changes.
The main extension arose from the following idea:
In the documents I want to index, gene names and identifiers are tagged (into a
UIMA type 'Gene'). These identifiers are indexed so you can search for them.
For faceting purposes I send these identifiers into a Lucene field named
'facetTerms'. However, I have quite a lot of identifiers AND the
identifiers are organized in multiple categories in my application. The best
thing for me would be to have a single field for each of these categories,
containing only gene identifiers belonging to this category.
This makes it easy to obtain facet counts per category.
Now I have over 20 categories and I did not like the idea of a LuCas mapping
file with 20 copies of nearly the same field definition.
So I introduced new attributes on the field element in the mapping file.
These attributes specify:
- a file defining the association between each possible term and its category
  (same format as the hypernym file, so one term can belong to multiple
  categories);
- the naming scheme of the new fields;
- whether to ignore case when comparing the entries of the above-mentioned
  file to the actual terms extracted from the documents.
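Purely for illustration, such a field element might look roughly like this.
The attribute names and the file path here are made up by me for this sketch;
they are not necessarily the names used in my patch:

```xml
<!-- hypothetical attribute names, for illustration only -->
<field name="facetTerms"
       termCategoryFile="/path/to/gene-categories.txt"
       categoryFieldNamePattern="facetTerms_%category%"
       categoryIgnoreCase="true"/>
```

The naming-scheme attribute would then expand to one Lucene field per
category found in the association file.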
I wrote a class which distributes the terms to their categories by creating
the corresponding TokenStreams. Each TokenStream lets only those tokens pass
which belong to its category; these tokens are determined by the association
file described above. Thus we need the opposite of a StopWordFilter, so I've
added a 'SelectFilter' for this purpose. This filter takes a set representing
a closed vocabulary, lets through the tokens that are contained in the set,
and discards all others (this is where the ignore-case option comes into
play).
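To make the idea concrete, here is a minimal sketch of the selection logic,
deliberately independent of Lucene's TokenFilter API (the class and method
names are mine, not the ones in the patch):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch of the SelectFilter idea: the inverse of a stop filter.
// Only tokens contained in a closed vocabulary pass; all others are dropped.
public class SelectFilterSketch {
    private final Set<String> vocabulary;
    private final boolean ignoreCase;

    public SelectFilterSketch(Set<String> vocabulary, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        this.vocabulary = new HashSet<>();
        for (String term : vocabulary) {
            // Normalize the vocabulary once if case should be ignored.
            this.vocabulary.add(ignoreCase ? term.toLowerCase(Locale.ROOT) : term);
        }
    }

    // A token is kept only if it is part of the vocabulary.
    public boolean accept(String token) {
        String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
        return vocabulary.contains(key);
    }

    public List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accept(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

In the real filter the same check would happen inside incrementToken(), one
SelectFilter instance per category, each holding that category's terms.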
Another thing I did was to implement a regular-expression replacement filter:
it simply matches the token string against a regular expression. On a match,
the token string is replaced by a given replacement string (which may include
regular-expression replacement characters like &).
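The replacement logic can be sketched like this (again, class and method
names are illustrative; note that Java's Matcher.replaceAll uses '$1'-style
group references in the replacement string):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the regex replacement filter: tokens matching the pattern
// are rewritten with the replacement string, others pass through unchanged.
public class RegexReplaceSketch {
    private final Pattern pattern;
    private final String replacement;

    public RegexReplaceSketch(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    public String apply(String token) {
        Matcher m = pattern.matcher(token);
        // Only rewrite tokens that match the whole pattern.
        return m.matches() ? m.replaceAll(replacement) : token;
    }
}
```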
If you agree that these are useful extensions to LuCas, I would be glad to
add some documentation and create a patch file which I would send to one of
the committers.
Best regards,
Erik