Hello everyone,
In the course of my work I needed LuCas to do a few things that were not
possible, or at least not easy, out of the box. I checked out the latest
LuCas version and adapted it to suit my needs.
In this mail I'd like to describe the changes I have made. In case you feel
these changes should make their way back into the repository, I would be glad
to share them. On the other hand, some of the things I did may already have
been possible before (I could simply have missed that); in such cases I would
of course refactor my changes.
The main extension arose from the following idea:
In the documents I want to index, gene names and identifiers are tagged (into a
UIMA type 'Gene'). These identifiers are indexed so you can search for them.
For faceting purposes I send these identifiers into a Lucene field named
'facetTerms'. However, I have quite a lot of identifiers AND the
identifiers are organized in multiple categories in my application. The best
thing for me would be to have a single field for each of these categories,
containing only gene identifiers belonging to this category.
This makes it easy to obtain facet counts per category.
Now I have over 20 categories and I did not like the idea of a LuCas mapping
file with 20 copies of nearly the same field definition.
So I introduced new attributes on the field element in the mapping file.
These attributes specify:
- a file defining the association between each possible term and its category
  (same format as the hypernym file, so one term can belong to multiple
  categories);
- the naming scheme of the new fields;
- whether to ignore case when comparing the entries of the above-mentioned
  file to the actual terms extracted from the documents.
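Purely for illustration, such a field element might look roughly like this.
The attribute names and the file path here are made up by me for this sketch;
they are not necessarily the names used in my patch:

```xml
<!-- hypothetical attribute names, for illustration only -->
<field name="facetTerms"
       termCategoryFile="/path/to/gene-categories.txt"
       categoryFieldNamePattern="facetTerms_%category%"
       categoryIgnoreCase="true"/>
```

The naming-scheme attribute would then expand to one Lucene field per
category found in the association file.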
I wrote a class which distributes the terms to their categories by creating
the corresponding TokenStreams. Each TokenStream lets only those tokens pass
which belong to its category; these tokens are determined by the association
file described above. Thus we need the opposite of a StopWordFilter, so I've
added a 'SelectFilter' for this purpose. This filter takes a set representing
a closed vocabulary, lets through the tokens that are contained in the set,
and discards all others (this is where the ignore-case option comes into
play).
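To make the idea concrete, here is a minimal sketch of the selection logic,
deliberately independent of Lucene's TokenFilter API (the class and method
names are mine, not the ones in the patch):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch of the SelectFilter idea: the inverse of a stop filter.
// Only tokens contained in a closed vocabulary pass; all others are dropped.
public class SelectFilterSketch {
    private final Set<String> vocabulary;
    private final boolean ignoreCase;

    public SelectFilterSketch(Set<String> vocabulary, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        this.vocabulary = new HashSet<>();
        for (String term : vocabulary) {
            // Normalize the vocabulary once if case should be ignored.
            this.vocabulary.add(ignoreCase ? term.toLowerCase(Locale.ROOT) : term);
        }
    }

    // A token is kept only if it is part of the vocabulary.
    public boolean accept(String token) {
        String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
        return vocabulary.contains(key);
    }

    public List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accept(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

In the real filter the same check would happen inside incrementToken(), one
SelectFilter instance per category, each holding that category's terms.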
Another thing I did was to implement a regular-expression replacement filter:
it simply matches the token string against a regular expression. On a match,
the token string is replaced by a given replacement string (which may include
regular-expression replacement characters like &).
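The replacement logic can be sketched like this (again, class and method
names are illustrative; note that Java's Matcher.replaceAll uses '$1'-style
group references in the replacement string):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the regex replacement filter: tokens matching the pattern
// are rewritten with the replacement string, others pass through unchanged.
public class RegexReplaceSketch {
    private final Pattern pattern;
    private final String replacement;

    public RegexReplaceSketch(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    public String apply(String token) {
        Matcher m = pattern.matcher(token);
        // Only rewrite tokens that match the whole pattern.
        return m.matches() ? m.replaceAll(replacement) : token;
    }
}
```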
If you agree that these are useful extensions to LuCas, I would be glad to
add some documentation and create a patch file which I would send to one of
the committers.
Best regards,
Erik