Define automatic distribution of a closed term set over multiple fields in one 
field definition.
------------------------------------------------------------------------------------------------

                 Key: UIMA-2318
                 URL: https://issues.apache.org/jira/browse/UIMA-2318
             Project: UIMA
          Issue Type: Improvement
          Components: Sandbox-Lucas
            Reporter: Erik Faessler
            Priority: Minor


In the course of my work I needed LuCas to do a few things that were not 
possible, or at least not easy, out of the box. I checked out the latest 
LuCas version and adapted it to suit my needs.

The main extension arose from the following idea:

In the documents I want to index, gene names and identifiers are tagged (as the 
UIMA type 'Gene'). These identifiers are indexed so you can search for them. 
For faceting purposes I send these identifiers into a Lucene field named 
'facetTerms'. However, I have a large number of identifiers, and the 
identifiers are organized in multiple categories in my application. Ideally, 
I would have a single field for each of these categories, containing only the 
gene identifiers that belong to that category. This makes it easy to obtain 
facet counts per category.

Now I have over 20 categories and I did not like the idea of a LuCas mapping 
file with 20 copies of nearly the same field definition.

So I added new attributes to the field element in the mapping file. These 
attributes specify:

* A file determining the association between each possible term and its 
category (same format as the hypernym file, so one term can belong to multiple 
categories);
* The naming scheme of the new fields;
* Whether to ignore case when comparing the entries of the above-mentioned 
file to the actual terms extracted from documents.
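
As a rough illustration of the three attributes above, the following sketch 
parses association lines and groups terms by category. It is not taken from 
the patch: the class name TermCategoryMap, the line format 
'term=category1|category2' (assumed here to match the hypernym file), the 
prefix-based naming scheme and the sample terms are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of LuCas or the patch.
class TermCategoryMap {

    // Assumed line format: term=category1|category2
    // Returns a map from category to the set of terms in that category.
    static Map<String, Set<String>> termsByCategory(List<String> lines, boolean ignoreCase) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int sep = trimmed.indexOf('=');
            if (trimmed.isEmpty() || sep <= 0) {
                continue; // skip blank lines and lines without a term
            }
            String term = trimmed.substring(0, sep).trim();
            if (ignoreCase) {
                term = term.toLowerCase(Locale.ROOT);
            }
            for (String category : trimmed.substring(sep + 1).split("\\|")) {
                String name = category.trim();
                if (!name.isEmpty()) {
                    result.computeIfAbsent(name, k -> new TreeSet<>()).add(term);
                }
            }
        }
        return result;
    }

    // Assumed naming scheme: a fixed prefix followed by the category name.
    static String fieldName(String prefix, String category) {
        return prefix + category;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<String> lines = Arrays.asList("gene1=catA|catB", "gene2=catA");
        Map<String, Set<String>> byCategory = termsByCategory(lines, true);
        for (Map.Entry<String, Set<String>> e : byCategory.entrySet()) {
            System.out.println(fieldName("facetTerms_", e.getKey()) + " -> " + e.getValue());
        }
    }
}
```

Each entry of the resulting map would correspond to one generated field, 
which is what avoids repeating the field definition once per category.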

I wrote a class that distributes the terms to their categories by creating 
the corresponding TokenStreams. Each TokenStream is supposed to let only those 
tokens pass that belong to its category. These tokens are determined by the 
association file described above. Thus we need the opposite of a 
StopWordFilter. I've added the 'SelectFilter' for this purpose. This filter 
takes a set representing a closed vocabulary, lets tokens pass that are 
included in the set, and rejects all other tokens (this is where the 
ignore-case option comes into play).
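
The selection logic can be sketched in plain Java as follows. This is not the 
SelectFilter from the patch and does not use the Lucene TokenFilter API; it 
only shows the "inverse stop word" idea on a list of strings. The class name 
SelectFilterSketch and its methods are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the filtering idea, independent of Lucene.
class SelectFilterSketch {
    private final boolean ignoreCase;
    private final Set<String> vocabulary;

    SelectFilterSketch(Collection<String> closedVocabulary, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        this.vocabulary = new HashSet<>();
        for (String term : closedVocabulary) {
            this.vocabulary.add(normalize(term));
        }
    }

    // Lower-cases the term only when case is to be ignored.
    private String normalize(String term) {
        return ignoreCase ? term.toLowerCase(Locale.ROOT) : term;
    }

    // A token passes only if it is part of the closed vocabulary.
    boolean accepts(String token) {
        return vocabulary.contains(normalize(token));
    }

    // Keeps accepted tokens in their original form and order.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accepts(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up vocabulary and tokens for illustration only.
        SelectFilterSketch filter = new SelectFilterSketch(Arrays.asList("gene1", "gene2"), true);
        System.out.println(filter.filter(Arrays.asList("Gene1", "other", "gene2")));
    }
}
```

With one such vocabulary per category, each category field would only 
receive the tokens that belong to it.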

Another thing I did was to implement a RegExp replacement filter: it simply 
matches the token string against a regular expression. On a match, the token 
string is replaced by a given replacement string (which may include regexp 
replacement characters like &).
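
A minimal sketch of such a replacement step, using java.util.regex, might 
look like this. It is not the filter from the patch: the class name 
RegexReplaceSketch is an assumption, the sketch treats any occurrence of the 
pattern inside the token as a match, and it uses Java's '$1' syntax for group 
references, which may differ from the replacement characters the patch 
supports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the replacement idea, independent of Lucene.
class RegexReplaceSketch {
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceSketch(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    // Returns the rewritten token on a match, the unchanged token otherwise.
    String apply(String token) {
        Matcher matcher = pattern.matcher(token);
        return matcher.find() ? matcher.replaceAll(replacement) : token;
    }

    public static void main(String[] args) {
        // Made-up pattern and tokens for illustration only.
        RegexReplaceSketch filter = new RegexReplaceSketch("^gene:(\\d+)$", "$1");
        System.out.println(filter.apply("gene:42"));
        System.out.println(filter.apply("unrelated"));
    }
}
```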

Please note that the delivered patch file is not complete in terms of 
documentation, file headers etc. I would add these things if the changes are 
accepted.


        
