Define automatic distribution of a closed term set over multiple fields in one
field definition.
------------------------------------------------------------------------------------------------
Key: UIMA-2318
URL: https://issues.apache.org/jira/browse/UIMA-2318
Project: UIMA
Issue Type: Improvement
Components: Sandbox-Lucas
Reporter: Erik Faessler
Priority: Minor
In the course of my work I needed LuCas to do a few things that were not
possible, or at least not straightforward, out of the box. I checked out the
latest LuCas version and adapted it to suit my needs.
The main extension arose from the following idea:
In the documents I want to index, gene names and identifiers are tagged (as a
UIMA type 'Gene'). These identifiers are indexed so that they can be searched
for. For faceting purposes I send the identifiers into a Lucene field named
'facetTerms'. However, I have a large number of identifiers, and the
identifiers are organized into multiple categories in my application. The best
solution for me would be a separate field for each of these categories,
containing only the gene identifiers that belong to that category.
This makes it easy to obtain facet counts per category.
Now I have more than 20 categories, and I did not like the idea of a LuCas
mapping file containing 20 copies of nearly the same field definition.
So I added support for new attributes on a field element in the mapping file
(a sketch of such a field definition follows the list below). These
attributes specify:
* A file defining the association between each possible term and its
category (same format as the hypernym file, so one term can belong to multiple
categories);
* The naming scheme for the new fields;
* Whether to ignore case when comparing the entries of the above-mentioned
file to the actual terms extracted from the documents.
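To make the idea concrete, a field definition along these lines might look as
follows. This is only an illustrative sketch: the attribute names
(termCategoryFile, fieldNameScheme, ignoreCase) and the placeholder syntax are
hypothetical and not necessarily the names used in the patch.

{code:xml}
<!-- Hypothetical sketch: a single field definition that is expanded into
     one Lucene field per category listed in the association file. -->
<field name="facetTerms"
       index="yes"
       termCategoryFile="gene-id-categories.txt"
       fieldNameScheme="facetTerms_%category%"
       ignoreCase="true">
  <!-- the usual LuCas annotation/filter configuration for the Gene type
       would go here, unchanged -->
</field>
{code}

With such a definition, each category found in the association file would
produce its own field, e.g. a category 'kinase' would yield a field
'facetTerms_kinase' containing only the identifiers of that category.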
I wrote a class that realizes the distribution of the terms to their
categories by creating the corresponding TokenStreams. Each TokenStream lets
only those tokens pass that belong to its category; these tokens are
determined by the association file described above. Thus we need the opposite
of a stop word filter, and I've added a 'SelectFilter' for this purpose. The
filter takes a set representing a closed vocabulary, lets tokens pass that are
contained in the set, and discards all other tokens (this is where the
ignore-case option comes into play).
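For illustration, a minimal sketch of such a select filter might look like the
following. This is not the code from the patch, just the general idea
expressed against Lucene's TokenFilter API (the exact package of CharArraySet
differs between Lucene versions). The ignore-case option maps naturally onto
the ignoreCase flag with which the CharArraySet vocabulary is created.

{code:java}
import java.io.IOException;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Sketch: the inverse of a stop filter - only tokens contained in the given
 * vocabulary are kept, all other tokens are dropped.
 */
public final class SelectFilter extends TokenFilter {

  private final CharArraySet vocabulary;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  /** Case sensitivity is controlled by the ignoreCase flag of the set. */
  public SelectFilter(TokenStream input, CharArraySet vocabulary) {
    super(input);
    this.vocabulary = vocabulary;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (vocabulary.contains(termAtt.buffer(), 0, termAtt.length())) {
        return true; // token belongs to this category's vocabulary
      }
      // otherwise skip the token and look at the next one
    }
    return false; // end of the underlying stream
  }
}
{code}

One such filter instance per category, each fed with that category's
vocabulary, yields the per-category TokenStreams described above.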
Another thing I did was to implement a regexp replacement filter: it matches
the token string against a regular expression, and on a match the token string
is replaced by a given replacement string (which may include regexp
replacement references such as &).
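A corresponding sketch of the replacement filter, again only illustrative and
based on java.util.regex rather than the actual patch code (note that with
java.util.regex, group references in the replacement are written as $1, $2,
...):

{code:java}
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Sketch: rewrites each token whose text matches the given pattern. */
public final class RegexReplaceFilter extends TokenFilter {

  private final Pattern pattern;
  private final String replacement;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public RegexReplaceFilter(TokenStream input, Pattern pattern, String replacement) {
    super(input);
    this.pattern = pattern;
    this.replacement = replacement;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    Matcher m = pattern.matcher(termAtt.toString());
    if (m.matches()) {
      // replacement may contain group references such as $1
      termAtt.setEmpty().append(m.replaceAll(replacement));
    }
    return true;
  }
}
{code}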
Please note that the delivered patch file is not yet complete in terms of
documentation, file headers, etc. I will add these if the changes are
accepted.