[ https://issues.apache.org/jira/browse/UIMA-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erik Faessler updated UIMA-2318:
--------------------------------
Attachment: termsToFieldsDistr.patch
Uploaded a first version of the patch (it delivers the functionality but does
not yet include documentation for it).
> Define automatic distribution of a closed term set over multiple fields in
> one field definition.
> ------------------------------------------------------------------------------------------------
>
> Key: UIMA-2318
> URL: https://issues.apache.org/jira/browse/UIMA-2318
> Project: UIMA
> Issue Type: Improvement
> Components: Sandbox-Lucas
> Reporter: Erik Faessler
> Priority: Minor
> Attachments: termsToFieldsDistr.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> In the course of my work I needed LuCas to do a few things which were not
> possible, or at least not easy, out of the box. I checked out the latest
> LuCas version and adapted it to suit my needs.
> The main extension arose from the following idea:
> In the documents I want to index, gene names and identifiers are tagged (as
> the UIMA type 'Gene'). These identifiers are indexed so that you can search
> for them. For faceting purposes I send these identifiers into a Lucene field
> named 'facetTerms'. However, I have a large number of identifiers AND the
> identifiers are organized in multiple categories in my application. The best
> solution for me would be a separate field for each of these categories,
> containing only the gene identifiers belonging to that category.
> This makes it easy to obtain facet counts per category.
> Now I have over 20 categories and I did not like the idea of a LuCas mapping
> file with 20 copies of nearly the same field definition.
> So I added new attributes to the field element in the mapping file. These
> attributes specify:
> * A file determining the association between each possible term and its
> category (same format as hypernym file, so one term can belong to multiple
> categories);
> * The naming scheme of the new fields;
> * Whether to ignore the case when comparing the entries of the above
> mentioned file to the actual terms extracted from documents.
> I wrote a class which realizes the distribution of the terms to their
> categories by creating the corresponding TokenStreams. Each TokenStream is
> supposed to let only those tokens pass which belong to its category. These
> tokens are determined by the association file described above. Thus we need
> the opposite of a StopWordFilter. I've added the 'SelectFilter' for this
> purpose. This filter takes a set representing a closed vocabulary, lets
> tokens pass which are included in the set and rejects all other tokens (this
> is where the ignore-case option comes into play).
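
The distribution step described above can be illustrated with a minimal
sketch. It is not the code from the patch: plain Java collections stand in
for Lucene TokenStreams, and the class name, the method names, the field
prefix and the association line format ("term=categoryA|categoryB") are all
assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: lists of strings stand in for TokenStreams.
class TermDistributionSketch {

    // Lower-cases the string when case is to be ignored, otherwise returns it unchanged.
    static String normalize(String s, boolean ignoreCase) {
        return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
    }

    // Builds category -> closed vocabulary from association lines.
    // ASSUMED line format: "term=categoryA|categoryB" (one term, one or more categories).
    static Map<String, Set<String>> parseAssociations(List<String> lines, boolean ignoreCase) {
        Map<String, Set<String>> termsByCategory = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // skip lines that do not follow the assumed format
            }
            String term = normalize(line.substring(0, eq).trim(), ignoreCase);
            for (String category : line.substring(eq + 1).split("\\|")) {
                String name = category.trim();
                if (!name.isEmpty()) {
                    termsByCategory.computeIfAbsent(name, c -> new LinkedHashSet<>()).add(term);
                }
            }
        }
        return termsByCategory;
    }

    // Select filter: the inverse of a stop-word filter. Only tokens contained
    // in the vocabulary pass. The vocabulary must have been normalized with
    // the same ignoreCase setting.
    static List<String> select(List<String> tokens, Set<String> vocabulary, boolean ignoreCase) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (vocabulary.contains(normalize(token, ignoreCase))) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Creates one field per category, named fieldPrefix + category (a
    // hypothetical naming scheme), each holding only that category's tokens.
    static Map<String, List<String>> distribute(List<String> tokens,
                                                Map<String, Set<String>> termsByCategory,
                                                String fieldPrefix,
                                                boolean ignoreCase) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : termsByCategory.entrySet()) {
            fields.put(fieldPrefix + entry.getKey(), select(tokens, entry.getValue(), ignoreCase));
        }
        return fields;
    }
}
```

With a single field definition and one association file, a term that belongs
to several categories ends up in each of the corresponding fields, and tokens
outside the closed term set appear in none of them.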
> Another thing I did was to implement a RegExp replacement filter. It simply
> matches each token string against a regular expression. On a match, the token
> string is replaced by a given replacement string (which may include regular
> expression replacement characters like &).
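
A minimal sketch of such a replacement step, again on plain strings rather
than Lucene tokens and not taken from the patch. It assumes that a token
counts as a match only when the whole token matches the expression, and it
uses java.util.regex, where group references in the replacement are written
as $0, $1 and so on (the & character is not special there). The class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: operates on token strings, not on a TokenStream.
class RegExpReplaceSketch {

    // Returns the replacement (with group references expanded) if the whole
    // token matches the pattern, otherwise returns the token unchanged.
    static String replaceToken(String token, Pattern pattern, String replacement) {
        Matcher matcher = pattern.matcher(token);
        if (!matcher.matches()) {
            return token; // no match: the token passes through untouched
        }
        // The match spans the whole token, so appendReplacement writes only
        // the expanded replacement string into the buffer.
        StringBuffer out = new StringBuffer();
        matcher.appendReplacement(out, replacement);
        return out.toString();
    }

    // Applies the replacement to every token of a list.
    static List<String> replaceTokens(List<String> tokens, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(replaceToken(token, pattern, replacement));
        }
        return result;
    }
}
```

For example, calling replaceTokens with the expression "gene[:_](\\d+)" and
the replacement "GENE:$1" rewrites the token gene_672 to GENE:672 and leaves
non-matching tokens as they are.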
> Please note that the delivered patch file is not complete in terms of
> documentation, file headers etc. I would add these things if the changes are
> accepted.