[ 
https://issues.apache.org/jira/browse/UIMA-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Faessler updated UIMA-2318:
--------------------------------

    Attachment: termsToFieldsDistr.patch

Uploaded a first version of the patch (it delivers the functionality, but not 
yet any documentation for it).
                
> Define automatic distribution of a closed term set over multiple fields in 
> one field definition.
> ------------------------------------------------------------------------------------------------
>
>                 Key: UIMA-2318
>                 URL: https://issues.apache.org/jira/browse/UIMA-2318
>             Project: UIMA
>          Issue Type: Improvement
>          Components: Sandbox-Lucas
>            Reporter: Erik Faessler
>            Priority: Minor
>         Attachments: termsToFieldsDistr.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In the course of my work I needed LuCas to do a few things which were not 
> possible, or at least not easy, out of the box. I checked out the latest 
> LuCas version and adapted it to suit my needs.
> The main extension arose from the following idea:
> In the documents I want to index, gene names and identifiers are tagged (as 
> the UIMA type 'Gene'). These identifiers are indexed so you can search for 
> them. For faceting purposes I send these identifiers into a Lucene field 
> named 'facetTerms'. However, I have a large number of identifiers AND the 
> identifiers are organized in multiple categories in my application. The best 
> thing for me would be to have a single field for each of these categories, 
> containing only the gene identifiers belonging to that category.
> This makes it easy to obtain facet counts per category.
> Now I have over 20 categories and I did not like the idea of a LuCas mapping 
> file with 20 copies of nearly the same field definition.
> So I added new attributes to the field element in the mapping file. These 
> attributes specify:
> * A file determining the association between each possible term and its 
> category (same format as hypernym file, so one term can belong to multiple 
> categories);
> * The naming scheme of the new fields;
> * Whether to ignore case when comparing the entries of the above-mentioned 
> file to the actual terms extracted from documents.
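
The association file and naming scheme from the list above could be read roughly as follows. This is only a sketch, not the code from termsToFieldsDistr.patch: the `term=category1|category2` line format is an assumption about the hypernym file format, and the class name `TermCategoryIndex`, its methods, and the prefix-based field naming are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of LuCas or of the attached patch.
class TermCategoryIndex {

    // Builds category -> closed term set from lines of the assumed form
    // "term=category1|category2". One term may belong to several categories.
    static Map<String, Set<String>> termsByCategory(List<String> lines, boolean ignoreCase) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // skip blank or malformed lines
            }
            String term = line.substring(0, eq).trim();
            if (ignoreCase) {
                term = term.toLowerCase(Locale.ROOT); // store terms case-folded
            }
            for (String rawCategory : line.substring(eq + 1).split("\\|")) {
                String category = rawCategory.trim();
                if (category.isEmpty()) {
                    continue;
                }
                result.computeIfAbsent(category, k -> new HashSet<>()).add(term);
            }
        }
        return result;
    }

    // Assumed naming scheme: a fixed prefix followed by the category name.
    static String fieldName(String prefix, String category) {
        return prefix + category;
    }
}
```

With such a map, a single field definition can be expanded into one generated field per key, each backed by the term set of its category.
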
> I wrote a class which realizes the distribution of the terms to their 
> categories by creating the corresponding TokenStreams. Each TokenStream is 
> supposed to let only those tokens pass which belong to its category. These 
> tokens are determined by the association file described above. Thus we need 
> the opposite of a StopWordFilter. I've added the 'SelectFilter' for this 
> purpose. This filter takes a set representing a closed vocabulary, lets 
> tokens pass which are included in the set, and rejects all other tokens 
> (this is where the ignore-case option comes into play).
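
The selection logic described above (the opposite of a stop-word filter) can be sketched in plain Java without Lucene. The class `SelectFilterSketch` and its methods are hypothetical and work on lists of strings; the 'SelectFilter' in the patch operates on TokenStreams and may differ in detail.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the selection logic; not the patch's SelectFilter.
class SelectFilterSketch {
    private final boolean ignoreCase;
    private final Set<String> vocabulary = new HashSet<>();

    SelectFilterSketch(Collection<String> closedVocabulary, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        for (String term : closedVocabulary) {
            vocabulary.add(normalize(term));
        }
    }

    // Case-folds the string only when the ignore-case option is set.
    private String normalize(String s) {
        return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
    }

    // A token passes only if it is in the closed vocabulary.
    boolean accepts(String token) {
        return vocabulary.contains(normalize(token));
    }

    // Keeps accepted tokens in their original form and order.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accepts(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

One such filter per category, each built from that category's term set, gives the per-category streams described in the issue.
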
> Another thing I did was to implement a RegExp replacement filter. It simply 
> matches each token string against a regular expression. On a match, the token 
> string is replaced by a given replacement string (which may include regexp 
> replacement characters like &).
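
A per-token regular expression replacement can be sketched with java.util.regex as shown below. The class `RegexReplaceSketch` is hypothetical. Note that `Matcher.replaceAll` refers to the whole match as `$0` and to groups as `$1`, `$2`, and so on; whether the patch's filter accepts `&` directly or translates it is not stated in the issue, and replacing every matching part of the token (rather than requiring a full-token match) is also an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a per-token replacement; not the patch's filter.
class RegexReplaceSketch {
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceSketch(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    // Returns the token with every match replaced; tokens without a
    // match are returned unchanged.
    String apply(String token) {
        return pattern.matcher(token).replaceAll(replacement);
    }
}
```

For example, a pattern of `^GeneID:(\d+)$` with the replacement `$1` would reduce a token to its numeric part.
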
> Please note that the delivered patch file is not complete in terms of 
> documentation, file headers etc. I would add these things if the changes are 
> accepted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira