Re: LuCas extension

Erik Fäßler Fri, 20 Jan 2012 03:40:23 -0800

Hello Tommaso,

If you are asking whether the behavior of LuCas has changed when my extension 
is simply not used, the answer is "no". Anyone not aware of the new 
functionality should not notice any difference.
I have opened a JIRA issue on this, btw. Glad you seem to like the changes. 
Before I go on to document these, I would like to discuss one naming convention:
Initially, I thought of the new function as "a partition on a closed set of 
terms", such that each dynamically created field would only hold terms 
belonging to a single part of the partition. That is not really true, however: 
multiple dynamically created fields may share some terms. So "partition" 
would be a bit misleading. A more correct concept would be a "cover of a closed 
set of terms". While this seems to be more like it, the union of a "cover of a 
set X" may even be bigger than X. I admit that this is a minor flaw in terminology.
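
To make the distinction concrete, here is a minimal, library-independent sketch in Python (not LuCas code). The term set and the field names `facetTerms_cancer` and `facetTerms_signaling` are made up for illustration. It checks whether a family of per-field term sets is a partition or merely a cover of the closed term set:

```python
from itertools import combinations


def is_cover(universe, parts):
    """True if every term of the universe occurs in at least one part."""
    return set(universe) <= set().union(*parts)


def is_partition(universe, parts):
    """True if the parts are non-empty, pairwise disjoint,
    and their union is exactly the universe."""
    parts = [set(p) for p in parts]
    if any(not p for p in parts):
        return False
    disjoint = all(a.isdisjoint(b) for a, b in combinations(parts, 2))
    return disjoint and set().union(*parts) == set(universe)


# Hypothetical closed term set and hypothetical dynamically created fields.
terms = {"brca1", "tp53", "egfr"}
fields = {
    "facetTerms_cancer": {"brca1", "tp53"},
    "facetTerms_signaling": {"tp53", "egfr"},  # shares "tp53" with the field above
}

print(is_cover(terms, fields.values()))      # the fields cover all terms
print(is_partition(terms, fields.values()))  # but they overlap, so no partition
```

A part that contains an extra term outside the closed set would still pass `is_cover`, which is the "bigger than X" flaw mentioned above.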


So I would tend to use the name "cover" when describing what the function does 
(in documentation as well as in the code). Any opinions on that? Alternative 
ideas? As soon as I get feedback on this, I will begin to write documentation 
and polish my code.

Thanks and best regards,

        Erik

On 18.01.2012 at 14:12, Tommaso Teofili wrote:

> Hi Erik,
> 
> sorry for my late reply.
> 
> 2012/1/6 Erik Fäßler <[email protected]>
> 
>> Hello everyone,
>> 
>> In the course of my work I needed LuCas to do a few things which were not
>> possible or at least not too easy out-of-the-box. I checked out the latest
>> LuCas version and adapted it to suit my needs.
>> In this mail I'd like to describe the changes I have made. In case you
>> feel these changes should make their way back into the repository, I would
>> be glad to share them. On the other hand, some of the things I did might
>> already have been possible before (I could simply have missed that).
>> In such cases I would refactor my changes, of course.
>> 
>> The main extension arose from the following idea:
>> 
>> In the documents I want to index, gene names and identifiers are tagged
>> (into a UIMA type 'Gene'). These identifiers are indexed so you can search
>> for them. For faceting purposes I send these identifiers into a Lucene
>> field named 'facetTerms'. However, I have a large number of identifiers
>> AND the identifiers are organized in multiple categories in my application.
>> The best thing for me would be to have a single field for each of these
>> categories, containing only gene identifiers belonging to this category.
>> This makes it easy to obtain facet counts per category.
>> 
>> Now I have over 20 categories and I did not like the idea of a LuCas
>> mapping file with 20 copies of nearly the same field definition.
>> 
>> So I added new attributes to the field element in the mapping file. These
>> attributes specify:
>> 
>> - A file determining the association between each possible term and its
>>   category (same format as the hypernym file, so one term can belong to
>>   multiple categories);
>> - The naming scheme of the new fields;
>> - Whether to ignore case when comparing the entries of the above
>>   mentioned file to the actual terms extracted from documents.
>> 
>> I wrote a class which distributes the terms to their categories by
>> creating the corresponding TokenStreams. Each TokenStream is
>> supposed to let only those tokens pass which belong to its category. These
>> tokens are determined by the association file described above. Thus we need
>> the opposite of a StopWordFilter. I've added the 'SelectFilter' for this
>> purpose. This filter takes a set representing a closed vocabulary,
>> lets tokens pass which are included in the set, and rejects all other tokens
>> (this is where the ignore-case option comes into play).
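
The select-filter idea described above can be sketched roughly as follows. This is a library-independent Python illustration, not the LuCas or Lucene API: the in-memory `term_to_categories` mapping stands in for the association file (whose exact format is not shown in this thread), and the function names and the `facetTerms_` field prefix are assumptions made for the example.

```python
from collections import defaultdict


def vocabularies_by_category(term_to_categories, ignore_case=False):
    """Invert a term -> categories mapping into category -> closed vocabulary."""
    vocab = defaultdict(set)
    for term, categories in term_to_categories.items():
        key = term.lower() if ignore_case else term
        for category in categories:
            vocab[category].add(key)
    return dict(vocab)


def select_filter(tokens, vocabulary, ignore_case=False):
    """Opposite of a stop-word filter: pass only tokens found in the vocabulary.

    With ignore_case=True the vocabulary is expected to be lower-cased already,
    as produced by vocabularies_by_category(..., ignore_case=True).
    """
    for token in tokens:
        key = token.lower() if ignore_case else token
        if key in vocabulary:
            yield token


def distribute(tokens, term_to_categories, field_prefix="facetTerms_",
               ignore_case=False):
    """Build one token list per category field (hypothetical naming scheme)."""
    tokens = list(tokens)
    vocab = vocabularies_by_category(term_to_categories, ignore_case)
    return {
        field_prefix + category: list(select_filter(tokens, terms, ignore_case))
        for category, terms in vocab.items()
    }


# Hypothetical associations; "TP53" belongs to two categories.
associations = {
    "BRCA1": ["cancer"],
    "TP53": ["cancer", "signaling"],
    "EGFR": ["signaling"],
}
print(distribute(["tp53", "the", "BRCA1", "egfr"], associations, ignore_case=True))
```

Because "TP53" is associated with two categories, the matching token ends up in both generated fields, which is why the fields form a cover rather than a partition of the term set.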
>> 
>> Another thing I did was to implement a RegExp replacement filter - it
>> simply matches the token string against a regular expression. On a match,
>> the token string is replaced by a given replacement string (which may
>> include reg exp replacement characters like &).
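
A rough, library-independent sketch of such a replacement filter in Python (not the LuCas implementation). It assumes that only the matched portion of the token is substituted and that non-matching tokens pass through unchanged. The function name and the sample tokens are made up, and the back-reference syntax is Python's (`\1`, `\g<0>`), where `\g<0>` plays the role of `&` as "the whole match":

```python
import re


def regex_replace_filter(tokens, pattern, replacement):
    """Replace the matched part of each token; non-matching tokens pass unchanged."""
    regex = re.compile(pattern)
    for token in tokens:
        # re.sub returns the token unchanged when the pattern does not match.
        yield regex.sub(replacement, token)


# Rewrite a hypothetical "gene:<number>" token using a captured group.
print(list(regex_replace_filter(["gene:1234", "abc"], r"^gene:(\d+)$", r"ID\1")))

# Wrap the whole match in angle brackets via \g<0>.
print(list(regex_replace_filter(["a12"], r"\d+", r"<\g<0>>")))
```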
>> 
>> If you agree that these things are good extensions to LuCas, I would
>> be glad to add some documentation and create a patch file which I would
>> send to one of the committers.
>> 
> 
> Thanks for sharing these very useful insights, I do think we should
> incorporate this extension.
> Is there any existing use case that this improvement is going to drop?
> 
> Tommaso
> 
> 
> 
>> 
>> Best regards,
>> 
>>       Erik
