Hi Erik, sorry for my late reply.
2012/1/6 Erik Fäßler <[email protected]>

> Hello everyone,
>
> In the course of my work I needed LuCas to do a few things which were not
> possible, or at least not easy, out of the box. I checked out the latest
> LuCas version and adapted it to suit my needs. In this mail I'd like to
> describe the changes I have made. In case you feel these changes should
> make their way back into the repository, I would be glad to share them. On
> the other hand, some of the things I did may already have been possible
> before (I could simply have missed that). In such cases I would refactor
> my changes, of course.
>
> The main extension arose from the following idea:
>
> In the documents I want to index, gene names and identifiers are tagged
> (as a UIMA type 'Gene'). These identifiers are indexed so you can search
> for them. For faceting purposes I send these identifiers into a Lucene
> field named 'facetTerms'. However, I have quite a lot of identifiers, AND
> the identifiers are organized into multiple categories in my application.
> The best solution for me would be a separate field for each of these
> categories, containing only the gene identifiers belonging to that
> category. This makes it easy to obtain facet counts per category.
>
> Now, I have over 20 categories, and I did not like the idea of a LuCas
> mapping file with 20 copies of nearly the same field definition.
>
> So I allowed new attributes on a field element in the mapping file. These
> attributes specify:
>
> - a file determining the association between each possible term and its
>   category (same format as the hypernym file, so one term can belong to
>   multiple categories);
> - the naming scheme of the new fields;
> - whether to ignore case when comparing the entries of the above-mentioned
>   file to the actual terms extracted from the documents.
>
> I wrote a class which distributes the terms to their categories by
> creating the corresponding TokenStreams. Each TokenStream is supposed to
> let only those tokens pass which belong to its category; these tokens are
> determined by the association file described above. Thus we need the
> opposite of a StopWordFilter, so I added a 'SelectFilter' for this
> purpose. This filter essentially takes a set representing a closed
> vocabulary, lets tokens pass which are included in the set, and drops all
> other tokens (this is where the ignore-case option comes into play).
>
> Another thing I did was to implement a RegExp replacement filter: it
> simply matches each token string against a regular expression, and on a
> match the token string is replaced by a given replacement string (which
> may include regexp replacement characters like &).
>
> If you agree that these are good extensions to LuCas, I would be glad to
> add some documentation and create a patch file, which I would send to one
> of the committers.
>

Thanks for sharing these very useful insights; I do think we should
incorporate this extension. Is there any existing use case that this
improvement is going to drop?

Tommaso

>
> Best regards,
>
> Erik
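
For illustration, a minimal sketch of what such an extended field element in
the mapping file might look like. The mail does not name the actual
attributes, so the names used here (termCategoryFile, categoryFieldNamePattern,
ignoreCategoryCase) and the placeholder file path are purely illustrative:

    <field name="facetTerms"
           termCategoryFile="resources/geneId2category.map"
           categoryFieldNamePattern="facetTerms_%category%"
           ignoreCategoryCase="true">
        <!-- existing annotation/filter configuration for the field as before;
             one extra field per category found in the mapping file would be
             generated automatically, e.g. facetTerms_kinase -->
    </field>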
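
To make the SelectFilter idea concrete, here is a minimal sketch against the
Lucene 3.x TokenFilter API. The class name and constructor parameters follow
the description above, but the details (how the vocabulary is loaded, and that
it is assumed to be lowercased when ignoreCase is set) are assumptions, not
the actual patch:

    import java.io.IOException;
    import java.util.Set;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /**
     * Opposite of a stop filter: only tokens contained in the given closed
     * vocabulary are passed on; all other tokens are dropped.
     */
    public final class SelectFilter extends TokenFilter {

        private final Set<String> vocabulary; // e.g. the gene IDs of one category
        private final boolean ignoreCase;     // vocabulary assumed lowercased if true
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public SelectFilter(TokenStream input, Set<String> vocabulary, boolean ignoreCase) {
            super(input);
            this.vocabulary = vocabulary;
            this.ignoreCase = ignoreCase;
        }

        @Override
        public boolean incrementToken() throws IOException {
            // consume tokens until one is found that belongs to the vocabulary
            while (input.incrementToken()) {
                String term = termAtt.toString();
                if (ignoreCase) {
                    term = term.toLowerCase();
                }
                if (vocabulary.contains(term)) {
                    return true;
                }
            }
            return false; // end of the underlying stream
        }
    }

One such filter per category, each wrapping the same source TokenStream with
that category's vocabulary, would then feed the per-category fields described
above.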
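
Likewise, a sketch of the RegExp replacement filter under the same API
assumptions; the class name RegexReplaceFilter is chosen here for
illustration:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /**
     * Matches each token's text against a regular expression and, on a match,
     * replaces it with the given replacement string (group references such as
     * $1 are allowed in the replacement).
     */
    public final class RegexReplaceFilter extends TokenFilter {

        private final Pattern pattern;
        private final String replacement;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public RegexReplaceFilter(TokenStream input, Pattern pattern, String replacement) {
            super(input);
            this.pattern = pattern;
            this.replacement = replacement;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            Matcher matcher = pattern.matcher(termAtt.toString());
            if (matcher.find()) {
                // overwrite the token text with the rewritten string
                termAtt.setEmpty().append(matcher.replaceAll(replacement));
            }
            return true;
        }
    }

Such a filter could, for example, be used to normalize identifier prefixes
before the tokens reach the per-category SelectFilters.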
