Hi Erik, sorry for my late reply.
2012/1/6 Erik Fäßler <[email protected]>

> Hello everyone,
>
> In the course of my work I needed LuCas to do a few things which were not
> possible, or at least not easy, out of the box. I checked out the latest
> LuCas version and adapted it to suit my needs. In this mail I'd like to
> describe the changes I have made. In case you feel these changes should
> make their way back into the repository, I would be glad to share them. On
> the other hand, some of the things I did may already have been possible
> before (I could simply have missed that). In such cases I would refactor
> my changes, of course.
>
> The main extension arose from the following idea:
>
> In the documents I want to index, gene names and identifiers are tagged
> (as a UIMA type 'Gene'). These identifiers are indexed so you can search
> for them. For faceting purposes I send these identifiers into a Lucene
> field named 'facetTerms'. However, I have quite a lot of identifiers, AND
> the identifiers are organized into multiple categories in my application.
> The best solution for me would be a separate field for each of these
> categories, containing only the gene identifiers belonging to that
> category. This makes it easy to obtain facet counts per category.
>
> Now, I have over 20 categories, and I did not like the idea of a LuCas
> mapping file with 20 copies of nearly the same field definition.
>
> So I allowed new attributes on a field element in the mapping file. These
> attributes specify:
>
> - a file determining the association between each possible term and its
>   category (same format as the hypernym file, so one term can belong to
>   multiple categories);
> - the naming scheme of the new fields;
> - whether to ignore case when comparing the entries of the above-mentioned
>   file to the actual terms extracted from the documents.
>
> I wrote a class which distributes the terms to their categories by
> creating the corresponding TokenStreams. Each TokenStream is supposed to
> let only those tokens pass which belong to its category; these tokens are
> determined by the association file described above. Thus we need the
> opposite of a StopWordFilter, so I added a 'SelectFilter' for this
> purpose. This filter essentially takes a set representing a closed
> vocabulary, lets tokens pass which are included in the set, and drops all
> other tokens (this is where the ignore-case option comes into play).
>
> Another thing I did was to implement a RegExp replacement filter: it
> simply matches each token string against a regular expression, and on a
> match the token string is replaced by a given replacement string (which
> may include regexp replacement characters like &).
>
> If you agree that these are good extensions to LuCas, I would be glad to
> add some documentation and create a patch file, which I would send to one
> of the committers.
>

Thanks for sharing these very useful insights; I do think we should
incorporate this extension. Is there any existing use case that this
improvement is going to drop?

Tommaso

>
> Best regards,
>
> Erik
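
For illustration, a minimal sketch of what such an extended field element in
the mapping file might look like. The mail does not name the actual
attributes, so the names used here (termCategoryFile, categoryFieldNamePattern,
ignoreCategoryCase) and the placeholder file path are purely illustrative:

    <field name="facetTerms"
           termCategoryFile="resources/geneId2category.map"
           categoryFieldNamePattern="facetTerms_%category%"
           ignoreCategoryCase="true">
        <!-- existing annotation/filter configuration for the field as before;
             one extra field per category found in the mapping file would be
             generated automatically, e.g. facetTerms_kinase -->
    </field>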
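
To make the SelectFilter idea concrete, here is a minimal sketch against the
Lucene 3.x TokenFilter API. The class name and constructor parameters follow
the description above, but the details (how the vocabulary is loaded, and that
it is assumed to be lowercased when ignoreCase is set) are assumptions, not
the actual patch:

    import java.io.IOException;
    import java.util.Set;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /**
     * Opposite of a stop filter: only tokens contained in the given closed
     * vocabulary are passed on; all other tokens are dropped.
     */
    public final class SelectFilter extends TokenFilter {

        private final Set<String> vocabulary; // e.g. the gene IDs of one category
        private final boolean ignoreCase;     // vocabulary assumed lowercased if true
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public SelectFilter(TokenStream input, Set<String> vocabulary, boolean ignoreCase) {
            super(input);
            this.vocabulary = vocabulary;
            this.ignoreCase = ignoreCase;
        }

        @Override
        public boolean incrementToken() throws IOException {
            // consume tokens until one is found that belongs to the vocabulary
            while (input.incrementToken()) {
                String term = termAtt.toString();
                if (ignoreCase) {
                    term = term.toLowerCase();
                }
                if (vocabulary.contains(term)) {
                    return true;
                }
            }
            return false; // end of the underlying stream
        }
    }

One such filter per category, each wrapping the same source TokenStream with
that category's vocabulary, would then feed the per-category fields described
above.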
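
Likewise, a sketch of the RegExp replacement filter under the same API
assumptions; the class name RegexReplaceFilter is chosen here for
illustration:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /**
     * Matches each token's text against a regular expression and, on a match,
     * replaces it with the given replacement string (group references such as
     * $1 are allowed in the replacement).
     */
    public final class RegexReplaceFilter extends TokenFilter {

        private final Pattern pattern;
        private final String replacement;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public RegexReplaceFilter(TokenStream input, Pattern pattern, String replacement) {
            super(input);
            this.pattern = pattern;
            this.replacement = replacement;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            Matcher matcher = pattern.matcher(termAtt.toString());
            if (matcher.find()) {
                // overwrite the token text with the rewritten string
                termAtt.setEmpty().append(matcher.replaceAll(replacement));
            }
            return true;
        }
    }

Such a filter could, for example, be used to normalize identifier prefixes
before the tokens reach the per-category SelectFilters.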
