Spell check on a subset of an index ( 'namespace' aware spell checker)

E. van Chastelet Thu, 10 Nov 2011 04:16:49 -0800

Hi all,

In our project we would like to be able to scope search results to one 'namespace' (as we call it). This can easily be achieved by using a filter or just an additional must-clause.

For the spellchecker (and our autocompletion, which is a modified spellchecker), the story seems different. The spellchecker index is created using a LuceneDictionary, which has an IndexReader as its source. We would like to get (spellcheck/autocomplete) suggestions that are scoped to one namespace (i.e. the field 'namespace' should have a particular value). With a single source index containing docs for all namespaces, it does not seem possible to create a spellcheck index for each namespace the ordinary way.

Q1: Is there a way to construct a LuceneDictionary from a subset of a single source index (all terms where namespace = %value%)?
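To make Q1 concrete, here is a minimal, library-free sketch of what a namespace-scoped dictionary would have to do: walk the documents, keep only those whose namespace matches, and expose the distinct terms as an iterator of words. It is a model of the idea, not Lucene code. The names NsDoc and NamespaceDictionarySketch and the sample data are hypothetical; the method name getWordsIterator() mirrors what the Lucene 3.x Dictionary interface is assumed to expose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for an indexed document: a namespace value plus the
// already-tokenized terms of the field we want suggestions for.
final class NsDoc {
    final String namespace;
    final List<String> terms;

    NsDoc(String namespace, String... terms) {
        this.namespace = namespace;
        this.terms = Arrays.asList(terms);
    }
}

// Library-free model of a dictionary restricted to one namespace.
final class NamespaceDictionarySketch {
    private final List<NsDoc> docs;
    private final String namespace;

    NamespaceDictionarySketch(List<NsDoc> docs, String namespace) {
        this.docs = docs;
        this.namespace = namespace;
    }

    // Returns the distinct terms of all documents in the chosen namespace,
    // in sorted order. Terms that only occur in other namespaces are skipped.
    Iterator<String> getWordsIterator() {
        TreeSet<String> words = new TreeSet<>();
        for (NsDoc doc : docs) {
            if (namespace.equals(doc.namespace)) {
                words.addAll(doc.terms);
            }
        }
        return words.iterator();
    }
}
```

In a real index the per-document walk could be expensive, since terms are stored per field rather than per document; the sketch only shows the intended result, not an efficient way to obtain it.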

Another, maybe better solution is to customize the spellchecker by adding an additional namespace field to the spellchecker index. At query time, an additional must-clause is added, scoping the suggestions to one (or more) namespace(s). The advantage of this is that we can have a singleton spellchecker (or at least a single index reader) for all namespaces. This also means fewer open files for our application (imagine if there are over 1000 namespaces).

Q2: Will there be a significant penalty (say more than 50% slower) for the additional must-clause at query time?
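The second idea can be modelled without Lucene as follows: every suggestion entry carries a namespace next to the word, and the lookup first applies the namespace restriction (the role of the must-clause) before ranking candidates. This is only a sketch under stated assumptions. The class ScopedSuggesterSketch is hypothetical, and brute-force edit distance stands in for the n-gram matching the real spellchecker performs, so it says nothing about the query-time cost asked about in Q2.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Library-free model of one shared suggestion index with a namespace field.
final class ScopedSuggesterSketch {
    private static final class Entry {
        final String word;
        final String namespace;

        Entry(String word, String namespace) {
            this.word = word;
            this.namespace = namespace;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String word, String namespace) {
        entries.add(new Entry(word, namespace));
    }

    // Keeps only words of the given namespace, then returns up to max of
    // them ordered by edit distance to the input (ties broken alphabetically).
    List<String> suggest(final String input, String namespace, int max) {
        List<String> candidates = new ArrayList<>();
        for (Entry e : entries) {
            if (e.namespace.equals(namespace) && !candidates.contains(e.word)) {
                candidates.add(e.word);
            }
        }
        Collections.sort(candidates, new Comparator<String>() {
            public int compare(String a, String b) {
                int da = distance(input, a);
                int db = distance(input, b);
                return da != db ? Integer.compare(da, db) : a.compareTo(b);
            }
        });
        return new ArrayList<>(candidates.subList(0, Math.min(max, candidates.size())));
    }

    // Plain Levenshtein distance using two rolling rows.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            int[] cur = new int[t.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[t.length()];
    }
}
```

With one shared structure like this, adding a namespace does not add another index or reader; it only adds entries.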


Q3: Or can you think of a better solution for this problem? :)

How we currently do it: we use Lucene 3.1 with Hibernate Search, and we actually already have autocompletion and spell checking scoped to one namespace. This is currently achieved by using index sharding, so each namespace has its own index and reader, and another for spell check and autocompletion. Unfortunately there are some downsides to this:

- Our faceting engine has no good support for multiple indexes, so faceting only works on a single namespace
- Needs administration for mapping namespace identifier (String) to index number (integer)
- The number of shards (and thus namespaces) is currently hardcoded. At this moment it is set to 100, which means Hibernate Search opens up 100 index readers/writers, while only n<100 are in use, and therefore:
  - Many open file descriptors
  - Hard limit on number of namespaces
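
The bookkeeping that the sharded setup requires can be pictured roughly as below. This is a hypothetical sketch, not the actual Hibernate Search configuration: the class ShardRegistrySketch and its methods are made up for illustration, and only the fixed shard count (100 in the setup described above) comes from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry mapping namespace identifiers to shard numbers.
final class ShardRegistrySketch {
    private final int maxShards;
    private final Map<String, Integer> shardByNamespace = new LinkedHashMap<>();

    ShardRegistrySketch(int maxShards) {
        this.maxShards = maxShards;
    }

    // Returns the shard already assigned to the namespace, or assigns the
    // next free one. Fails once all shards are taken (the hard limit).
    synchronized int shardFor(String namespace) {
        Integer existing = shardByNamespace.get(namespace);
        if (existing != null) {
            return existing;
        }
        if (shardByNamespace.size() >= maxShards) {
            throw new IllegalStateException("no free shard for namespace: " + namespace);
        }
        int shard = shardByNamespace.size();
        shardByNamespace.put(namespace, shard);
        return shard;
    }

    // Shards that are allocated (readers/writers open) but not yet in use.
    synchronized int unusedShards() {
        return maxShards - shardByNamespace.size();
    }
}
```

With maxShards set to 100, every namespace beyond the hundredth is rejected, and unusedShards() counts the shards that are kept open without serving any namespace.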

Therefore it seems better to switch back to having a single index for all namespaces.


Thanks!

Regards,
Elmer van Chastelet


