Hi all,

So after attending Lucene Revolution (thanks all for some really interesting talks!) I've gotten a renewed interest in using Lucene to do clever things with shingles.
The main problems with shingles seem to be that they swell the index size quite a bit and that a lot of time can be spent in term dictionary lookups. Things like the PulsingCodec help a bit with index size, but I was wondering whether anyone had considered any extra optimisations for shingles now that codecs and formats are easier to replace?

In particular, I'm thinking of a term dictionary that uses some of the data structures and optimisations commonly used for ngram language models to index shingles: e.g. a character-based FST for looking up unigram terms, and then separate, more compact data structures for looking up higher-order shingles based on the ordinals of their component unigrams. A feature like this would make Lucene quite an efficient way to score text against a number of ngram language models, as well as a relatively compact way to store those models. It could be very useful for multiclass text classification, amongst other things of interest to the NLP community.

Another thing that could help with using Lucene for better ngram language model scoring: allowing custom statistics to be stored on a per-term or a per-term-per-document basis, essentially in the same slots that document frequency and per-document term frequencies currently occupy. This would allow one to store arbitrary log-probabilities or classifier weights in postings lists, and perhaps things like per-ngram back-off constants. A bit like Payloads, but without having to be on a per-term-position basis.

I've put a few very rough sketches below my sig to make both ideas a bit more concrete.

Any thoughts welcome :)

-Matt
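For context, the shingle production I have in mind is just the standard ShingleFilter, roughly like the snippet below (a minimal sketch; exact constructors vary a little between Lucene versions). It also shows where the term-dictionary blow-up comes from: every bigram and trigram becomes its own term.

import java.io.StringReader;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    // Word shingles up to order 3, unigrams included: every bigram and
    // trigram ends up as a distinct term in the index's term dictionary.
    WhitespaceTokenizer tok =
        new WhitespaceTokenizer(Version.LUCENE_40, new StringReader("the cat sat on the mat"));
    ShingleFilter shingles = new ShingleFilter(tok, 2, 3);
    shingles.setOutputUnigrams(true);

    CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
    shingles.reset();
    while (shingles.incrementToken()) {
      System.out.println(term.toString()); // "the", "the cat", "the cat sat", ...
    }
    shingles.end();
    shingles.close();
  }
}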
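To make the term-dictionary idea concrete, here is a toy sketch (not Lucene code, all names made up). The character-level FST is faked with a plain map; the point is just that once unigrams have small integer ordinals, a bigram shingle can be looked up by binary-searching a sorted array of packed ordinal pairs, the way ngram LM toolkits usually store higher-order ngrams, instead of keeping the full "word1 word2" bytes in the main term dictionary.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy sketch (not Lucene code): unigrams get small integer ordinals
 * (in real life this would be the existing character-level FST term dict),
 * and bigram shingles are then just a sorted array of packed ordinal pairs.
 */
public class OrdinalShingleDict {
  private final Map<String, Integer> unigramOrds = new HashMap<>(); // stand-in for the char FST
  private final long[] bigramKeys; // sorted (ord1 << 32 | ord2) keys
  // in a real impl each bigram key would point at its postings / stats

  public OrdinalShingleDict(String[] unigrams, String[][] bigrams) {
    for (int i = 0; i < unigrams.length; i++) {
      unigramOrds.put(unigrams[i], i);
    }
    bigramKeys = new long[bigrams.length];
    for (int i = 0; i < bigrams.length; i++) {
      bigramKeys[i] = pack(unigramOrds.get(bigrams[i][0]), unigramOrds.get(bigrams[i][1]));
    }
    Arrays.sort(bigramKeys);
  }

  private static long pack(int ord1, int ord2) {
    return ((long) ord1 << 32) | (ord2 & 0xFFFFFFFFL);
  }

  /** Returns the bigram's slot in the sorted array, or -1 if absent. */
  public int lookupBigram(String w1, String w2) {
    Integer o1 = unigramOrds.get(w1), o2 = unigramOrds.get(w2);
    if (o1 == null || o2 == null) return -1;
    int slot = Arrays.binarySearch(bigramKeys, pack(o1, o2));
    return slot >= 0 ? slot : -1;
  }
}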
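And a purely hypothetical sketch of the custom-statistics idea; none of these interfaces exist in Lucene, they are just the shape of what I'm imagining, plus how I'd want to consume the stored log-probs when scoring a piece of text against ngram language models kept in the index (one "document" per model/class).

import java.io.IOException;

// Entirely hypothetical interfaces, nothing like this exists in Lucene today.
// The idea is to reuse the slots where docFreq and per-doc termFreq currently
// live to hold arbitrary floats (log-probs, back-off weights, classifier weights).

/** Per-term-per-document stat, read while iterating postings. */
interface StatPostingsEnum {
  int NO_MORE_DOCS = Integer.MAX_VALUE;
  int nextDoc() throws IOException;
  /** e.g. log P(ngram | this document's language model); replaces/extends freq(). */
  float stat() throws IOException;
}

/** Per-term stat, stored once in the term dictionary. */
interface StatTermsEnum {
  boolean seekExact(String shingle) throws IOException;
  /** e.g. a back-off constant for this ngram, stored where docFreq is now. */
  float termStat() throws IOException;
  StatPostingsEnum postings() throws IOException;
}

/** Sketch of the use case: score a tokenised sentence against every
 *  "document" (= language model / class) by summing stored log-probs. */
class NgramScorer {
  static void score(StatTermsEnum terms, String[] shingles, float[] scoresPerDoc)
      throws IOException {
    for (String shingle : shingles) {
      if (!terms.seekExact(shingle)) {
        continue; // in real life: back off to a lower-order shingle here
      }
      StatPostingsEnum postings = terms.postings();
      for (int doc = postings.nextDoc(); doc != StatPostingsEnum.NO_MORE_DOCS;
           doc = postings.nextDoc()) {
        scoresPerDoc[doc] += postings.stat();
      }
    }
  }
}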