Hi all,

So after attending Lucene Revolution (thanks all for some really interesting talks!) I've gotten a renewed interest in using Lucene to do clever things with shingles.
The main problems with shingles seem to be that they swell the index size quite a bit and that a lot of time can be spent in term dictionary lookups. Things like the PulsingCodec help a bit with index size, but I was wondering whether anyone had considered any extra optimisations for shingles now that codecs and formats are easier to replace?

In particular, I'm thinking of a term dictionary that uses some of the data structures and optimisations commonly used for ngram language models to index shingles: e.g. a character-based FST for looking up unigram terms, and then separate, more compact data structures for looking up higher-order shingles based on the ordinals of their component unigrams. A feature like this would make Lucene quite an efficient way to score text against a number of ngram language models, as well as a relatively compact way to store those models. It could be very useful for multiclass text classification, amongst other things of interest to the NLP community.

Another thing that could help with using Lucene for better ngram language model scoring: allowing custom statistics to be stored on a per-term or a per-term-per-document basis, essentially in the same slots that document frequency and per-document term frequencies currently occupy. This would allow one to store arbitrary log-probabilities or classifier weights in postings lists, and perhaps things like per-ngram back-off constants. A bit like Payloads, but without having to be on a per-term-position basis.

I've put a few very rough sketches below my sig to make both ideas a bit more concrete.

Any thoughts welcome :)

-Matt
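For context, the shingle production I have in mind is just the standard ShingleFilter, roughly like the snippet below (a minimal sketch; exact constructors vary a little between Lucene versions). It also shows where the term-dictionary blow-up comes from: every bigram and trigram becomes its own term.

import java.io.StringReader;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    // Word shingles up to order 3, unigrams included: every bigram and
    // trigram ends up as a distinct term in the index's term dictionary.
    WhitespaceTokenizer tok =
        new WhitespaceTokenizer(Version.LUCENE_40, new StringReader("the cat sat on the mat"));
    ShingleFilter shingles = new ShingleFilter(tok, 2, 3);
    shingles.setOutputUnigrams(true);

    CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
    shingles.reset();
    while (shingles.incrementToken()) {
      System.out.println(term.toString()); // "the", "the cat", "the cat sat", ...
    }
    shingles.end();
    shingles.close();
  }
}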
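To make the term-dictionary idea concrete, here is a toy sketch (not Lucene code, all names made up). The character-level FST is faked with a plain map; the point is just that once unigrams have small integer ordinals, a bigram shingle can be looked up by binary-searching a sorted array of packed ordinal pairs, the way ngram LM toolkits usually store higher-order ngrams, instead of keeping the full "word1 word2" bytes in the main term dictionary.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy sketch (not Lucene code): unigrams get small integer ordinals
 * (in real life this would be the existing character-level FST term dict),
 * and bigram shingles are then just a sorted array of packed ordinal pairs.
 */
public class OrdinalShingleDict {
  private final Map<String, Integer> unigramOrds = new HashMap<>(); // stand-in for the char FST
  private final long[] bigramKeys; // sorted (ord1 << 32 | ord2) keys
  // in a real impl each bigram key would point at its postings / stats

  public OrdinalShingleDict(String[] unigrams, String[][] bigrams) {
    for (int i = 0; i < unigrams.length; i++) {
      unigramOrds.put(unigrams[i], i);
    }
    bigramKeys = new long[bigrams.length];
    for (int i = 0; i < bigrams.length; i++) {
      bigramKeys[i] = pack(unigramOrds.get(bigrams[i][0]), unigramOrds.get(bigrams[i][1]));
    }
    Arrays.sort(bigramKeys);
  }

  private static long pack(int ord1, int ord2) {
    return ((long) ord1 << 32) | (ord2 & 0xFFFFFFFFL);
  }

  /** Returns the bigram's slot in the sorted array, or -1 if absent. */
  public int lookupBigram(String w1, String w2) {
    Integer o1 = unigramOrds.get(w1), o2 = unigramOrds.get(w2);
    if (o1 == null || o2 == null) return -1;
    int slot = Arrays.binarySearch(bigramKeys, pack(o1, o2));
    return slot >= 0 ? slot : -1;
  }
}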
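And a purely hypothetical sketch of the custom-statistics idea; none of these interfaces exist in Lucene, they are just the shape of what I'm imagining, plus how I'd want to consume the stored log-probs when scoring a piece of text against ngram language models kept in the index (one "document" per model/class).

import java.io.IOException;

// Entirely hypothetical interfaces, nothing like this exists in Lucene today.
// The idea is to reuse the slots where docFreq and per-doc termFreq currently
// live to hold arbitrary floats (log-probs, back-off weights, classifier weights).

/** Per-term-per-document stat, read while iterating postings. */
interface StatPostingsEnum {
  int NO_MORE_DOCS = Integer.MAX_VALUE;
  int nextDoc() throws IOException;
  /** e.g. log P(ngram | this document's language model); replaces/extends freq(). */
  float stat() throws IOException;
}

/** Per-term stat, stored once in the term dictionary. */
interface StatTermsEnum {
  boolean seekExact(String shingle) throws IOException;
  /** e.g. a back-off constant for this ngram, stored where docFreq is now. */
  float termStat() throws IOException;
  StatPostingsEnum postings() throws IOException;
}

/** Sketch of the use case: score a tokenised sentence against every
 *  "document" (= language model / class) by summing stored log-probs. */
class NgramScorer {
  static void score(StatTermsEnum terms, String[] shingles, float[] scoresPerDoc)
      throws IOException {
    for (String shingle : shingles) {
      if (!terms.seekExact(shingle)) {
        continue; // in real life: back off to a lower-order shingle here
      }
      StatPostingsEnum postings = terms.postings();
      for (int doc = postings.nextDoc(); doc != StatPostingsEnum.NO_MORE_DOCS;
           doc = postings.nextDoc()) {
        scoresPerDoc[doc] += postings.stat();
      }
    }
  }
}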