RE: TrieRangeQuery for contrib?

Uwe Schindler Tue, 25 Nov 2008 12:08:36 -0800

Hi Mike,

> > - Inside Luke, the values of such "Trie" fields are not human readable
> > (because of the encoding). Even when stored, the current
> > implementation uses
> > the special encoding to store the field. For displaying the field
> > you have
> > to use the decoder from the TrieUtils class. But this is the same with
> > current DateUtils from Lucene (but they are more readable :-) )
> 
> This seems OK, for starters.  Eventually maybe such a "range field"
> could provide an interface that knows how to "subdivide" intervals on
> its space of all terms, assigning more human readable labels to these
> subdivisions, instead of always casting to unsigned long.


This may be a way to encode the values differently. The other approach would be
to simply "store" the field value as a plain human-readable string, but have
the terms for searching encoded. For my project panFMP, the encoded variant
fit more cleanly into the framework.

Such an interface would be more complicated to implement, but it may be
possible (I haven't thought about it). Note: the encoded values for the
different precisions are not stored in different fields; all precisions
share the same field name, which makes index maintenance easier. The values
with lower precision are prefixed by a "precision" marker, which makes it
possible to distinguish them and to position the Term iterator at the lower
bound of the current precision for the range. Without the prefix, the values
for the different precisions would be mixed. The first version of
TrieRangeQuery used different field names, but with the prefix trick, all
terms can be in one field.
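
To illustrate the prefix trick, here is a minimal sketch in plain Java. It
is not the actual TrieUtils code: the class name, the marker characters
('g', 'h', ...), the hex encoding and the 8-bit step between precisions
are all assumptions made for this example. It only shows how a marker keeps
the terms of each precision contiguous in the sorted term list of a single
field.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; names and encoding are assumptions,
// not the real TrieUtils implementation.
class PrefixedTrieTerms {
    // Assumed layout: 8 precision levels, each dropping 8 more low bits.
    static final int LEVELS = 8;

    // Marker for a level: empty for full precision, 'g', 'h', ... otherwise.
    // These sort after every lowercase hex digit, so levels never interleave.
    static String marker(int level) {
        return level == 0 ? "" : String.valueOf((char) ('g' + level - 1));
    }

    // Encodes a long at the given level (0 = full precision).
    static String encode(long value, int level) {
        // Flip the sign bit so unsigned hex order matches signed numeric order.
        long sortable = value ^ Long.MIN_VALUE;
        long truncated = sortable >>> (8 * level);
        int digits = 16 - 2 * level;
        return marker(level) + String.format("%0" + digits + "x", truncated);
    }

    // Adds the terms of all precisions for one value to the same "field".
    static void addValue(SortedSet<String> fieldTerms, long value) {
        for (int level = 0; level < LEVELS; level++) {
            fieldTerms.add(encode(value, level));
        }
    }

    // Returns the contiguous block of terms belonging to one precision.
    static SortedSet<String> termsAtLevel(SortedSet<String> fieldTerms, int level) {
        String from = level == 0 ? "0" : marker(level);
        String to = marker(level + 1); // first possible term of the next level
        return fieldTerms.subSet(from, to);
    }

    public static void main(String[] args) {
        SortedSet<String> fieldTerms = new TreeSet<String>();
        for (long v : new long[] {-5L, 0L, 255L, 256L, 100000L}) {
            addValue(fieldTerms, v);
        }
        // Each level forms its own sorted block inside the one term list.
        for (int level = 0; level < 3; level++) {
            System.out.println("level " + level + ": " + termsAtLevel(fieldTerms, level));
        }
    }
}
```

Because every marker sorts after the hex digits, a term iterator can be
positioned at the lower bound within one precision's block and will not run
into terms of another precision until that block ends.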

> > Comparisons with the above 500,000 doc index showed that the old
> > RangeQuery
> > (with raised BooleanQuery clause count) took about 30-40 secs to
> > complete,
> > ConstantScoreRangeQuery took 5 secs and TrieRangeQuery took <100ms to
> > complete (on an Opteron64 machine, Java 1.5). You can test a little
> > bit on
> > http://www.pangaea.de/advanced/advsearch.php by entering something
> > into the
> > geographic bounding box or temporal coverage. As you can see, the
> > usage of
> > this range query type is optimal for geographic searches using
> > doubles (not
> > fixed decimals!), longs or dates as keys.
> 
> Wow it's very fast!  I first searched for "water", which returned
> ~428K docs, then bounded it roughly around Africa and it returned ~78K
> docs, very quickly.  Now I'd really love to get this into Lucene!

Thank you! I see you also tested the map :)

> > - I want to be able to develop the code further once in contrib, is
> > this
> > possible? How would be the best to handle this? Let the code stay in
> > my SVN
> > and you update it or let me commit to the contrib folder in Lucene?
> > Currently the code is in SVN of panFMP (www.panfmp.org) that uses
> > it. When
> > donated to Apache, I would put a dependency into panFMP to the contrib
> > Package, once released and remove it from my tree. I do not want to
> > get the
> > code into a dead end or start a fork of it inside contrib, because I
> > want to
> > actively maintain it.
> 
> I think for starters open an issue, attach a patch, and then we
> iterate from there?  Probably having the code in Apache's SVN, with
> the eventual goal of giving you commit rights to contrib, is what we
> should aim for?

I think this would be good. Give me some time to prepare the patch, and then
we can discuss it in JIRA.

Thanks,
Uwe


