Hey Jason
o.a.l.s.trie looks interesting and has a lot of potential, locallucene
1.5+ release moved to a Cartesian tier system away from
the boundary box filter a while though.
A TierRange or RangeFilter as the one I used in v1.0 was a little
inefficient as you have to do a bit AND on 2 range look ups
e.g.
RangeFilter(min-latitude, max-latitude) AND
RangeFilter(min-longitude, max-longitude)
(I extended the Filter class with an ISerialChainFilter to improve
performance)
The 1.5+ version of locallucene does it differently, where I
pre-generate the bounding shape's Cartesian id's, so all the boxes that
make up
the overall bounding box, and simply pull the matching doc id's out of
the TermEnumerator.
Take a look at the CartesianShapeFilter
http://locallucene.svn.sourceforge.net/viewvc/locallucene/trunk/locallucene/src/java/com/pjaol/search/geo/utils/CartesianShapeFilter.java?revision=66&view=markup
This gives you a bounding box lookup of about 3 - 4 ms on a 3 million
doc index.
Thanks
Patrick
Sean Timm wrote:
About LocalLucene, it would benefit (be faster) by integrating with
TrieRangeQuery for the bounding box filter.
On Sun, Dec 14, 2008 at 3:54 AM, Michael
McCandless <luc...@mikemccandless.com>
wrote:
I'd
also personally like to see 2.9 released sooner rather than later,
maybe earliesh next year?
I don't think we should hold up 2.9 for some of the big items below
(eg Fieldable/AbstractField/Field cleanup), especially if they have
not even been started yet.
One question: I'm assuming after 2.9 is out, we would fairly quickly
follow that up with a 3.0 that has more or less just removed
deprecations?
(Vs doing alot of dev putting new features into 3.0 as well).
More comments below:
Grant Ingersoll wrote:
1. Splitting Index time Document from Search time Document per Hoss'
ideas on a variety of threads in the past. Something to the gist of
having an InputDocument and an OutputDocument (and maybe an abstract
Document for shared features) such that people wouldn't be confused
about calling index time things on Document during search and vice
versa.
Maybe don't hold 2.9 for this one? (There's been lots of discussion,
and also recently interesting discussion on adding type safety to
Document under LUCENE-831, but nothing yet concrete).
2. Java 1.5 (who knows, maybe by 2020, we can be on 1.6!). This means
we can use Generics, or as I like to call them "Specifics" since the
specifically say what is in the collection as opposed to the current
collections where you can put any generic object in them. :-)
We get this one "for free" ;)
3. Michael B. is proposing a new Token API (but it's back compat.)
Already in and quite a big cutover.
4. Mike M. is doing some new flex indexing stuff (but it's back compat.)
I would not hold 3.0 for this; it's still big & exploratory at this
point.
5. Is there anything we would need to deprecate now if we were to take
advantage of 1.5 concurrent packages?
Has anyone looked into this?
6. Local Lucene is of interest to a lot of people. Does it require
anything special in terms of deprecation? (me thinks not)
Any more clarity on this? I would also assume not.
7. Same goes for the real time stuff, PFOR implementation,
column-stride fields, etc.
I think we tackle real-time needs one by one (eg LUCENE-1484, removing
sync on IndexReader.document(), which I'm working on now, doesn't
deprecate anything). PFOR I think is blocked by flex indexing.
Column-stride fields would be great to get it, but doesn't seem to
have any forward motion for quite a while...
8. I think we should do a review of what's open in JIRA again and see
if we can come to conclusions on any of them, such that going into 3.0,
JIRA is relatively clean.
I've been trying over time to mark things as fix version 2.9, so we are
at least forced to review them come 2.9.
9. For 3.0, what cruft from 1.x can we remove from the file format,
since 3.x need not read 1.x format _if_ doing so is advantageous to us?
I'm not sure offhand. There's alot of scary cruft in
SegmentInfo.files(), but it's from 2.0 -> 2.1 so we need to keep it
for now (to remove in 4.0, in 2020 ;) ).
10. There has been some talk about changing how StandardTokenizer
labels some tokens. What can we do in there to deprecate?
I think we need a more incremental approach, somehow, for
StandardTokenizer. Like it does its own internal versioning or
something. There have been lots of little cases over time where it
needs fixing, yet, it would be a break in back compat to fix them.
11. Fieldable. Ah, Fieldable. I believe this is going to become an
abstract base class, or go away.
This is a biggie and nobody's stepped up so far to tackle it... I would
say don't hold up 2.9 for this.
Maybe add these ones:
12. LUCENE-1483 -- running Scorer & HitCollector "per segment". We
are making good progress here, and uncovering some nice per-query
performance wins plus much faster searcher warming (sicne FieldCache is
only used per-segment). On the current path it looks likely to
deprecate current Field sorting classes, so it'd be great to get this
in before 2.9.
13. LUCENE-831 (new FieldCache API). This is long standing and there's
a fair amount of interest, and through our iterations with LUCENE-1483
(one of the primary users of the FieldCache API, field sorting) we are
getting more clarity on what a new FieldCache API should look like.
It'd be nice to resolve before 2.9, and I'd like to spend time doing
so (after / with LUCENE-1483).
Mike
--
Patrick O'Leary
AOL Local Search Technologies
Phone: + 1 703 265 8763
You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles.
Do you understand this?
And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat.
- Albert Einstein
View
Patrick O Leary's profile
|