Re: [Fwd: Re: 2.9, 3.0 and deprecation]

patrick o'leary Mon, 15 Dec 2008 13:45:25 -0800

Hi Uwe

True it's not a generic solution, but then again I wouldn't really consider geo-search a generic ask.
The indexing format for locallucene uses something I call a tier approach, similar to zoom levels in other mapping solutions.

Each tier has a separate set of projects or Cartesian id's, and the projection interface has a bestFit function providing you with the optimal
tier to search on.

A quick explanation with graphics is here: http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html

The tier level and # of Cartesian id's are proportional thus a tier level 0 has 1 Cartesian id of "0.0" representing all results
tier level 1 has 4 possible id's, level 2 has 16 etc.,
Where # ids = [ (2^tier) ^2] again specialized but on purpose to provide optimal lookup.

Intersections and overlapping MBB's can also be done using best fit look up's using the TermDocs seek.

I've only read through the jdoc of tier so far, but I'm guessing it's doing a dictionary search and splitting the the index readers position
based on the result being less than or greater than upper / lower values. Which may be faster than a TermDocs seek, and certainly
worth while investigating.

But the ability to reduce your search from a 2 range filter look up to just a minimal set of term seeks, is what will give you better performance.
The bestFit function if I remember is designed to give you between 1 and 6 terms to lookup, and for a boundary box search that's all you need.

Thanks
Patrick

Uwe Schindler wrote:

Hi Patrick,

very interesting approach. In my opinion to compare:

A standard RangeFilter is like using a standard relational database with separate lat/lon fields, no index and operator “between”. TrieRangeFilter is the same like adding indexes to the lat/lon fields, which is for most cases enough. Your LocalLucene approach is like using a relational database (e.g. Oracle) that is able to directly handle point/bbox coordinates and index them efficient.

Is this correct? The more special the implementation is for the underlying data structure, the faster it is J. The drawback of your solution is, that is too specialized and TrieRangeQuery is optimal for ranges in a wider usage outsie of local queries.

How does your implementation behave, when the query hits e.g. half of all documents or somebody selects (-180,-90,180,90) to get all documents? How does it behave with half open ranges, intersections?

Uwe

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Publishing Network for Geoscientific and Environmental Data
MARUM - University of Bremen
Room 2500, Leobener Str., D-28359 Bremen
Tel.: +49 421 218 65595
Fax: +49 421 218 65505
http://www.pangaea.de/
E-mail: [email protected]
From: patrick o'leary [mailto:[email protected]]
Sent: Monday, December 15, 2008 9:14 PM
To: [email protected]
Subject: Re: [Fwd: Re: 2.9, 3.0 and deprecation]

Hey Jason

o.a.l.s.trie looks interesting and has a lot of potential, locallucene 1.5+ release moved to a Cartesian tier system away from
the boundary box filter a while though.

A TierRange or RangeFilter as the one I used in v1.0 was a little inefficient as you have to do a bit AND on 2 range look ups
e.g.

RangeFilter(min-latitude, max-latitude) AND RangeFilter(min-longitude, max-longitude)
(I extended the Filter class with an ISerialChainFilter to improve performance)

The 1.5+ version of locallucene does it differently, where I pre-generate the bounding shape's Cartesian id's, so all the boxes that make up
the overall bounding box, and simply pull the matching doc id's out of the TermEnumerator.

Take a look at the CartesianShapeFilter http://locallucene.svn.sourceforge.net/viewvc/locallucene/trunk/locallucene/src/java/com/pjaol/search/geo/utils/CartesianShapeFilter.java?revision=66&view=markup

This gives you a bounding box lookup of about 3 - 4 ms on a 3 million doc index.

Thanks
Patrick

Sean Timm wrote:

Subject:

Re: 2.9, 3.0 and deprecation

From:

"Jason Rutherglen" <[email protected]>

Date:

Mon, 15 Dec 2008 12:29:38 -0500

To:

<[email protected]>

To:

<[email protected]>

About LocalLucene, it would benefit (be faster) by integrating with TrieRangeQuery for the bounding box filter.

On Sun, Dec 14, 2008 at 3:54 AM, Michael McCandless <[email protected]> wrote:

I'd also personally like to see 2.9 released sooner rather than later,
maybe earliesh next year?

I don't think we should hold up 2.9 for some of the big items below
(eg Fieldable/AbstractField/Field cleanup), especially if they have
not even been started yet.

One question: I'm assuming after 2.9 is out, we would fairly quickly
follow that up with a 3.0 that has more or less just removed deprecations?
(Vs doing alot of dev putting new features into 3.0 as well).

More comments below:

Grant Ingersoll wrote:

1. Splitting Index time Document from Search time Document per Hoss' ideas on a variety of threads in the past. Something to the gist of having an InputDocument and an OutputDocument (and maybe an abstract Document for shared features) such that people wouldn't be confused about calling index time things on Document during search and vice versa.

Maybe don't hold 2.9 for this one? (There's been lots of discussion, and also recently interesting discussion on adding type safety to Document under LUCENE-831, but nothing yet concrete).

2. Java 1.5 (who knows, maybe by 2020, we can be on 1.6!). This means we can use Generics, or as I like to call them "Specifics" since the specifically say what is in the collection as opposed to the current collections where you can put any generic object in them. :-)

We get this one "for free" ;)

3. Michael B. is proposing a new Token API (but it's back compat.)

Already in and quite a big cutover.

4. Mike M. is doing some new flex indexing stuff (but it's back compat.)

I would not hold 3.0 for this; it's still big & exploratory at this point.

5. Is there anything we would need to deprecate now if we were to take advantage of 1.5 concurrent packages?

Has anyone looked into this?

6. Local Lucene is of interest to a lot of people. Does it require anything special in terms of deprecation? (me thinks not)

Any more clarity on this? I would also assume not.

7. Same goes for the real time stuff, PFOR implementation, column-stride fields, etc.

I think we tackle real-time needs one by one (eg LUCENE-1484, removing sync on IndexReader.document(), which I'm working on now, doesn't deprecate anything). PFOR I think is blocked by flex indexing. Column-stride fields would be great to get it, but doesn't seem to have any forward motion for quite a while...

8. I think we should do a review of what's open in JIRA again and see if we can come to conclusions on any of them, such that going into 3.0, JIRA is relatively clean.

I've been trying over time to mark things as fix version 2.9, so we are at least forced to review them come 2.9.

9. For 3.0, what cruft from 1.x can we remove from the file format, since 3.x need not read 1.x format _if_ doing so is advantageous to us?

I'm not sure offhand. There's alot of scary cruft in SegmentInfo.files(), but it's from 2.0 -> 2.1 so we need to keep it for now (to remove in 4.0, in 2020 ;) ).

10. There has been some talk about changing how StandardTokenizer labels some tokens. What can we do in there to deprecate?

I think we need a more incremental approach, somehow, for StandardTokenizer. Like it does its own internal versioning or something. There have been lots of little cases over time where it needs fixing, yet, it would be a break in back compat to fix them.

11. Fieldable. Ah, Fieldable. I believe this is going to become an abstract base class, or go away.

This is a biggie and nobody's stepped up so far to tackle it... I would say don't hold up 2.9 for this.

Maybe add these ones:

12. LUCENE-1483 -- running Scorer & HitCollector "per segment". We are making good progress here, and uncovering some nice per-query performance wins plus much faster searcher warming (sicne FieldCache is only used per-segment). On the current path it looks likely to deprecate current Field sorting classes, so it'd be great to get this in before 2.9.

13. LUCENE-831 (new FieldCache API). This is long standing and there's a fair amount of interest, and through our iterations with LUCENE-1483 (one of the primary users of the FieldCache API, field sorting) we are getting more clarity on what a new FieldCache API should look like. It'd be nice to resolve before 2.9, and I'd like to spend time doing so (after / with LUCENE-1483).

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
--
Patrick O'Leary
 
AOL Local Search Technologies
Phone: + 1 703 265 8763
 
You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles.
 Do you understand this? 
And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat.
  - Albert Einstein
View Patrick O Leary's profile

Patrick O'Leary


AOL Local Search Technologies
Phone: + 1 703 265 8763

You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles.
 Do you understand this? 
And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat.
  - Albert Einstein

View Patrick O Leary's profile

Re: [Fwd: Re: 2.9, 3.0 and deprecation]

Reply via email to