[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib

Uwe Schindler (JIRA) Wed, 26 Nov 2008 15:59:48 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651185#action_12651185
 ]


Uwe Schindler commented on LUCENE-1470:
---------------------------------------

I wanted to go to bed, but it is the same every time, cannot sleep :-)

The discussions about TrieRangeQuery are the same like two years ago when I 
prepared the original paper cited in the documentation with my mentor/colleague 
Michael Diepenbroek (the second author on the paper) -- here my comments about 
the suggestions to Mike McCandless and Earwin, after looking into my notices, I 
can explain:

When looking for the first time on the algorithm to split the range, the ideas 
of Mike and Earwin are very clear: You are both interested in Dates and both of 
you have the same idea, how to divide them into parts with different 
"precisions". For dates it is rather simple: If you have a range between March 
1999 and August 2006, the two precisions "full years" and "months" are very 
easy to understand when creating the ranges. You would have three range 
queries: Mar'1999 ... Dec'1999, 2000 ... 2005, Jan'2006 ... Aug'2006 very easy. 
You have to visit 10+6+8=24 terms. Simple. The maximum number of terms would be 
12+12+numberOfTotalPossibleYears. This is easy to understand and very effective.

My approach in principle does the same, but the values are simply devided into 
bytes. There is no problem with it to store the number of milliseconds since 
1970-01-01 in different precisions. The total range of this long would be very 
very large (I do not want to calculate now how many years BC the value -2^-31ms 
would be). If you have a range like the one above you have e.g. longs like 
0x0000323113422134 TO 0x0000325113422174. You see the frront is identical, the 
center of the range maybe only goes to the precision of 12 digits. I said the 
long ist divided into 8 bytes, so you would use only the end of the range in 
highest precsision, the center in lower. As the lowest precsison does not 
change (it would only change if you are thousands of years from start to end), 
my algorithm would not use it, it would only go downto the precision that 
exists. The if-clause "(length==1 || minShort.compareTo(maxShort)>=0)" is 
responsible for that. If the lower and upper end of the next lower precision 
range is zero or inverted, it is not further calculated.

Because of this shortcut it does not make a difference if the term values are 
distributed in the index very far from each other or are all on one place (the 
algorithm works as good, if your index contains prices between 1$ and 100$ or 
if you have double that give ages of millions of years and you want to do a 
range select on them. Independent on the selected range the number of terms to 
be visited is limited by my algorithm to the number of 3825 = (a maximum of 255 
terms in the lowermost range)+(a maximum of 255 terms in the uppermost 
range)+(255)+(255)+......+(255 in center of range). This is a hard limit, and 
this limit can only be reached if you would have an index with 2^64-2 
documents, that have 2^64-2 different long values and you have a range that 
selects all of them (only in this case you have to visit 3825 terms!).

You can test a little bit with the new test case in the latest patch (change 
some numbers) and uncomment the System.out.println() for the number of terms 
seen. You will see, even with large indexes, the number of terms in mostly 
about 400. Only very very seldom more terms were visited (I have seen in our 
500,000 docs index only one case with 750 terms, which can happen if you have a 
non-homogenous dispersion of values that has a local maximum near one of the 
range bounds). Important: The TrieRangeQuery only makes sense if your index as 
a minimum of about 500 docs, if there are less, you mostly have to visit all 
terms. Just play with the test case and do some statistics!

Because of all this, I see no real need for giving the user a customized way to 
convert the values into Terms and calculate lower precisions of it. The above 
code works performant (you have tested it yesterday with our index) even for 
doubles, that are very seldom homogenous distributed through the whole LONG 
range if stored as IEEE-754 bitmap. So TrieRangeQuery in the current version is 
effective for longs, doubles and Dates (as they are longs). The only problem 
would be fixed comma numbers with many digits (e.g. BigDecimals), they cannot 
simply casted to longs. For these type of numbers, TrieUtils could be extended 
to generate longer strings.

To space usage: You are right, if you only want to have ranges on year/month, a 
long is too much. But keep in mind, the length of terms is not so important for 
index size, important is the number of different terms in the TermEnum. So I 
see no problem with storing integers and doubles and dates as a 64bit value in 
index - its only little bit extra term length with "unused bits".

> Add TrieRangeQuery to contrib
> -----------------------------
>
>                 Key: LUCENE-1470
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1470
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 2.4
>            Reporter: Uwe Schindler
>            Assignee: Michael McCandless
>         Attachments: LUCENE-1470.patch, LUCENE-1470.patch
>
>
> According to the thread in java-dev 
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to 
> include my fast numerical range query implementation into lucene 
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in index in a special format.
> The idea behind this is to store the longs in different precision in index
> and partition the query range in such a way, that the outer boundaries are
> search using terms from the highest precision, but the center of the search
> Range with lower precision. The implementation stores the longs in 8
> different precisions (using a class called TrieUtils). It also has support
> for Doubles, using the IEEE 754 floating-point "double format" bit layout
> with some bit mappings to make them binary sortable. The approach is used in
> rather big indexes, query times are even on low performance desktop
> computers <<100 ms (!) for very big ranges on indexes with 500000 docs.
> I called this RangeQuery variant and format "TrieRangeRange" query because
> the idea looks like the well-known Trie structures (but it is not identical
> to real tries, but algorithms are related to it).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib

Reply via email to