Hi Uwe,

Thanks for the explanation! It really helps. It makes sense that for a field
with a small number of distinct values, such as "hour", NumericField is not
going to help me.

I'm experimenting with an epoch NumericField for sorting, which, funnily
enough, is where I started with 2.4.1 before going down the usual
TooManyClauses path and breaking the timestamp down into multiple fields.
2.9 seems a great improvement there.

Downloading the new 2.9 rc4...

Thanks,
Phil

On Sat, Sep 12, 2009 at 1:55 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi Phil,
>
> thanks for checking out NumericField. I have two comments about your
> problem:
>
>> I've used NumericField to store my "hour" field.
>>
>> Example...
>>
>> doc.add(new NumericField("hour").setIntValue(Integer.parseInt("12")));
>
> NumericField uses a special encoding of terms for fast NumericRangeQueries.
> It indexes more than one term per value. How many terms depends on the
> precisionStep ctor parameter. If you set it to infinity (or anything greater
> than or equal to the bit size of your value, 32 for ints), it indexes exactly
> one term per value. These terms are used for very fast numeric queries. The
> extra overhead only pays off for fields with high cardinality (more than
> about 500 distinct values). For a simple hour field with 24 distinct values,
> the speed gain from NumericRangeQuery would be negligible; it may even be a
> little slower because of the additional overhead. I would suggest using
> NumericField only for real high-cardinality fields (like unix time stamps,
> prices, latitudes/longitudes (all types of floats/doubles), day of year, ...).
>
> Maybe I will add this to the javadocs.
>
>> Before I was using plain string Field and enumerating them with
>> TermEnum, which worked fine.
>> Now that I'm using NumericFields, I'm not sure how to port this
>> enumeration code.
>
> As explained above, each numeric value is indexed by more than one term, so
> your TermEnum is of no use as it stands. There are some tricks to get the
> distinct values, but they need deeper knowledge of the underlying term
> structure (encoding of terms, shift value, ...; see the FieldCache parsers
> for numeric fields).
>
> As your field (hours) is of low cardinality, you can index with
> precisionStep=Integer.MAX_VALUE. Range queries will then be no faster than
> with a normal TermRangeQuery, but your term enum will work.
> You only have to use NumericUtils.prefixCodedToInt() to decode the term
> into an int:
>
> hours.add( Integer.valueOf( NumericUtils.prefixCodedToInt( term.text() ) ) );
>
> This code would also work for other precision steps, but you would get some
> additional "lower precision terms" (values with some lower bits removed).
> You have to break iteration in this case (see FieldCache code).
>
>> Any pointers?
>>
>> This is the code I was using previously for plain Fields.
>>
>> ArrayList hours = new ArrayList();
>> TermEnum termEnum = reader.terms( new Term( "hour", "" ) );
>> Term term = null;
>> while ( ( term = termEnum.term() ) != null ) {
>>
>>     // stop once the enumeration leaves the "hour" field
>>     if ( ! term.field().equals( "hour" ) )
>>         break;
>>
>>     hours.add( Integer.valueOf( term.text() ) );
>>     termEnum.next();
>> }
>>
>> Thanks,
>> Phil
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Mobile: +1 778-233-4935
Website: http://philw.co.uk
Skype: philwhelan76
Twitter: philwhln
Email : phil...@gmail.com
iChat: philw...@mac.com
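
For illustration, the precisionStep behaviour described in the quoted reply can be modelled in plain Java without Lucene on the classpath. This is only a simplified sketch under stated assumptions, not Lucene's implementation: the class TrieTermSketch and all of its helpers are hypothetical names, a term is modelled as a (shift, prefix) pair packed into a long rather than as Lucene's prefix-coded string, and only non-negative ints are handled. It shows how many terms one value produces for a given precisionStep, and why an enumeration over the sorted terms has to stop at the first lower-precision term to collect the distinct values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Simplified model of trie-encoded int terms (hypothetical, not Lucene code). */
final class TrieTermSketch {

    static final int INT_BITS = 32;

    /** Packs (shift, prefix) into a long so natural ordering sorts by shift first. */
    static long term(int shift, int prefix) {
        return ((long) shift << 32) | (prefix & 0xFFFFFFFFL);
    }

    static int shiftOf(long term) { return (int) (term >>> 32); }

    static int prefixOf(long term) { return (int) term; }

    /** Modelled terms for one value: shift = 0, step, 2*step, ... while shift < 32. */
    static List<Long> termsFor(int value, int precisionStep) {
        if (precisionStep < 1) throw new IllegalArgumentException("precisionStep must be >= 1");
        if (value < 0) throw new IllegalArgumentException("this sketch models non-negative values only");
        List<Long> terms = new ArrayList<>();
        // A long counter avoids overflow when precisionStep is Integer.MAX_VALUE.
        for (long shift = 0; shift < INT_BITS; shift += precisionStep) {
            // Each term keeps the value with its lowest `shift` bits removed.
            terms.add(term((int) shift, value >>> (int) shift));
        }
        return terms;
    }

    /** All distinct modelled terms for a set of values, in sorted (shift-major) order. */
    static SortedSet<Long> index(int[] values, int precisionStep) {
        SortedSet<Long> terms = new TreeSet<>();
        for (int v : values) terms.addAll(termsFor(v, precisionStep));
        return terms;
    }

    /** Walks the sorted terms and stops at the first lower-precision term. */
    static List<Integer> distinctValues(SortedSet<Long> terms) {
        List<Integer> values = new ArrayList<>();
        for (long t : terms) {
            if (shiftOf(t) > 0) break; // everything from here on is lower precision
            values.add(prefixOf(t));
        }
        return values;
    }

    public static void main(String[] args) {
        int[] hours = {12, 3, 12, 23, 3};
        System.out.println(termsFor(12, 4).size());                  // 8
        System.out.println(termsFor(12, Integer.MAX_VALUE).size());  // 1
        System.out.println(distinctValues(index(hours, 4)));         // [3, 12, 23]
    }
}
```

With a precisionStep at or above the bit size, every value maps to a single full-precision term, so the early break never triggers and a plain enumeration sees exactly the distinct values.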