Re: Binary indexing / query efficiency

eks dev Tue, 14 Apr 2009 16:19:09 -0700

You can store binary values, e.g. with:
Field(String name, byte[] value, Field.Store store)


You could store all your fields as byte[], so you get them back as byte[]. How
you index them is a separate problem, but since you are having no speed
problems on that side in your case, leave it as it is.

Try simply creating a pair of fields for each field you have now: one stored
and not indexed, the other indexed and not stored. Alternatively, keep the
fields you search on as indexed only, and add one big byte[] field in which you
encode the whole document (a blob). If the encoding gets complex, you could try
protobuf, thrift...
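
A minimal sketch of the "one big byte[] field" idea, using only java.io and no
Lucene classes. The class name DocBlob and its three fields (make, year, price)
are hypothetical, chosen just for illustration, and the hand-rolled layout
stands in for protobuf or thrift. Integers are written as raw 4-byte ints, so
they skip the padded-String round trip on retrieval:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical record that packs one document's stored values into a byte[]. */
final class DocBlob {
    final String make; // hypothetical string field
    final int year;    // hypothetical integer field
    final int price;   // hypothetical integer field

    DocBlob(String make, int year, int price) {
        this.make = make;
        this.year = year;
        this.price = price;
    }

    /** Layout: make as length-prefixed modified UTF-8, then year and price as 4-byte ints. */
    byte[] toBytes() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(make);
            out.writeInt(year);
            out.writeInt(price);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            // In-memory streams are not expected to fail; rethrow unchecked.
            throw new IllegalStateException(e);
        }
    }

    /** Reads the fields back in the same order they were written. */
    static DocBlob fromBytes(byte[] blob) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
            String make = in.readUTF();
            int year = in.readInt();
            int price = in.readInt();
            return new DocBlob(make, year, price);
        } catch (IOException e) {
            // Truncated or malformed input ends up here.
            throw new IllegalArgumentException("not a valid DocBlob", e);
        }
    }

    public static void main(String[] args) {
        byte[] blob = new DocBlob("ford", 2009, 15000).toBytes();
        DocBlob copy = DocBlob.fromBytes(blob);
        System.out.println(copy.make + " " + copy.year + " " + copy.price);
    }
}
```

The resulting array would go into a stored-only field through the
Field(String, byte[], Field.Store) constructor above, next to the indexed-only
fields used for searching. On the retrieval side it should be readable with
Document.getBinaryValue(String), assuming your Lucene version provides it.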

Anyhow, your idea of using byte[] as an indexed unit that can be searched is
maybe not all that bad, but it does not look like you need it, and it is not an
easy thing to change (I guess).


----- Original Message ----
> From: "Eger, Patrick" <pe...@automotive.com>
> To: java-user@lucene.apache.org
> Sent: Tuesday, 14 April, 2009 20:12:34
> Subject: Binary indexing / query efficiency
> 
> Hi, was recently looking to incorporate Lucene for a simple
> "parametric"/"faceted" type search.  The documents are very small,
> roughly 15 fields of short length (5-15 characters, generally strings
> and padded integers). When profiling query performance of our
> application, which inserts 1 million documents then
> 1) filters on 1-3 fields with simple boolean/term matches
> 2) stores these docids in a BitSet
> 3) calls IndexSearcher.doc() to retrieve all matching documents (all
> fields, 100 - 1,000,000 results per call)
> 
> It turns out that 98% of the query time was spent not actually doing the
> query, but within the IndexSearcher.doc() call.
> 
> My first question is, is there any way to more efficiently get
> (all/most) of the fields for a set of documents, other than iterating
> and calling doc()?
> 
> Additionally, is there any way (or planned feature) to index *binary*
> data? Using a profiler, I have determined that String decoding is a
> significant performance limiter for my use-case:
> 
> 90% of the application time is spent in this method:
> ---------------------------------------
> org.apache.lucene.index.FieldsReader.addField(Document, FieldInfo,
> boolean, boolean, boolean)
> 
> 
> 46% of the application time is spent decoding strings (half of the above
> addField() time):
> ---------------------------------------
> org.apache.lucene.store.IndexInput.readString()
>     java.lang.String.<init>(byte[], int, int, String)
>         java.lang.StringCoding.decode(String, byte[], int, int)
>             java.lang.StringCoding$StringDecoder.decode(byte[], int, int)
> 
> (YJP profiler output available if needed)
> 
> String.intern() was my top hot spot, but my patch was accepted and fixed
> this: https://issues.apache.org/jira/browse/LUCENE-1600. I'm not
> familiar enough with the lucene codebase to figure out the above though,
> so thought I would ask.
> 
> 
> 
> //ideally I'd be able to add a binary field like this:
> doc.add(new Field("f1",new
> byte[]{1,2,3,4},Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS));
> 
> //then query like:
> Query q = new TermQuery(new Term("f1", new byte[]{1,2,3,4}));
> searcher.search(q,...);
> 
> Which would allow me to avoid the Integer -> String -> Padded String ->
> String -> Integer coding/decoding to index an integer, and avoid Object
> -> String -> Object conversion (which per above is quite expensive). 
> 
> 
> Thanks for any help!
> 
> 
> Regards, 
> 
> Patrick
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
