Uwe,
I think Mathias was talking about the case with many smallish fields that all
get read per document. The DV approach would mean seeking N times, while stored
fields need only one seek? Or did you mean he should encode all his fields into
a single byte[]?
Or did I get it all wrong about stored vs DV :)
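(For reference, roughly what the single-byte[]-per-document DV variant could look like; just a sketch against the 4.x DocValues API, the "feature" field name and the surrounding variables are made up:)

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;

// given: IndexReader reader, byte[] featureBytes

// indexing: pack the whole feature into one binary DV field per document
Document doc = new Document();
doc.add(new BinaryDocValuesField("feature", new BytesRef(featureBytes)));

// linear scan at search time: per-segment random access, no stored-field decompression
for (AtomicReaderContext ctx : reader.leaves()) {
  BinaryDocValues dv = ctx.reader().getBinaryDocValues("feature"); // null if the field has no DV
  BytesRef scratch = new BytesRef();
  for (int docID = 0; docID < ctx.reader().maxDoc(); docID++) {
    dv.get(docID, scratch);        // scratch now points at the feature bytes
    // ... compute distance to the query feature here ...
  }
}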
What helped a lot in a similar case was to make your own Codec and reduce the chunk
size to something smallish, depending on your average document size… there is a
sweet spot somewhere between compression ratio and speed.
Simply make your own Codec and delegate the stored fields format to something like:
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;

public final class MySmallishChunkStoredFieldFormat extends CompressingStoredFieldsFormat {

  /** Sole constructor. */
  public MySmallishChunkStoredFieldFormat() {
    // TODO: try different chunk sizes; 1 << 12 is 4 KB, maybe go down to 1-2 KB?
    super("YourFormatName", CompressionMode.FAST, 1 << 12);
  }
}
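
The wrapping Codec is then just a FilterCodec around the default one (again a sketch, assuming Lucene 4.x where Lucene42Codec is the current default; the class and format names are made up):

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

public final class MySmallishChunkCodec extends FilterCodec {

  private final StoredFieldsFormat storedFields = new MySmallishChunkStoredFieldFormat();

  public MySmallishChunkCodec() {
    // delegate everything except stored fields to the default codec
    super("MySmallishChunkCodec", new Lucene42Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}

Then set it via IndexWriterConfig.setCodec(new MySmallishChunkCodec()) and register the codec name through SPI (META-INF/services/org.apache.lucene.codecs.Codec) so the index can be opened for reading again.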
On Jun 23, 2013, at 7:40 PM, Uwe Schindler <[email protected]> wrote:
> Hi,
>
> To do this type of processing, use the new DocValues field type. They are
> like FieldCache but persisted to disk. Different datatypes exist and can be
> used to get random access based on document number. They are organized as
> column-stride fields, meaning each column is a separate data structure with
> random access like a big array (persisted on disk).
>
> Stored Fields should *only* ever be used to display search results!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of
>> Mathias Lux
>> Sent: Sunday, June 23, 2013 7:27 PM
>> To: [email protected]
>> Subject: Stored fields: decompression slows down in my scenario ... any idea
>> for a workaround?
>>
>> Hi!
>>
>> I'm managing the development of LIRE
>> (https://code.google.com/p/lire/), an image search toolbox based on Lucene.
>> While optimizing different search routines for global image features, I got
>> around to taking a look at the CPU usage, i.e. to see if my new distance
>> function is faster than the old one :)
>>
>> Unfortunately I found out that the decompression routine for stored fields
>> accounted for nearly 60% of the search time. (see
>> http://www.semanticmetadata.net/?p=1092)
>>
>> So what I basically do is open each document in an index sequentially,
>> check its distance to a query feature and maintain my result list. The
>> image features are in stored fields, as byte[] arrays. I optimized quite a
>> lot to get them really small and fast to parse and store.
>>
>> I know that this is not the way Lucene is intended to be used; I've been
>> working with Lucene for years now :) And just to assure you: approximate
>> indexing and local feature search are based on terms, ... and fast.
>> But linear search makes up an important part of LIRE, so I'd be glad to get
>> some suggestions on how either to disable compression, or how to sneak in
>> byte[] data with some textual data that is "fast as hell" to read.
>>
>> cheers,
>> Mathias
>>
>> ps. I know that it'd be possible to write it to a data file, put it into
>> memory
>> and gain a lot of speed. But of course I'd prefer to maintain "just one"
>> index
>> and not two of them :)
>>
>> --
>> Dr. Mathias Lux
>> Assistant Professor, Klagenfurt University, Austria
>> http://tinyurl.com/mlux-itec
>>