Uwe,
I think Mathias was talking about the case with many smallish fields that all
get read per document. The DV approach would mean seeking N times, while stored
fields need only one seek? Or did you mean he should encode all his fields into
a single byte[]?
Or did I get it all wrong about stored vs DV :)
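(For reference, roughly what the single-byte[]-per-document DV variant could look like; just a sketch against the 4.x DocValues API, the "feature" field name and the surrounding variables are made up:)

import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.util.BytesRef;

// given: IndexReader reader, byte[] featureBytes

// indexing: pack the whole feature into one binary DV field per document
Document doc = new Document();
doc.add(new BinaryDocValuesField("feature", new BytesRef(featureBytes)));

// linear scan at search time: per-segment random access, no stored-field decompression
for (AtomicReaderContext ctx : reader.leaves()) {
  BinaryDocValues dv = ctx.reader().getBinaryDocValues("feature"); // null if the field has no DV
  BytesRef scratch = new BytesRef();
  for (int docID = 0; docID < ctx.reader().maxDoc(); docID++) {
    dv.get(docID, scratch);        // scratch now points at the feature bytes
    // ... compute distance to the query feature here ...
  }
}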
What helped a lot in a similar case was to make your own Codec and reduce the chunk
size to something smallish, depending on your average document size… there is a
sweet spot somewhere between compression ratio and speed.
Simply make your own Codec and delegate the stored fields format to something like:
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;

public final class MySmallishChunkStoredFieldFormat extends CompressingStoredFieldsFormat {

  /** Sole constructor. */
  public MySmallishChunkStoredFieldFormat() {
    // TODO: try different chunk sizes; 1 << 12 is 4 KB, maybe go down to 1-2 KB?
    super("YourFormatName", CompressionMode.FAST, 1 << 12);
  }
}
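
The wrapping Codec is then just a FilterCodec around the default one (again a sketch, assuming Lucene 4.x where Lucene42Codec is the current default; the class and format names are made up):

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

public final class MySmallishChunkCodec extends FilterCodec {

  private final StoredFieldsFormat storedFields = new MySmallishChunkStoredFieldFormat();

  public MySmallishChunkCodec() {
    // delegate everything except stored fields to the default codec
    super("MySmallishChunkCodec", new Lucene42Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}

Then set it via IndexWriterConfig.setCodec(new MySmallishChunkCodec()) and register the codec name through SPI (META-INF/services/org.apache.lucene.codecs.Codec) so the index can be opened for reading again.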
On Jun 23, 2013, at 7:40 PM, Uwe Schindler <[email protected]> wrote:
> Hi,
>
> To do this type of processing, use the new DocValues field type. They are
> like FieldCache but persisted to disk. Different datatypes exist and can be
> used to get random access based on document number. They are organized as
> column-stride fields, meaning each column is a separate data structure with
> random access like a big array (persisted on disk).
>
> Stored Fields should *only* ever be used to display search results!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of
>> Mathias Lux
>> Sent: Sunday, June 23, 2013 7:27 PM
>> To: [email protected]
>> Subject: Stored fields: decompression slows down in my scenario ... any idea
>> for a workaround?
>>
>> Hi!
>>
>> I'm managing the development of LIRE
>> (https://code.google.com/p/lire/), an image search toolbox based on Lucene.
>> While optimizing different search routines for global image features, I got
>> around to taking a look at the CPU usage, i.e. to see if my new distance
>> function is faster than the old one :)
>>
>> Unfortunately I found out that the decompression routine for stored fields
>> accounted for nearly 60% of the search time. (see
>> http://www.semanticmetadata.net/?p=1092)
>>
>> So what I basically do is open each document in an index sequentially,
>> check its distance to a query feature and maintain my result list. The
>> image features are in stored fields, as byte[] arrays. I optimized quite a
>> lot to get them really small and fast to parse and store.
>>
>> I know that this is not the way Lucene is intended to be used; I've been
>> working with Lucene for years now :) And just to assure you: approximate
>> indexing and local feature search are based on terms, ... and fast.
>> But linear search makes up an important part of LIRE, so I'd be glad to get
>> some suggestions on how either to disable compression, or how to sneak in
>> byte[] data with some textual data that is "fast as hell" to read.
>>
>> cheers,
>> Mathias
>>
>> ps. I know that it'd be possible to write it to a data file, put it into
>> memory
>> and gain a lot of speed. But of course I'd prefer to maintain "just one"
>> index
>> and not two of them :)
>>
>> --
>> Dr. Mathias Lux
>> Assistant Professor, Klagenfurt University, Austria
>> http://tinyurl.com/mlux-itec
>>