Re: Store arrays in DocValues and keep the original order
You're correct that these doc value fields are primarily meant for sorting, as well as some other use cases like faceting. What you've discovered is also correct: these fields don't maintain the original ordering, and SORTED_SET dedupes values ( https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/index/DocValuesType.html ).

There's no technical reason new doc value types couldn't be added that maintain original ordering and don't dedupe, but whether there are enough use cases to justify them is a question that would need to be considered.

+1 to Shai's suggestion to build on BinaryDocValues. By extending BinaryDocValuesField, you can encode the doc values however you like. An example of this can be seen here: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/IntRangeDocValuesField.java

Hope this helps.

Cheers,
-Greg

On Tue, Jun 28, 2022 at 5:52 AM Shai Erera wrote:

> Depending on what you use the field for, you can use BinaryDocValuesField
> which encodes a byte[] and lets you store the data however you want. But
> how are you using these fields later at search time?
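To make "encode the doc values however you like" concrete, here is a minimal sketch of one possible encoding that keeps both element order and duplicates. The class `OrderedArrayCodec` and its length-prefixed layout are assumptions made for illustration; they are not part of Lucene, and the code uses only the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Lucene class): packs an array into a byte[]
// so that the original order and any duplicate elements survive.
final class OrderedArrayCodec {

  // Assumed layout: [count][v0][v1]..., each a 4-byte big-endian int.
  static byte[] encodeInts(int[] values) {
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (values.length + 1));
    buf.putInt(values.length);
    for (int v : values) {
      buf.putInt(v);
    }
    return buf.array();
  }

  static int[] decodeInts(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int[] values = new int[buf.getInt()];
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getInt();
    }
    return values;
  }

  // Assumed layout: [count], then per element [byteLength][UTF-8 bytes].
  static byte[] encodeStrings(String[] values) {
    byte[][] utf8 = new byte[values.length][];
    int size = Integer.BYTES;
    for (int i = 0; i < values.length; i++) {
      utf8[i] = values[i].getBytes(StandardCharsets.UTF_8);
      size += Integer.BYTES + utf8[i].length;
    }
    ByteBuffer buf = ByteBuffer.allocate(size);
    buf.putInt(values.length);
    for (byte[] b : utf8) {
      buf.putInt(b.length);
      buf.put(b);
    }
    return buf.array();
  }

  static String[] decodeStrings(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    String[] values = new String[buf.getInt()];
    for (int i = 0; i < values.length; i++) {
      byte[] b = new byte[buf.getInt()];
      buf.get(b);
      values[i] = new String(b, StandardCharsets.UTF_8);
    }
    return values;
  }
}
```

With Lucene on the classpath, the encoded bytes could be indexed with something like `doc.add(new BinaryDocValuesField("field1", new BytesRef(OrderedArrayCodec.encodeInts(new int[] {3, 1, 1, 2}))))`. When reading back, the `BytesRef` returned by `BinaryDocValues.binaryValue()` carries its own `offset` and `length`, so a real decoder would wrap `ref.bytes` starting at `ref.offset` rather than assuming the array starts at zero.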
Re: Store arrays in DocValues and keep the original order
Depending on what you use the field for, you can use BinaryDocValuesField, which encodes a byte[] and lets you store the data however you want. But how are you using these fields later at search time?

On Tue, Jun 28, 2022 at 3:46 PM linfeng lu wrote:

> We wonder if Lucene has plans to add new types of DocValues that can store
> arrays and keep the original order of elements in the array?
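For the search-time side of the question, here is a sketch of how a subscript lookup might work on such a byte[] without decoding the whole array. It assumes a fixed-width layout (a 4-byte big-endian count followed by 4-byte big-endian ints in original order); the class `IntArraySubscript` and that layout are hypothetical, not something Lucene defines.

```java
import java.nio.ByteBuffer;

// Hypothetical helper (not a Lucene class): reads one element of an
// int array stored as [count][v0][v1]... in 4-byte big-endian ints.
final class IntArraySubscript {

  // 'offset' is where the encoded array starts inside 'bytes', which mirrors
  // how a BytesRef exposes a slice of a larger shared buffer.
  static int elementAt(byte[] bytes, int offset, int index) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    int count = buf.getInt(offset);
    if (index < 0 || index >= count) {
      throw new IndexOutOfBoundsException("index " + index + ", length " + count);
    }
    // Skip the count header, then jump straight to the requested slot.
    return buf.getInt(offset + Integer.BYTES * (index + 1));
  }
}
```

Because every element has the same width, the lookup is a single offset computation. Variable-width values such as strings would need either a sequential scan or an extra offset table in the encoding.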
Store arrays in DocValues and keep the original order
Hi~

We are trying to build an OLAP database based on Lucene, and we rely heavily on Lucene's DocValues (as our column store).

We are trying to use DocValues to store array-typed fields. For example, if we want to store field1 and field2 from this JSON document in DocValues, SORTED_NUMERIC and SORTED_SET respectively seem to be our only options.

  {
    "field1": [ 3, 1, 1, 2 ],
    "field2": [ "c", "a", "a", "b" ]
  }

When we store field1 in SORTED_NUMERIC and field2 in SORTED_SET, we get this result:

field1:
* original: [3, 1, 1, 2]
* in SORTED_NUMERIC: [1, 1, 2, 3]

field2:
* original: ["c", "a", "a", "b"]
* in SORTED_SET: ords [0, 1, 2] terms ["a", "b", "c"]

The original order of the elements in the array is lost.

We're guessing that Lucene's DocValues are designed primarily for sorting and aggregation, so the original order of elements may not matter.

But in our use case, it is important to keep the original order of the elements in the array (we allow users to access array elements with the subscript operator).

We wonder if Lucene has plans to add new types of DocValues that can store arrays and keep the original order of elements in the array?

Thanks!
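The result described above can be reproduced with a small plain-Java model. This is only an illustration of the observable behavior (sorting for SORTED_NUMERIC, sorting plus deduplication for SORTED_SET); the class `DocValuesOrderingModel` is hypothetical and does not call any Lucene API.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical model (not a Lucene class) of what happens to a
// multi-valued field's per-document values in each doc values type.
final class DocValuesOrderingModel {

  // SORTED_NUMERIC keeps duplicates but returns values in sorted order.
  static long[] asSortedNumeric(long[] values) {
    long[] copy = values.clone();
    Arrays.sort(copy);
    return copy;
  }

  // SORTED_SET keeps each distinct term once, in sorted order.
  // Note: Lucene orders terms by their bytes; String ordering matches
  // that for the ASCII values used here.
  static String[] asSortedSetTerms(String[] values) {
    return new TreeSet<>(Arrays.asList(values)).toArray(new String[0]);
  }

  public static void main(String[] args) {
    // prints [1, 1, 2, 3]
    System.out.println(Arrays.toString(asSortedNumeric(new long[] {3, 1, 1, 2})));
    // prints [a, b, c]
    System.out.println(Arrays.toString(asSortedSetTerms(new String[] {"c", "a", "a", "b"})));
  }
}
```

In both cases the position of each element in the source array is not recoverable from the stored values, which is why subscript access cannot be built on top of these two types alone.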