the-other-tim-brown commented on code in PR #18184:
URL: https://github.com/apache/hudi/pull/18184#discussion_r2805055775
##########
rfc/rfc-99/rfc-99.md:
##########
@@ -210,3 +211,278 @@ SQL Extensions needs to be added to define the table in a
hudi type native way.
TODO: There is an open question regarding the need to maintain type ids to
track schema evolution and how it would interplay with NBCC.
The main implementation change would require replacing the Avro schema
references with the new type system.
+
+
+## Supporting VECTOR type in Hudi
+This section captures additional research and design notes for supporting a
VECTOR logical type in Hudi. See appendix for more details on research sources.
+
+### Initial scope
+
+The initial use case we are targeting for `VECTOR` within Hudi is to enable KNN-style vector search over blobs (large text, images, audio, video) alongside their generated vector embeddings.
+Vector search is popular for Retrieval-Augmented Generation (RAG) applications, which provide relevant context to an LLM to improve its accuracy when answering user queries.
+The vector embeddings generated by frontier models are usually arrays of floating-point values.
+
+### Dense vectors vs sparse vectors
+
+***Dense vector***
+* Has a value for every dimension.
+* Stored as a full length-D sequence: v = [0.12, -0.03, 0.44, ...] (length = D)
+* Even if some entries are 0, you still store them.
+
+***Sparse vector***
+* Most entries are 0 / absent, so you store only the non-zero positions.
+* Stored as pairs (index, value), sometimes also with a separate nnz count,
+  e.g. [(3, 0.44), (107, 1.2), (9012, -0.7)]
+* The “dimension” is still D, but the stored length is nnz (number of
non-zeros), typically nnz << D.
+
+Sparse vectors become important for hybrid/lexical-style retrieval, which is not targeted in the initial scope,
+as it relies on different algorithms (such as TF-IDF or BM25) than the initial KNN-style search use case.
+Hence this RFC separates the two into distinct types, VECTOR (dense) and SPARSE_VECTOR; for now we will focus on the dense VECTOR case.
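The dense/sparse distinction above can be sketched as follows (a minimal illustration with made-up values, not Hudi code):

```python
# Dense vector: every dimension is stored, length == D.
dense = [0.0, 0.0, 0.0, 0.44, 0.0, 1.2, 0.0, -0.7]
D = len(dense)  # the dimension

# Sparse vector: only the non-zero (index, value) pairs are stored.
sparse = [(i, v) for i, v in enumerate(dense) if v != 0.0]
nnz = len(sparse)  # number of non-zeros, typically nnz << D

# The dense form is recoverable from the sparse pairs plus D:
restored = [0.0] * D
for i, v in sparse:
    restored[i] = v
assert restored == dense
```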
+
+
+### Vector Schema constraints
+
+**Logical level requirements:**
+- All values within the VECTOR column must have the same **dimension** (number of elements within the vector), as this is needed to compute cosine similarity, L2 distance, or dot products correctly.
+- There should be no null elements within the vector at write time.
+- VECTOR must have an "element type" which can be one of `FLOAT`, `DOUBLE` or `INT8`.
+- We also want to keep a property such as `storageBacking` which lets writers know how to serialize the vector to disk. For an initial approach we will start with the fixed-bytes approach covered below.
+
+See the following Avro schema model as a general example (note `size` = `dimension` * 4 bytes for `FLOAT` elements, e.g. 768 * 4 = 3072):
+```json
+{
+ "type" : "fixed",
+ "name" : "vector",
+ "size" : 3072,
+ "logicalType" : "vector",
+ "dimension" : 768,
+ "elementType" : "FLOAT",
+ "storageBacking" : "FIXED_BYTES"
+}
+```
+
+
+**Physical level requirements:**
+
+For now we will support a fixed-size packed byte representation for storing vectors on disk, as this yields optimal performance (see the Parquet tuning section below for more details):
+- A FLOAT32 vector of dimension D is stored as exactly `D * 4` bytes (IEEE-754 float32, little-endian).
+- Maps to Parquet `FIXED_LEN_BYTE_ARRAY(D * 4)` with VECTOR metadata.
+- For Lance, vectors are typically represented using Arrow's `FixedSizeList`.
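The fixed-bytes physical layout can be sketched with a stdlib-only round trip (an illustration of the `D * 4` little-endian packing, not Hudi's actual writer code):

```python
import struct

def pack_vector(values):
    """Pack a FLOAT32 vector of dimension D into exactly D * 4
    little-endian IEEE-754 float32 bytes (the FIXED_LEN_BYTE_ARRAY payload)."""
    return struct.pack(f"<{len(values)}f", *values)

def unpack_vector(buf, dim):
    """Inverse: recover the D floats from the packed bytes."""
    return list(struct.unpack(f"<{dim}f", buf))

vec = [0.5, -1.25, 2.0]             # dimension D = 3, values exact in float32
packed = pack_vector(vec)
assert len(packed) == len(vec) * 4  # exactly D * 4 bytes on disk
assert unpack_vector(packed, 3) == vec
```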
+
+### Optimal Parquet tuning for vectors
+Vector data is typically high-cardinality and not dictionary-friendly, so we will disable dictionary encoding and column statistics for vector columns.
+Based on findings from the Parquet community, encodings such as `PLAIN` or `BYTE_STREAM_SPLIT` work well for vectors, and disabling compression yields the best write/read performance.
+
+***Benchmark experiment with vectors***
+
+* The results below are from an experiment writing `10,000` vectors (each vector has dimension 1,536 with element type FLOAT (4 bytes), i.e. around 6KB per record).
+* We performed a full round trip, writing all vectors to a file and then reading them back, using PARQUET's and LANCE's Java file writers/readers.
+* For PARQUET we tried several combinations of physical types, encodings, compressions, etc. to handle vectors.
+* For LANCE we opted for vanilla settings, based on its claim of already handling vectors optimally.
+* We performed 5 warmup rounds and 10 measurement rounds and report the averages below.
+
+
+***Physical backings tested***
+* Parquet LIST: Vectors stored as Parquet's LIST<FLOAT> type (variable-length
array)
+* Parquet FIXED: Vectors stored as Parquet's FIXED_LEN_BYTE_ARRAY (fixed 6,144
bytes for 1,536 floats)
+* Lance: Vectors stored in Lance format using FixedSizeList<Float32>
+
+***Summary of Results***
+```
+Winner (most compact file size): Parquet LIST (byte-stream-split, ZSTD)
+
+Currently Parquet LIST is only a couple of MB more compact than the other
+Parquet FIXED tests.
+
+Performance Winner (Write): Lance
+Performance Winner (Read): Parquet FIXED (byte-stream-split, UNCOMPRESSED)
+
+*Note*: Parquet FIXED and Lance are close in write performance.
+```
+
+***Detailed comparison table***
Review Comment:
Use the table markdown so the tables will render properly
##########
rfc/rfc-99/rfc-99.md:
##########
@@ -210,3 +211,278 @@ SQL Extensions needs to be added to define the table in a
hudi type native way.
TODO: There is an open question regarding the need to maintain type ids to
track schema evolution and how it would interplay with NBCC.
The main implementation change would require replacing the Avro schema
references with the new type system.
+
+
+## Supporting VECTOR type in Hudi
+This section captures additional research and design notes for supporting a
VECTOR logical type in Hudi. See appendix for more details on research sources.
+
+### Initial scope
+
+The initial use case we are targeting for `VECTOR` within Hudi is to enable KNN-style vector search over blobs (large text, images, audio, video) alongside their generated vector embeddings.
+Vector search is popular for Retrieval-Augmented Generation (RAG) applications, which provide relevant context to an LLM to improve its accuracy when answering user queries.
+The vector embeddings generated by frontier models are usually arrays of floating-point values.
+
+### Dense vectors vs sparse vectors
+
+***Dense vector***
+* Has a value for every dimension.
+* Stored as a full length-D sequence: v = [0.12, -0.03, 0.44, ...] (length = D)
+* Even if some entries are 0, you still store them.
+
+***Sparse vector***
+* Most entries are 0 / absent, so you store only the non-zero positions.
+* Stored as pairs (index, value), sometimes also with a separate nnz count,
+  e.g. [(3, 0.44), (107, 1.2), (9012, -0.7)]
+* The “dimension” is still D, but the stored length is nnz (number of
non-zeros), typically nnz << D.
+
+Sparse vectors become important for hybrid/lexical-style retrieval, which is not targeted in the initial scope,
+as it relies on different algorithms (such as TF-IDF or BM25) than the initial KNN-style search use case.
+Hence this RFC separates the two into distinct types, VECTOR (dense) and SPARSE_VECTOR; for now we will focus on the dense VECTOR case.
+
+
+### Vector Schema constraints
+
+**Logical level requirements:**
+- All values within the VECTOR column must have the same **dimension** (number of elements within the vector), as this is needed to compute cosine similarity, L2 distance, or dot products correctly.
+- There should be no null elements within the vector at write time.
+- VECTOR must have an "element type" which can be one of `FLOAT`, `DOUBLE` or `INT8`.
+- We also want to keep a property such as `storageBacking` which lets writers know how to serialize the vector to disk. For an initial approach we will start with the fixed-bytes approach covered below.
+
+See the following Avro schema model as a general example (note `size` = `dimension` * 4 bytes for `FLOAT` elements, e.g. 768 * 4 = 3072):
+```json
+{
+ "type" : "fixed",
+ "name" : "vector",
+ "size" : 3072,
+ "logicalType" : "vector",
+ "dimension" : 768,
+ "elementType" : "FLOAT",
+ "storageBacking" : "FIXED_BYTES"
+}
+```
+
+
+**Physical level requirements:**
+
+For now we will support a fixed-size packed byte representation for storing vectors on disk, as this yields optimal performance (see the Parquet tuning section below for more details):
+- A FLOAT32 vector of dimension D is stored as exactly `D * 4` bytes (IEEE-754 float32, little-endian).
+- Maps to Parquet `FIXED_LEN_BYTE_ARRAY(D * 4)` with VECTOR metadata.
+- For Lance, vectors are typically represented using Arrow's `FixedSizeList`.
+
+### Optimal Parquet tuning for vectors
Review Comment:
@rahil-c can you move the research to another markdown file under rfc-99 and
keep this file limited to the conclusions with a link to the research?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]