the-other-tim-brown commented on code in PR #18184:
URL: https://github.com/apache/hudi/pull/18184#discussion_r2805055775


##########
rfc/rfc-99/rfc-99.md:
##########
@@ -210,3 +211,278 @@ SQL Extensions needs to be added to define the table in a 
hudi type native way.
 TODO: There is an open question regarding the need to maintain type ids to 
track schema evolution and how it would interplay with NBCC. 
 
 The main implementation change would require replacing the Avro schema 
references with the new type system. 
+
+
+## Supporting VECTOR type in Hudi
+This section captures additional research and design notes for supporting a 
VECTOR logical type in Hudi. See appendix for more details on research sources.
+
+### Initial scope
+
+The initial use case we are targeting for `VECTOR` within Hudi is to enable KNN-style vector search over blobs (large text, images, audio, video) alongside their generated vector embeddings.
+Vector search is especially popular for Retrieval-Augmented Generation (RAG) applications, which provide relevant context to an LLM in order to improve its accuracy when answering user queries.
+The vector embeddings generated by frontier models are usually arrays of floating-point values.
+
+### Dense vectors vs sparse vectors
+
+***Dense vector***
+* Has a value for every dimension.
+* Stored as a full length-D sequence: v = [0.12, -0.03, 0.44, ...] (length = D)
+* Even if some entries are 0, you still store them.
+
+***Sparse vector***
+* Most entries are 0 / absent, so you store only the non-zero positions.
+* Stored as pairs (index, value), sometimes also with a separate nnz count:
+* [(3, 0.44), (107, 1.2), (9012, -0.7)]
+* The “dimension” is still D, but the stored length is nnz (number of 
non-zeros), typically nnz << D.
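
The dense/sparse distinction above can be sketched in a few lines of Python (an illustrative example only; the helper name `to_sparse` is hypothetical, not a Hudi API):

```python
# Hypothetical sketch: deriving the sparse (index, value) representation
# from a dense vector, assuming 0.0 marks an absent entry.
def to_sparse(dense):
    """Return the non-zero entries of a dense vector as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v != 0.0]

dense = [0.0, 0.0, 0.0, 0.44, 0.0, 1.2]   # D = 6
sparse = to_sparse(dense)                  # nnz = 2, stored length << D
print(sparse)  # [(3, 0.44), (5, 1.2)]
```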
+
+Sparse vectors become important for hybrid/lexical-style retrieval, which is not targeted in the initial scope, as it relies on different algorithms (such as TF-IDF or BM25) than the initial KNN-style search use case.
+Hence this RFC separates the two into distinct types, VECTOR (dense) and SPARSE_VECTOR; for now we will focus on the dense VECTOR case.
+
+
+### Vector Schema constraints
+
+**Logical level requirements:**
+- All values within the VECTOR column must have the same **dimension**, i.e. the number of elements within the vector, as this is needed to compute cosine/L2/dot-product distances correctly.
+- There should be no null elements within the vector at write time.
+- VECTOR must have an "element type" which can be one of `FLOAT`, `DOUBLE` or `INT8`.
+- We also want to keep a property such as `storageBacking` which lets writers know how to serialize the vector to disk. For an initial approach we will start with a fixed-bytes representation, covered below.
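
To illustrate why a uniform dimension is required, here is a minimal sketch of the distance kernels involved (hypothetical helper names, not Hudi APIs): a dot-product or cosine computation is only well-defined when both vectors share the same dimension.

```python
import math

# Illustrative sketch: distance kernels assume both vectors share the
# same dimension, hence the schema-level dimension constraint.
def dot(a, b):
    if len(a) != len(b):
        raise ValueError("vector dimensions differ: %d vs %d" % (len(a), len(b)))
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```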
+
+See the following Avro schema model as a general example:
+```
+{
+  "type" : "fixed",
+  "name" : "vector",
+  "size" : 3072,
+  "logicalType" : "vector",
+  "dimension" : 768,
+  "elementType" : "FLOAT",
+  "storageBacking" : "FIXED_BYTES"
+}
+```
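
A minimal consistency check for such a schema (a hypothetical validation helper, not part of Hudi): the fixed `size` must equal `dimension` times the element width, which holds for the example above since 768 * 4 = 3072.

```python
# Hypothetical sketch: validate that the Avro fixed `size` matches
# dimension * element width for the declared element type.
ELEMENT_WIDTHS = {"FLOAT": 4, "DOUBLE": 8, "INT8": 1}

def validate_vector_schema(schema):
    expected = schema["dimension"] * ELEMENT_WIDTHS[schema["elementType"]]
    if schema["size"] != expected:
        raise ValueError("size %d != dimension * width = %d"
                         % (schema["size"], expected))

# Matches the example schema: 768 * 4 = 3072, so this passes silently.
validate_vector_schema({"size": 3072, "dimension": 768, "elementType": "FLOAT"})
```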
+
+
+**Physical level requirements:**
+
+For now we will support a fixed-size packed byte representation for storing vectors on disk, as this yields optimal performance (see the Parquet tuning section below for more details):
+- A FLOAT32 vector of dimension D is stored as exactly `D * 4` bytes (IEEE-754 float32, little-endian).
+- Map to Parquet `FIXED_LEN_BYTE_ARRAY(D * 4)` with VECTOR metadata.
+- For Lance, vectors are typically represented using Arrow's `FixedSizeList`
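
The fixed-byte layout above can be sketched with Python's stdlib `struct` (illustrative only; helper names are hypothetical):

```python
import struct

# Pack a FLOAT32 vector of dimension D into exactly D * 4 bytes,
# IEEE-754 float32, little-endian (the "<" prefix), and unpack it back.
def pack_vector(values):
    return struct.pack("<%df" % len(values), *values)

def unpack_vector(buf, dimension):
    return list(struct.unpack("<%df" % dimension, buf))

vec = [0.12, -0.03, 0.44]
buf = pack_vector(vec)
assert len(buf) == len(vec) * 4   # exactly D * 4 bytes
# Note: values round-trip through float32, so tiny precision loss vs float64.
roundtrip = unpack_vector(buf, 3)
```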
+
+### Optimal Parquet tuning for vectors
+Vector data is typically high-cardinality and not dictionary-friendly, so we will disable dictionary encoding and column stats for vector columns.
+Based on findings from the Parquet community, encodings such as `PLAIN` or `BYTE_STREAM_SPLIT`, combined with disabled compression, yield the best write/read performance for vectors.
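
As an illustration only (not part of this RFC's implementation), such writer tuning could look like the following with pyarrow, assuming a recent pyarrow version; the column name `embedding` and output file name are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Configuration sketch: dictionary encoding and column statistics disabled,
# BYTE_STREAM_SPLIT encoding, no compression. For simplicity this uses a
# flat FLOAT column; the same writer properties would be applied to a
# vector column's leaf values.
table = pa.table({"embedding": pa.array([0.12, -0.03, 0.44], type=pa.float32())})
pq.write_table(
    table,
    "vectors.parquet",
    use_dictionary=False,                               # vectors are not dictionary-friendly
    write_statistics=False,                             # column stats add no value here
    compression="NONE",                                 # skip compression for throughput
    column_encoding={"embedding": "BYTE_STREAM_SPLIT"}, # float-friendly encoding
)
```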
+
+***Benchmark experiment with vectors***
+
+* The results below were from an experiment writing `10,000` vectors (each with dimension 1,536 and element type FLOAT (4 bytes), around 6KB per record).
+* We performed a full round trip, writing all vectors to a file and reading them back, using PARQUET's and LANCE's Java file writers/readers.
+* For PARQUET we tried several combinations of types, encodings, compression settings, etc. for handling vectors.
+* For LANCE we opted for vanilla settings, based on its claims of already handling vectors optimally.
+* We performed 5 warmup rounds and 10 measurement rounds and collected the averages below.
+
+
+***Physical backings tested***
+* Parquet LIST: Vectors stored as Parquet's LIST<FLOAT> type (variable-length 
array)
+* Parquet FIXED: Vectors stored as Parquet's FIXED_LEN_BYTE_ARRAY (fixed 6,144 
bytes for 1,536 floats)
+* Lance: Vectors stored in Lance format using FixedSizeList<Float32>
+
+***Summary of Results***
+```
+Winner (most compact file size): Parquet LIST (byte-stream-split, ZSTD)
+
+Currently parquet list is only a couple of MB more compact than the other parquet fixed tests.
+
+Performance Winner (Write): Lance
+Performance Winner (Read):  Parquet FIXED (byte-stream-split, UNCOMPRESSED)
+
+*Note* Parquet FIXED and Lance are close in write perf
+```
+
+***Detailed comparison table***

Review Comment:
   Use the table markdown so the tables will render properly



##########
rfc/rfc-99/rfc-99.md:
##########
@@ -210,3 +211,278 @@ SQL Extensions needs to be added to define the table in a 
hudi type native way.
+### Optimal Parquet tuning for vectors

Review Comment:
   @rahil-c can you move the research to another markdown file under rfc-99 and 
keep this file limited to the conclusions with a link to the research?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to