rahil-c opened a new issue, #553: URL: https://github.com/apache/parquet-format/issues/553
### Describe the usage question you have. Please include as many useful details as possible.

Hi Parquet community,

I'm an engineer who works on open source data infrastructure, primarily in the Java ecosystem. I'd like to ask the community how best to configure and tune Parquet today for storing and retrieving vectors. I'm also interested in future work in this area, but for now I'm mainly curious about what users can tweak.

From what I've seen, most models generate vector embeddings as arrays of floating point values with around 700-1500 dimensions (roughly 3 KB - 6 KB per vector at 4 bytes per element), so my questions are based on this input.

* https://developers.openai.com/api/docs/guides/embeddings
* https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings

1. Since there is currently no native logical `VECTOR` type in Parquet, which of the existing types is recommended for writing vectors? My assumption is that most users today would use Parquet's `LIST` with `FLOAT` elements, but are there better ways to represent this? Are there plans for Parquet to add such a type in the future?
2. Since vectors are high cardinality, encodings such as `DICTIONARY` and `RLE` might not be as useful. Is there a recommended encoding to use instead?
3. Is there a recommended compression codec for vectors? If not, is there a way to disable compression per column in parquet-java?
4. Should users disable statistics on these columns?
5. Is there a recommendation for tuning row group and page size for vectors?

Thanks again for your help. If there is a roadmap you can point me to for changes happening in Parquet around this, that would be highly appreciated.

cc @julienledem @emkornfield

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
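As a rough sanity check on the sizes quoted above, the sketch below works through the arithmetic for 4-byte `FLOAT` elements. The 1 MiB page and 128 MiB row group budgets are assumptions chosen for illustration, not values stated in this issue, and the function names are hypothetical. It ignores repetition/definition levels, page headers, and compression.

```python
FLOAT32_BYTES = 4  # one Parquet FLOAT value (IEEE 754 single precision)

# Assumed budgets, for illustration only; check your writer's actual settings.
ASSUMED_PAGE_BYTES = 1 * 1024 * 1024         # 1 MiB data page
ASSUMED_ROW_GROUP_BYTES = 128 * 1024 * 1024  # 128 MiB row group


def vector_bytes(dimensions: int) -> int:
    """Raw payload of one float32 vector, ignoring levels and headers."""
    return dimensions * FLOAT32_BYTES


def vectors_per(budget_bytes: int, dimensions: int) -> int:
    """Number of whole uncompressed vectors that fit in a byte budget."""
    return budget_bytes // vector_bytes(dimensions)


if __name__ == "__main__":
    for dims in (700, 1500):
        print(
            f"{dims} dims: {vector_bytes(dims)} bytes/vector, "
            f"{vectors_per(ASSUMED_PAGE_BYTES, dims)} vectors/page, "
            f"{vectors_per(ASSUMED_ROW_GROUP_BYTES, dims)} vectors/row group"
        )
```

Under these assumptions only a few hundred vectors fit in a single page, which is the motivation behind the row group and page size question.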
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
