Hi,

My name is Yevgeni Litvin. I work on ML infrastructure with a small team within Uber ATG. Our team recently open-sourced the Petastorm library. It relies heavily on Apache Arrow, so I wanted to share it with the community.
The goal of the project is to provide a convenient way for the deep learning community to use Apache Parquet stores of sensor data from TensorFlow, PyTorch, or other Python-based ML frameworks. I believe our use of Parquet differs from mainstream applications because our field sizes are asymmetric (some fields, such as images, are huge, while others are small) and row group sizes are relatively small (<100). That required some adaptations.

We use PyArrow mostly for loading the data. We see great potential for further optimizations and speedups by relying more heavily on Arrow as an in-memory store.

You can find more information about our project here:
http://eng.uber.com/petastorm/
https://github.com/uber/petastorm

We would be more than happy to hear comments, feedback, and suggestions!

Thank you,
- Yevgeni Litvin