Hello Yevgeni, this looks interesting. Can you make a PR to https://github.com/apache/arrow so that Petastorm is listed on https://arrow.apache.org/powered_by/ ?
I browsed a bit through your code. As far as I can see, your approach is to store a set of Parquet files in a directory with a schema that can be translated for Spark, Tensorflow, Torch, … Is this schema persisted in the Parquet file metadata or as a separate file alongside the dataset? Could we extend Arrow's type system a bit to better suit all the frameworks you are targeting? As you had to build a more general schema class, I guess there are things that could not be expressed in Arrow's schema definition. I am not sure whether we could extend pyarrow's schema classes to fully support your use case, but I would like to understand how to support it better. (A small sketch of the kind of schema-in-footer approach I have in mind is at the bottom of this mail.)

Uwe

On Wed, Sep 26, 2018, at 8:59 PM, Yevgeni Litvin wrote:
> Hi,
>
> My name is Yevgeni Litvin. I am working on ML infra with a small team
> within Uber ATG. Our team has recently open sourced the Petastorm library.
> It heavily relies on Apache Arrow, so I wanted to share it with the
> community.
>
> The goal of the project is to provide a convenient way for the deep
> learning community to use an Apache Parquet store with sensor data from
> Tensorflow, PyTorch or other Python-based ML frameworks.
>
> I believe our use of Parquet is different from mainstream applications,
> as our field sizes are asymmetric (some are huge, such as images, and
> others are small) and row group sizes are relatively small (<100 rows).
> That required some adaptations.
>
> We use PyArrow mostly for loading the data. We see great potential for
> further optimizations and speedups by relying more heavily on Arrow as an
> in-memory store.
>
> You can find more information about our project here:
>
> http://eng.uber.com/petastorm/
> https://github.com/uber/petastorm
>
> We would be more than happy to hear comments, feedback and suggestions!
>
> Thank you,
>
> - Yevgeni Litvin
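
P.S. To make my schema question above more concrete: here is a minimal sketch of one possible way to persist framework-level hints alongside the column types, by attaching key-value metadata to a pyarrow schema so it round-trips through the Parquet footer. This is only an illustration of the idea, not a claim about how Petastorm actually stores its schema, and the "petastorm.codec.image" key is made up for the example.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Describe the dataset with an Arrow schema and attach framework-specific
    # hints as key-value metadata (the key below is hypothetical).
    schema = pa.schema(
        [pa.field("image", pa.binary()), pa.field("label", pa.int64())],
        metadata={b"petastorm.codec.image": b"png"},
    )

    # Build a tiny table that conforms to the schema.
    table = pa.table({"image": [b"..."], "label": [1]}, schema=schema)

    # The schema metadata is written into the Parquet footer and survives a
    # round trip, so a Spark/Tensorflow/Torch reader could recover the hints.
    pq.write_table(table, "example.parquet")
    print(pq.read_table("example.parquet").schema.metadata)

Anything that cannot be expressed this way (per-field codecs, tensor shapes, etc.) would be exactly the kind of gap I would like to understand.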