Petastorm: PyArrow based library for Tensorflow, PyTorch and others...

Yevgeni Litvin Wed, 26 Sep 2018 12:00:22 -0700

Hi,

My name is Yevgeni Litvin. I work on ML infra with a small team
within Uber ATG. Our team recently open sourced the Petastorm library. It
relies heavily on Apache Arrow, so I wanted to share it with the community.


The goal of the project is to provide a convenient way for the deep learning
community to access sensor data kept in an Apache Parquet store from
Tensorflow, PyTorch or other Python based ML frameworks.

I believe our use of Parquet differs from mainstream applications: our
field sizes are asymmetric (some, such as images, are huge while others
are small) and our rowgroup sizes are relatively small (<100). That required
some adaptations.

We use PyArrow mostly for loading the data. We do see great potential for
further optimizations and speedups by relying more heavily on Arrow as
an in-memory store.

You can find more information about our project here:

http://eng.uber.com/petastorm/
https://github.com/uber/petastorm

I would be more than happy to hear comments, feedback and suggestions!

Thank you,

- Yevgeni Litvin
