tensorflow-io Arrow Datasets and thoughts on support for tensor columns

Bryan Cutler Fri, 22 Mar 2019 11:24:41 -0700

Hi All,

Recently I have been working with the TensorFlow SIG-IO community to
introduce Apache Arrow-based Datasets for bringing Arrow data into
TensorFlow. SIG-IO is a community-maintained repository focused on
input/output support for TF; see https://github.com/tensorflow/io (a lot of
formats from contrib/ ended up here). Since it is community-driven,
participation is highly encouraged if anyone is interested!


I'm bringing this up for a couple of reasons. First, I want to make sure
that this stays in line with any related efforts within the Arrow project,
and I welcome any feedback. Second, the initial response has been great and
people are excited about using Arrow and looking to use it in other areas
of TF, but I've noticed some confusion about how Arrow handles tensor data.
Specifically, people often assume that tensors can be part of a RecordBatch
and can be readily used in an Arrow stream.

I know we have talked about making tensors a logical type for columnar data
before in
https://lists.apache.org/thread.html/6cc86d50d92dbd21d6fc34e34485afb3cab4956fbc0d61ff9b99ea27@%3Cdev.arrow.apache.org%3E
and there is a JIRA issue, ARROW-1614
<https://issues.apache.org/jira/browse/ARROW-1614>, but since there is work
needed to fully support the current spec for 1.0, I don't think it has
moved forward much. I'm wondering if maybe now is a better time to start
working on this?  I think having built-in support for tensor columns would
really help to increase adoption of Arrow in frameworks that use tensor
data. What are other people's thoughts?

Best Regards,
Bryan
