Hi All,

Recently I have been working with the TensorFlow SIG-IO community to introduce Apache Arrow based Datasets for bringing Arrow data into TensorFlow. SIG-IO is a community-maintained repository focused on input/output support for TF; see https://github.com/tensorflow/io (a lot of formats from contrib/ ended up there). Since it is community-driven, participation from anyone interested is highly encouraged!

I'm bringing this up for a couple of reasons. First, I want to make sure that this stays in line with any related efforts within the Arrow project, and I welcome any feedback. Second, the initial response has been great: people are excited about using Arrow and are looking to use it in other areas of TF. However, I've noticed some confusion about how Arrow handles tensor data. Specifically, people often assume that tensors can be part of a RecordBatch and can be readily used in an Arrow stream.

I know we have talked before about making tensors a logical type for columnar data in https://lists.apache.org/thread.html/6cc86d50d92dbd21d6fc34e34485afb3cab4956fbc0d61ff9b99ea27@%3Cdev.arrow.apache.org%3E and there is a JIRA, ARROW-1614 <https://issues.apache.org/jira/browse/ARROW-1614>, but since there is still work needed to fully support the current spec for 1.0, I don't think it has moved forward much. I'm wondering whether now is a better time to start working on this. I think built-in support for tensor columns would really help increase adoption of Arrow in frameworks that use tensor data. What are other people's thoughts?

Best Regards,
Bryan
