Mathew Wicks created SPARK-20353:
------------------------------------
Summary: Implement Tensorflow TFRecords file format
Key: SPARK-20353
URL: https://issues.apache.org/jira/browse/SPARK-20353
Project: Spark
Issue Type: Improvement
Components: Input/Output, SQL
Affects Versions: 2.1.0
Reporter: Mathew Wicks
Spark is a very good preprocessing engine for tools like Tensorflow. However,
it lacks native support for Tensorflow's core file format, TFRecords.
There is a project that implements this functionality as an external JAR, but
it is neither user friendly nor robust enough for production use:
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector
There is some discussion of that project here:
https://github.com/tensorflow/ecosystem/issues/32
If we were to implement "tfrecords" as a DataFrame readable/writable format,
we would have to account for the various data types that can be present in
Spark columns, and decide which of them are actually useful in Tensorflow.
Note: The `spark-tensorflow-connector` described above does not properly
support the vector data type.
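To make the data type question concrete, here is a rough sketch of one possible
mapping. A `tf.train.Feature` holds exactly one of three list kinds (Int64List,
FloatList, BytesList), so every Spark column type would have to land in one of
them or be rejected. The function name `feature_kind`, the grouping of types,
and the decision to reject everything else are assumptions made for
illustration only; none of this is taken from Spark or from
`spark-tensorflow-connector`.

```python
# Hypothetical sketch: not part of Spark or spark-tensorflow-connector.
# The three list kinds that a tf.train.Feature can hold.
INT64_LIST = "Int64List"
FLOAT_LIST = "FloatList"
BYTES_LIST = "BytesList"

# Keys are Spark SQL simpleString type names. The grouping is an assumption.
_SCALAR_KINDS = {
    "boolean": INT64_LIST,
    "tinyint": INT64_LIST,
    "smallint": INT64_LIST,
    "int": INT64_LIST,
    "bigint": INT64_LIST,
    "float": FLOAT_LIST,
    "double": FLOAT_LIST,  # FloatList stores 32-bit floats, so precision is lost
    "string": BYTES_LIST,
    "binary": BYTES_LIST,
}


def feature_kind(spark_type: str) -> str:
    """Return the feature list kind for a Spark SQL type name.

    Accepts scalar names such as "bigint" and flat arrays such as
    "array<double>". Anything else raises ValueError, which marks the
    types a real implementation would still have to decide on.
    """
    name = spark_type.strip().lower()
    if name.startswith("array<") and name.endswith(">"):
        # Unwrap one level of array; nested arrays are not handled here.
        name = name[len("array<"):-1].strip()
    if name in _SCALAR_KINDS:
        return _SCALAR_KINDS[name]
    raise ValueError(f"no TFRecord feature mapping for Spark type: {spark_type}")
```

For example, `feature_kind("array<double>")` gives "FloatList", while a map,
struct, decimal or timestamp column raises an error. Those rejected types are
the ones that would need an explicit decision.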
Further discussion of whether this is within the scope of Spark SQL is strongly
welcomed.