Timothy Hunter created SPARK-22666:
--------------------------------------
Summary: Spark reader source for image format
Key: SPARK-22666
URL: https://issues.apache.org/jira/browse/SPARK-22666
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.3.0
Reporter: Timothy Hunter
The current API for the new image format is implemented as a standalone
feature so that it can reside within the {{mllib}} package. As discussed in
SPARK-21866, users should be able to load images through the more common Spark
source reader interface.
This ticket covers adding image reading support to the Spark source API,
through either of the following interfaces:
- {{spark.read.format("image")...}}
- {{spark.read.image....}}
The output is a DataFrame that contains the images (along with, for example,
the file names), following the semantics already discussed in SPARK-21866.
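To make the first proposed interface concrete, here is a minimal sketch of how a caller might use it. The helper name {{load_images}} is hypothetical, the {{"image"}} format name is the one proposed in this ticket rather than a confirmed API, and {{spark}} is assumed to be an existing SparkSession passed in by the caller.

```python
def load_images(spark, path):
    """Load the images under `path` through the generic reader interface.

    Assumptions: `spark` is a SparkSession created elsewhere, and "image" is
    the format name proposed in this ticket (not a confirmed API here).
    """
    # Equivalent to the first proposed form: spark.read.format("image")...
    return spark.read.format("image").load(path)
```

With a real session this would be called as {{load_images(spark, "/path/to/images")}}. The columns of the result would follow whatever schema SPARK-21866 settles on, so the sketch does not select any fields.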
A few technical notes:
* Since the functionality is implemented in {{mllib}}, calling this function
may fail at runtime if users have not included the {{spark-mllib}} dependency.
* How should very flat directories be handled? It is common to have millions of
files in a single "directory" (as in S3), which seems to have caused issues
for some users. If this is too complex to handle in this ticket, it can be
dealt with separately.
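As an illustration of one way to cope with a very flat directory, the sketch below lists files lazily and hands them out in fixed-size batches instead of building one list of millions of paths. It is only an assumption about a possible approach, not how Spark lists files: it covers the local filesystem only (an object store such as S3 would need its own paginated listing), and the name {{iter_file_batches}} and the default batch size are hypothetical.

```python
import os
from itertools import islice


def iter_file_batches(directory, batch_size=10000):
    """Yield lists of file paths from `directory`, at most `batch_size` each.

    Hypothetical helper: entries are read lazily with os.scandir, so only one
    batch of paths is held in memory at a time.
    """
    with os.scandir(directory) as entries:
        # Generator over regular files only; subdirectories are skipped.
        files = (entry.path for entry in entries if entry.is_file())
        while True:
            batch = list(islice(files, batch_size))
            if not batch:
                return
            yield batch
```

Each batch could then be passed to a separate read call or task. The order of paths within and across batches is whatever the operating system returns and should not be relied upon.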