[
https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-19582.
-------------------------------
Resolution: Invalid
I don't understand what this is describing. Is it a dependency conflict? If so,
what? You say DataFrameReader understands every possible source, but of course
it doesn't. Nor is it designed to exclude any particular data source.
You can already supply a collection of strings to make a Dataset of strings.
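For example, a minimal sketch of that approach (assuming Spark 2.x; the object
name, app name, and sample contents below are illustrative, not from the issue):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object StringsToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("strings-to-dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The application fetches the object contents itself (e.g. with the
    // AWS/Minio SDK) and hands Spark the data directly, not a path.
    val contents: Seq[String] = Seq("alice,1", "bob,2")
    val ds: Dataset[String] = spark.createDataset(contents)
    ds.show()

    spark.stop()
  }
}
```

From there the strings can be parsed however the application likes; no path
interpretation by Spark or Hadoop is involved.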
Is this about Minio? What does 'forward' mean here?
Rather than replying here, please continue on the mailing list with more
clarity. This isn't clear or specific enough for a JIRA issue.
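On the Minio point specifically: the stock Hadoop S3A connector can usually be
pointed at a non-AWS, S3-compatible endpoint. A sketch using the standard
fs.s3a.* properties, where the endpoint and credentials are placeholders:

```properties
# spark-defaults.conf style sketch; endpoint and credentials are placeholders.
spark.hadoop.fs.s3a.endpoint            http://minio.example.local:9000
spark.hadoop.fs.s3a.access.key          MINIO_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key          MINIO_SECRET_KEY
spark.hadoop.fs.s3a.path.style.access   true
spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
```

With those set, reads use the s3a:// scheme (e.g.
spark.read.text("s3a://bucket/key")) rather than s3://.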
> DataFrameReader conceptually inadequate
> ---------------------------------------
>
> Key: SPARK-19582
> URL: https://issues.apache.org/jira/browse/SPARK-19582
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.1.0
> Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file system,
> object stores, JDBC, ...). This seems limiting in the long term, imposing
> both development costs to support new sources and dependency conflicts for
> existing sources (how to coordinate the XX jar used internally vs. the XX
> jar used by the application). Unless I have missed how this can be done
> currently, an application with an unsupported data source cannot create the
> required RDD for distribution.
> I recommend at least providing a text API for supplying data. Let the
> application provide the data as a String (or char[] or ...): not a path, but
> the actual data. Alternatively, provide interfaces or abstract classes the
> application could implement, letting the application handle external data
> sources without forcing all that complication into the Spark implementation.
> I don't have any code to submit, but JIRA seemed like the most appropriate
> place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a
> new example would be appreciated.
> Additional detail...
> We use the Minio object store, which provides an API compatible with AWS S3.
> A few configuration/parameter values differ for Minio, but one can use the
> AWS library in the application to connect to the Minio server.
> When trying to use Minio objects through Spark, the s3://xxx paths are
> intercepted by Spark and handed to Hadoop. So far, I have been unable to
> find the right combination of configuration values and parameters to
> "convince" Hadoop to forward the right information to work with Minio. If I
> could read the Minio object in the application, and then hand the object
> contents directly to Spark, I could bypass Hadoop and solve the problem.
> Unfortunately, the underlying Spark design prevents that. So, I see two
> problems.
> - Spark seems to have taken on the responsibility of "knowing" the API
> details of all data sources. This seems iffy in the long run (and is the
> root of my current problem). It seems unwise to assume that
> Spark should understand all possible path names, protocols, etc. Moreover,
> passing S3 paths to Hadoop seems a little odd (why not go directly to AWS,
> for example?). This particular confusion about S3 shows the difficulties that
> are bound to occur.
> - Second, Spark appears not to have a way to bypass the path-name
> interpretation. At the least, Spark could provide a text/blob interface,
> letting the application supply the data object and avoid path interpretation
> inside Spark. Alternatively, Spark could accept a reader/stream/... to build
> the object, again letting the application provide the implementation of the
> object input.
> As I mentioned above, I might be missing something in the API that lets us
> work around the problem. I'll keep looking, but the API as apparently
> structured seems too limiting.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]