[ https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-19582.
-------------------------------
    Resolution: Invalid

I don't understand what this is describing. Is it a dependency conflict? If so, what? You say DataFrameReader understands every possible source, but of course it doesn't. Nor is it designed to exclude any particular data source. You can already supply a bunch of strings to make a Dataset of strings. Is this about Minio? What does "forward" mean? Rather than replying here, please continue with more clarity on the mailing list. This isn't clear or specific enough for a JIRA.

> DataFrameReader conceptually inadequate
> ---------------------------------------
>
>                 Key: SPARK-19582
>                 URL: https://issues.apache.org/jira/browse/SPARK-19582
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file
> system, object stores, JDBC, ...). This seems limiting in the long term,
> imposing both development costs to accept new sources and dependency issues
> for existing sources (how to coordinate the XX jar for internal use vs. the
> XX jar used by the application). Unless I have missed how this can be done
> currently, an application with an unsupported data source cannot create the
> required RDD for distribution.
> I recommend at least providing a text API for supplying data. Let the
> application provide the data as a String (or char[], or ...)---not a path,
> but the actual data. Alternatively, provide interfaces or abstract classes
> the application could implement, letting it handle external data sources
> without forcing all that complication into the Spark implementation.
> I don't have any code to submit, but JIRA seemed like the most appropriate
> place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a
> new example would be appreciated.
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS
> S3. A few configuration/parameter values differ for minio, but one can use
> the AWS library in the application to connect to a minio server.
> When trying to use minio objects through Spark, the s3://xxx paths are
> intercepted by Spark and handed to Hadoop. So far, I have been unable to
> find the right combination of configuration values and parameters to
> "convince" Hadoop to forward the right information to work with minio. If
> I could read the minio object in the application and then hand the object
> contents directly to Spark, I could bypass Hadoop and solve the problem.
> Unfortunately, the underlying Spark design prevents that. So, I see two
> problems.
> - First, Spark seems to have taken on the responsibility of "knowing" the
> API details of all data sources. This seems iffy in the long run (and is
> the root of my current problem). It seems unwise to assume that Spark
> should understand all possible path names, protocols, etc. Moreover,
> passing S3 paths to Hadoop seems a little odd (why not go directly to AWS,
> for example?). This particular confusion about S3 shows the difficulties
> that are bound to occur.
> - Second, Spark appears not to have a way to bypass the path-name
> interpretation. At the least, Spark could provide a text/blob interface,
> letting the application supply the data object and avoid path
> interpretation inside Spark. Alternatively, Spark could accept a
> reader/stream/... to build the object, again letting the application
> provide the implementation of the object input.
> As I mentioned above, I might be missing something in the API that lets us
> work around the problem. I'll keep looking, but the API as apparently
> structured seems too limiting.
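On the minio point specifically: Hadoop's s3a connector exposes an endpoint override and path-style access, which is the usual way S3-compatible stores are reached, so the "convince Hadoop" configuration the reporter was hunting for may look roughly like the sketch below. The endpoint, host, and keys are placeholders, and this assumes hadoop-aws 2.7+ is on the classpath.

```
# spark-defaults.conf sketch -- values are placeholders, not a tested setup
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint           http://minio-host:9000
spark.hadoop.fs.s3a.access.key         MINIO_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key         MINIO_SECRET_KEY
spark.hadoop.fs.s3a.path.style.access  true
```

With a configuration like this, objects would be addressed as s3a://bucket/object rather than s3://, keeping the path interpretation inside the s3a connector instead of the application.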
--
This message was sent by Atlassian JIRA (v6.3.15#6346)