[ https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-19582.
-------------------------------
    Resolution: Invalid

I don't understand what this is describing. Is it a dependency conflict? If so, what? You say DataFrameReader understands every possible source, but of course it doesn't. Nor is it designed to exclude any particular data source. You can already supply a bunch of strings to make a Dataset of strings. Is this about Minio? What does "forward" mean? Rather than replying here, please continue with more clarity on the mailing list. This isn't clear or specific enough for a JIRA.

> DataFrameReader conceptually inadequate
> ---------------------------------------
>
>                 Key: SPARK-19582
>                 URL: https://issues.apache.org/jira/browse/SPARK-19582
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0
>            Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file
> system, object stores, JDBC, ...). This seems limiting in the long term,
> imposing both development costs to accept new sources and dependency issues
> for existing sources (how to coordinate the XX jar for internal use vs. the
> XX jar used by the application). Unless I have missed how this can be done
> currently, an application with an unsupported data source cannot create the
> required RDD for distribution.
> I recommend at least providing a text API for supplying data. Let the
> application provide the data as a String (or char[], or ...)---not a path,
> but the actual data. Alternatively, provide interfaces or abstract classes
> the application could implement, letting it handle external data sources
> without forcing all that complication into the Spark implementation.
> I don't have any code to submit, but JIRA seemed like the most appropriate
> place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a
> new example would be appreciated.
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS
> S3. A few configuration/parameter values differ for minio, but one can use
> the AWS library in the application to connect to a minio server.
> When trying to use minio objects through Spark, the s3://xxx paths are
> intercepted by Spark and handed to Hadoop. So far, I have been unable to
> find the right combination of configuration values and parameters to
> "convince" Hadoop to forward the right information to work with minio. If
> I could read the minio object in the application and then hand the object
> contents directly to Spark, I could bypass Hadoop and solve the problem.
> Unfortunately, the underlying Spark design prevents that. So, I see two
> problems.
> - First, Spark seems to have taken on the responsibility of "knowing" the
> API details of all data sources. This seems iffy in the long run (and is
> the root of my current problem). It seems unwise to assume that Spark
> should understand all possible path names, protocols, etc. Moreover,
> passing S3 paths to Hadoop seems a little odd (why not go directly to AWS,
> for example?). This particular confusion about S3 shows the difficulties
> that are bound to occur.
> - Second, Spark appears not to have a way to bypass the path-name
> interpretation. At the least, Spark could provide a text/blob interface,
> letting the application supply the data object and avoid path
> interpretation inside Spark. Alternatively, Spark could accept a
> reader/stream/... to build the object, again letting the application
> provide the implementation of the object input.
> As I mentioned above, I might be missing something in the API that lets us
> work around the problem. I'll keep looking, but the API as apparently
> structured seems too limiting.
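On the minio point specifically: Hadoop's s3a connector exposes an endpoint override and path-style access, which is the usual way S3-compatible stores are reached, so the "convince Hadoop" configuration the reporter was hunting for may look roughly like the sketch below. The endpoint, host, and keys are placeholders, and this assumes hadoop-aws 2.7+ is on the classpath.

```
# spark-defaults.conf sketch -- values are placeholders, not a tested setup
spark.hadoop.fs.s3a.impl               org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint           http://minio-host:9000
spark.hadoop.fs.s3a.access.key         MINIO_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key         MINIO_SECRET_KEY
spark.hadoop.fs.s3a.path.style.access  true
```

With a configuration like this, objects would be addressed as s3a://bucket/object rather than s3://, keeping the path interpretation inside the s3a connector instead of the application.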
--
This message was sent by Atlassian JIRA (v6.3.15#6346)