[ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998759#comment-14998759 ]
Jeff Zhang commented on SPARK-10388:
------------------------------------
[~mengxr] I talked with [~rams] offline and would love to collaborate with him
on this ticket. I have attached the design; please help review it. Thanks.
> Public dataset loader interface
> -------------------------------
>
> Key: SPARK-10388
> URL: https://issues.apache.org/jira/browse/SPARK-10388
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design,
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.:
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow third-party packages to register new repos. Both the
> API and implementation are pending discussion. Note that this requires HTTP
> and HTTPS support.
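
To make the "register new repos" idea above concrete, here is a minimal sketch of what a registry for third-party repos might look like. Everything in it is an assumption for illustration only: `DatasetRepo`, `DatasetRepoRegistry`, `StaticRepo`, and the placeholder URL are hypothetical names that come neither from the ticket nor from the attached design, and the sketch leaves out Spark and the actual HTTP/HTTPS download entirely.

```scala
import scala.collection.mutable

// Hypothetical interface: a repo lists dataset names and resolves one to a URL.
trait DatasetRepo {
  def name: String
  def list(): Seq[String]
  def urlFor(dataset: String): Option[String]
}

// Hypothetical registry that third-party packages could call to add a repo.
object DatasetRepoRegistry {
  private val repos = mutable.Map.empty[String, DatasetRepo]

  // Registering a repo under an existing name replaces the earlier one.
  def register(repo: DatasetRepo): Unit = synchronized {
    repos(repo.name) = repo
  }

  def lookup(name: String): Option[DatasetRepo] = synchronized {
    repos.get(name)
  }
}

// Toy repo backed by a fixed map of dataset name -> URL.
class StaticRepo(val name: String, entries: Map[String, String]) extends DatasetRepo {
  def list(): Seq[String] = entries.keys.toSeq.sorted
  def urlFor(dataset: String): Option[String] = entries.get(dataset)
}

object Demo {
  def main(args: Array[String]): Unit = {
    // The URL below is a placeholder, not a real dataset location.
    DatasetRepoRegistry.register(
      new StaticRepo("example", Map("toy.binary" -> "https://example.invalid/toy.binary")))

    val repo = DatasetRepoRegistry.lookup("example")
    println(repo.map(_.list()))                   // dataset names in the repo
    println(repo.flatMap(_.urlFor("toy.binary"))) // URL a loader would fetch
  }
}
```

Under this sketch, a call such as `loader.ls("libsvm")` would presumably look up the repo by name and turn `list()` into a local DataFrame, and `loader.get` would fetch `urlFor(...)` over HTTP or HTTPS; how the real API should do this is still open for discussion.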