[ https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080527#comment-17080527 ]
Vinoth Chandar commented on HUDI-783:
-------------------------------------
This is great stuff. Looking forward to this. We could probably use this as a
parent task and create subtasks underneath it, one for each of the things you
mention?
> Add official python support to create hudi datasets using pyspark
> -----------------------------------------------------------------
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi (incubating)
> Issue Type: Wish
> Components: Utilities
> Reporter: Vinoth Govindarajan
> Priority: Major
> Labels: features
> Fix For: 0.6.0
>
>
> *Goal:*
> As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
> # Create a hudi-pyspark package that users can import and start
> reading/writing hudi datasets.
> # Explain how to read/write hudi datasets using pyspark in a blog
> post/documentation.
> # Add the hudi-pyspark module to the hudi demo docker along with the
> instructions.
> # Make the package available as part of the [spark packages
> index|https://spark-packages.org/] and [python package
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache
> Spark, so that Hudi files can be read as a DataFrame and written to any
> Hadoop-supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or include the hudi-pyspark package in your Spark applications when launching
> spark-shell, pyspark, or spark-submit:
> {code:java}
> $SPARK_HOME/bin/spark-shell --packages org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>
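To make the read/write goal in the quoted description concrete, here is a rough sketch of what it could look like from pyspark through the existing Spark data source, without the proposed hudi-pyspark package. The helper names (hudi_write_options, write_hudi, read_hudi) are made up for illustration and are not part of Hudi. The sketch assumes a Hudi Spark bundle jar is already on the Spark classpath and that the data source resolves under the format name "org.apache.hudi".

```python
# Format name assumed to resolve to Hudi's Spark data source.
HUDI_FORMAT = "org.apache.hudi"


def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert"):
    """Build the option map for a Hudi data source write (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


def write_hudi(df, base_path, options, mode="append"):
    """Write a pyspark DataFrame to base_path as a Hudi dataset."""
    df.write.format(HUDI_FORMAT).options(**options).mode(mode).save(base_path)


def read_hudi(spark, load_path):
    """Read a Hudi dataset back as a DataFrame.

    load_path is whatever path the installed Hudi release expects; some
    releases expect a glob below the base path rather than the base path itself.
    """
    return spark.read.format(HUDI_FORMAT).load(load_path)
```

With a live SparkSession this would be used roughly as write_hudi(df, base_path, hudi_write_options("trips", "uuid", "partitionpath", "ts")), where the table and column names are placeholders for whatever the DataFrame actually contains.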
--
This message was sent by Atlassian Jira
(v8.3.4#803005)