[ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085253#comment-17085253
 ] 

Vinoth Govindarajan commented on HUDI-783:
------------------------------------------

Thanks, [~vinoth]!

We don't need to write any wrapper code in Python to use it in pyspark; the 
existing jar files can be packaged and used from pyspark through the data 
source API. 
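To make that concrete, here is a minimal sketch of what the pyspark side looks like, assuming the hudi-spark bundle jar is on the classpath (e.g. launched via `pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2` -- coordinates illustrative, pick the artifact for your Spark/Scala version) and that the datasource is exposed under the format name "hudi". The table name, path, and field names below are hypothetical; the option keys are standard Hudi write configs.

```python
def hudi_write_options(table_name, record_key, precombine_field):
    """Build the option map passed to df.write.format("hudi").

    Only plain Python dicts are involved -- this is why no wrapper
    library is needed on the pyspark side.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }

# Inside a pyspark session the calls would look like (path illustrative):
#   df.write.format("hudi") \
#       .options(**hudi_write_options("demo_table", "uuid", "ts")) \
#       .mode("overwrite").save("/tmp/hudi/demo_table")
#   read_df = spark.read.format("hudi").load("/tmp/hudi/demo_table")
```

Reads come back as ordinary DataFrames, so no special Python-side handling is needed there either.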

To improve the client experience, we need to register the hudi-spark bundle as 
a package on the [https://spark-packages.org/register] website. 
[spark-packages.org|https://spark-packages.org/] is an external, 
community-managed list of third-party libraries, add-ons, and applications that 
work with Apache Spark. 

To register a package, its content must be hosted on 
[GitHub|https://github.com/] in a public repo under the owner's account. I 
guess you can register the package since you are the owner of the repo. 
Thoughts?

> Add official python support to create hudi datasets using pyspark
> -----------------------------------------------------------------
>
>                 Key: HUDI-783
>                 URL: https://issues.apache.org/jira/browse/HUDI-783
>             Project: Apache Hudi (incubating)
>          Issue Type: Wish
>          Components: Utilities
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Major
>              Labels: features
>             Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, through which Hudi files can be read as DataFrames and written to any 
> Hadoop-supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> $SPARK_HOME/bin/spark-shell --packages org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
