[jira] [Commented] (HUDI-783) Add official python support to create hudi datasets using pyspark

Vinoth Chandar (Jira) Fri, 10 Apr 2020 07:34:28 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080527#comment-17080527
 ]


Vinoth Chandar commented on HUDI-783:
-------------------------------------

This is great stuff. Looking forward to this. We could probably use this as a 
parent task and create subtasks underneath it for each of the things you 
mention?

> Add official python support to create hudi datasets using pyspark
> -----------------------------------------------------------------
>
>                 Key: HUDI-783
>                 URL: https://issues.apache.org/jira/browse/HUDI-783
>             Project: Apache Hudi (incubating)
>          Issue Type: Wish
>          Components: Utilities
>            Reporter: Vinoth Govindarajan
>            Priority: Major
>              Labels: features
>             Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> The hudi-pyspark package should implement the Hudi data source API for Apache 
> Spark, so that Hudi files can be read as a DataFrame and written to any 
> Hadoop-supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:bash}
> pip install hudi-pyspark{code}
> or
> Include the hudi-pyspark package in your Spark applications when launching 
> spark-shell, pyspark, or spark-submit:
> {code:bash}
> $SPARK_HOME/bin/spark-shell --packages org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
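For comparison, here is a minimal Python sketch of what reading and writing through the existing Hudi Spark datasource can look like from pyspark, without the proposed hudi-pyspark package. The helper names (hudi_write_options, write_hudi, read_hudi) are hypothetical and not part of Hudi. The option keys and the "org.apache.hudi" format name belong to the datasource itself, but they should be checked against the Hudi release in use. The sketch assumes the Hudi Spark bundle is already on the Spark classpath.

```python
def hudi_write_options(table_name, record_key, partition_path, precombine,
                       operation="upsert"):
    """Build the datasource options for a Hudi write (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.operation": operation,
    }


def write_hudi(df, base_path, options, mode="append"):
    """Write a Spark DataFrame to base_path via the Hudi datasource.

    Assumes df is a pyspark.sql.DataFrame and that the Hudi bundle jar
    is available to the Spark session.
    """
    df.write.format("org.apache.hudi").options(**options).mode(mode).save(base_path)


def read_hudi(spark, load_path):
    """Load a Hudi dataset as a DataFrame.

    Assumes spark is a pyspark.sql.SparkSession; the caller supplies the
    exact load path expected by the Hudi release in use.
    """
    return spark.read.format("org.apache.hudi").load(load_path)
```

A call such as write_hudi(df, "/tmp/hudi/trips", hudi_write_options("trips", "uuid", "partitionpath", "ts")) would upsert df into the dataset; the table name, field names, and path are placeholders. Some Hudi releases may expect a glob under the base path when loading, so read_hudi leaves the path entirely to the caller.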



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
