Vinoth Govindarajan created HUDI-783:
----------------------------------------

             Summary: Add official python support to create hudi datasets using 
pyspark
                 Key: HUDI-783
                 URL: https://issues.apache.org/jira/browse/HUDI-783
             Project: Apache Hudi (incubating)
          Issue Type: Wish
          Components: Utilities
            Reporter: Vinoth Govindarajan
             Fix For: 0.6.0


*Goal:*
As a pyspark user, I would like to read/write hudi datasets using pyspark.

Achieving this goal involves several components:
 # Create a hudi-pyspark package that users can import to start 
reading/writing hudi datasets.
 # Explain how to read/write hudi datasets using pyspark in a blog 
post/documentation.
 # Add the hudi-pyspark module to the hudi demo docker along with the 
instructions.
 # Make the package available as part of the [spark packages 
index|https://spark-packages.org/] and the [python package 
index|https://pypi.org/].

The hudi-pyspark package should implement the Hudi data source API for Apache 
Spark, so that Hudi files can be read as a DataFrame and written to any 
Hadoop-supported file system.
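
A minimal sketch of what such read/write calls could look like from pyspark, assuming the Hudi Spark bundle is already on the classpath and using the existing `org.apache.hudi` data source format name. The helper names (`hudi_write_options`, `write_hudi`, `read_hudi`) are hypothetical and not part of any released package; the option keys are Hudi's existing datasource write configs. Depending on the Hudi version, the load path may need partition globs appended.

```python
# Hypothetical helpers; only the format name and option keys come from Hudi.
HUDI_FORMAT = "org.apache.hudi"


def hudi_write_options(table_name, record_key, precombine_field,
                       partition_field, operation="upsert"):
    """Build the option dict passed to the Hudi Spark data source on write."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": operation,
    }


def write_hudi(df, base_path, options, mode="append"):
    """Write a Spark DataFrame as a Hudi dataset under base_path."""
    df.write.format(HUDI_FORMAT).options(**options).mode(mode).save(base_path)


def read_hudi(spark, load_path):
    """Read a Hudi dataset back as a Spark DataFrame."""
    return spark.read.format(HUDI_FORMAT).load(load_path)
```

With a live `SparkSession` named `spark` and a DataFrame `df`, usage would be along the lines of `write_hudi(df, base_path, hudi_write_options("trips", "uuid", "ts", "partitionpath"))` followed by `read_hudi(spark, base_path)`; the table and column names here are placeholders.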

Once this feature launches, the usage pattern should look something like this:

Install the package using:
{code:java}
pip install hudi-pyspark{code}
or include the hudi-pyspark package in your Spark applications when launching 
spark-shell, pyspark, or spark-submit:
{code:java}
$SPARK_HOME/bin/spark-shell --packages org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
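
The same launch line could be assembled for any of the three tools as sketched below. This is only an illustration: the `hudi-pyspark_2.11` artifact is the one proposed in this issue and is not published yet, and the helper names are hypothetical.

```python
import os


def hudi_package_coordinate(scala_version="2.11", hudi_version="0.5.2"):
    """Maven coordinate as proposed in this issue (artifact is hypothetical)."""
    return "org.apache.hudi:hudi-pyspark_{}:{}".format(scala_version, hudi_version)


def launch_command(tool="pyspark", spark_home=None):
    """Return the argument list that launches a Spark tool with the package."""
    if tool not in ("spark-shell", "pyspark", "spark-submit"):
        raise ValueError("unsupported tool: {}".format(tool))
    # Fall back to the SPARK_HOME environment variable when not given.
    if spark_home is None:
        spark_home = os.environ.get("SPARK_HOME", "")
    return ["{}/bin/{}".format(spark_home, tool),
            "--packages", hudi_package_coordinate()]


if __name__ == "__main__":
    print(" ".join(launch_command("pyspark", spark_home="/opt/spark")))
```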
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
