Re: Using hudi with pyspark

Rodrigo Dominguez Wed, 11 Sep 2019 01:56:16 -0700

Hi Kabeer

I was able to build a simple script in Python and submit it with:


spark-submit \
  --jars $HUDI_SRC/packaging/hoodie-spark-bundle/target/hoodie-spark-bundle-0.4.7.jar \
  --packages com.databricks:spark-avro_2.11:4.0.0 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  ./test.py
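
For reference, a minimal sketch of what a test.py along these lines could look like. It is not the actual script: the table name, column names, sample rows and the output-path argument are all hypothetical, and the "com.uber.hoodie" format name is an assumption for the 0.4.7 bundle (later Apache releases use "org.apache.hudi"). The option keys are the standard Hudi datasource write options.

```python
# test.py: hypothetical minimal upsert sketch; all names and paths are illustrative.
import sys


def hudi_upsert_options(table_name, record_key, partition_path, precombine):
    """Build the datasource write options for an upsert into a Hudi dataset."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        # When two incoming rows share a key, the one with the larger value wins.
        "hoodie.datasource.write.precombine.field": precombine,
    }


def main(base_path):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

    # Made-up sample rows: (record key, partition, ordering field, payload).
    rows = [
        ("id-1", "2019-09-10", 1, "first"),
        ("id-2", "2019-09-10", 1, "second"),
    ]
    df = spark.createDataFrame(rows, ["uuid", "day", "ts", "value"])

    options = hudi_upsert_options("test_table", "uuid", "day", "ts")
    (
        df.write.format("com.uber.hoodie")  # assumed format name for 0.4.7
        .options(**options)
        .mode("append")  # append mode lets Hudi apply the upsert to existing data
        .save(base_path)
    )
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: test.py <base_path>")
    else:
        main(sys.argv[1])
```

The output path is taken as an argument in this sketch, so the submit command would end with ./test.py followed by a target directory. Submitting it a second time with changed values for the same uuid should update those records rather than duplicate them.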

Yes, the idea is to use upsert. I’ll take a look at the project.

Thank you

Rodrigo Dominguez
www.rorra.com.ar


> On Sep 10, 2019, at 10:03 PM, Kabeer Ahmed <[email protected]> wrote:
> 
> Hi Rodrigo,
> 
> Welcome to the HUDI users group. The entire Hudi code base is Java and Scala 
> based, but nothing stops you from using it through Python (pyspark). You 
> should be able to copy all the packaging jars into your Spark installation 
> and use them. Please note, however, that you wouldn't be able to define your 
> own CombineAndUpdate logic (as far as I know). For example, if you wanted to 
> write your own logic to compare the records being ingested with the ones 
> already persisted, I am not aware of a way to write it when using PySpark.
> If you only want to use Python to run upsert use cases with HUDI, then I 
> would highly recommend that you look into the Metorikku project at: 
> https://github.com/YotpoLtd/metorikku
> The project does quite a lot without writing any code at all. It is based on 
> HUDI.
> If you are still after a Python example, then I can try to write one and 
> share it with you.
> Hope this helps,
> Kabeer.
> 
> On Sep 10 2019, at 4:07 pm, Rodrigo Dominguez <[email protected]> wrote:
>> I’m new to Hudi, and I’m wondering whether I can use it with python 
>> (pyspark) and the way to use it.
>> 
>> I was able to download the source code, compile the project, run the Scala 
>> and java samples, but didn’t see any single Python source code and I’m 
>> wondering whether this is possible.
>> Thank you
>> Rodrigo Dominguez
>> [email protected]
>> 
>
