Hi Udit,

Thanks for your recommendation. I was able to get the jars for 0.5.1. As a test, we ran Hudi against a small dataset (~2 million rows with 80 columns) in a Parquet file on 10 executors (m5.xlarge). The initial load alone is taking 2+ hours. Do you have any suggestions on settings I can update to speed up the process?

Thanks,
Syed Zaidi

________________________________
From: Mehrotra, Udit <[email protected]>
Sent: Tuesday, March 17, 2020 8:08 PM
To: [email protected] <[email protected]>; Syed Zaidi <[email protected]>
Subject: Re: Question

Hi Zaidi,

You should be able to use Hudi 0.5.1 in the next EMR release, which should be out fairly soon, but we can't give you an ETA. Meanwhile, nothing is stopping you from building your own Hudi 0.5.1 jars and replacing the ones on the EMR cluster. The jars are located on the master node at /usr/lib/hudi/. Just replace the 0.5.0 jars there and have the symlink jars point to your 0.5.1 jars.

Thanks,
Udit Mehrotra
SDE | AWS EMR

On 3/17/20, 5:34 PM, "Syed Zaidi" <[email protected]> wrote:

    Hi,

    AWS EMR emr-5.29.0 comes with Spark 2.4.4 and Hudi 0.5.0 (hudi-hadoop-mr-bundle-0.5.0-incubating.jar). Version 0.5.1 adds new options for reading AWS DMS change logs using DeltaStreamer. Do you have any idea when AWS will support the newer version of Hudi? What options do I have to upgrade Hudi to the latest version while creating the EMR cluster, so that the AWS DMS payload is supported out of the box?

    I would appreciate your feedback in this regard.

    Thanks,
    Syed Zaidi
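
The jar swap described in Udit's reply (replace the 0.5.0 jars under /usr/lib/hudi/ and repoint the symlinks) could be scripted roughly as below. This is only a sketch under assumptions not confirmed in the thread: that the directory holds version-less symlinks pointing at versioned jars, and that the locally built jars use the same file names with only the version string changed. The function name `repoint_hudi_symlinks` and its parameters are hypothetical, and on a real cluster it would likely need to run with root privileges on the master node.

```python
import os
import shutil
from pathlib import Path


def repoint_hudi_symlinks(hudi_dir, new_jars_dir, old_version, new_version):
    """Copy new jars into hudi_dir and repoint matching symlinks at them.

    Hypothetical helper: assumes each symlink target contains old_version
    and that a jar with new_version substituted exists after the copy.
    Returns the names of the symlinks that were repointed.
    """
    hudi_dir = Path(hudi_dir)

    # Copy the locally built jars next to the existing ones.
    for jar in Path(new_jars_dir).glob("*.jar"):
        shutil.copy2(jar, hudi_dir / jar.name)

    repointed = []
    for link in sorted(hudi_dir.iterdir()):
        if not link.is_symlink():
            continue
        target = os.readlink(link)
        if old_version not in target:
            continue
        new_target = target.replace(old_version, new_version)
        # Leave the link untouched if no matching new jar is present.
        if not (link.parent / new_target).exists():
            continue
        link.unlink()
        link.symlink_to(new_target)
        repointed.append(link.name)
    return repointed
```

For example, `repoint_hudi_symlinks("/usr/lib/hudi", "/home/hadoop/hudi-0.5.1-jars", "0.5.0", "0.5.1")` would be one way to call it, where the second path is a made-up location for the locally built jars. The old jars are left in place so the symlinks can be pointed back if needed.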
