Hi Syed,

Please join the mailing list so your responses make it here without needing approval.

I am sure there is something odd going on here. A few things to check:

- Hudi does use memory for caching inputs and computing heuristics. I have seen slowness caused by insufficient executor memory. Can you try a larger heap size and configuring GC? (explained in https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide)
- There is also a performance bug we fixed in 0.5.2. Can you try setting hoodie.memory.merge.max.size=2147483648 (2GB of merge memory)? (The initial load should just be doing an insert, so this may be unrelated. Still something to keep in mind.)

If you can open a GitHub issue with the Spark UI screenshot, data size etc., happy to take a look.

Thanks,
Vinoth

On Wed, Mar 18, 2020 at 4:37 PM Syed Zaidi <[email protected]> wrote:

> Hi Udit,
>
> Thanks for your recommendation. I was able to get the jars for 0.5.1. As a
> test we ran Hudi against a small dataset (~2 million rows with 80 columns)
> in a parquet file against 10 executors (m5.xlarge). The initial load itself
> is taking 2+ hours. Do you have any suggestions on the settings I can
> update to speed up the process?
>
> Thanks
> Syed Zaidi
>
> ________________________________
> From: Mehrotra, Udit <[email protected]>
> Sent: Tuesday, March 17, 2020 8:08 PM
> To: [email protected] <[email protected]>; Syed Zaidi <[email protected]>
> Subject: Re: Question
>
> Hi Zaidi,
>
> You should be able to use Hudi 0.5.1 in the next EMR release that should
> be fairly soon, but we can't give you an ETA. Meanwhile, there is nothing
> really stopping you from building your Hudi 0.5.1 jars and replacing the
> ones on the EMR cluster. The jars are located on the master node at
> /usr/lib/hudi/. Just replace the 0.5.0 jars there and have the symlink
> jars point to your 0.5.1 jars.
>
> Thanks,
> Udit Mehrotra
> SDE | AWS EMR
>
> On 3/17/20, 5:34 PM, "Syed Zaidi" <[email protected]> wrote:
>
>     Hi,
>
>     AWS EMR emr-5.29.0 comes with Spark 2.4.4 and Hudi 0.5.0
>     (hudi-hadoop-mr-bundle-0.5.0-incubating.jar). In version 0.5.1 we have
>     new options for reading the AWS DMS change logs using DeltaStreamer.
>     Do you guys have any idea when AWS will support the newer version of
>     Hudi? What options do I have to upgrade Hudi to the latest version
>     while creating the EMR cluster to support the AWS DMS payload out of
>     the box?
>
>     Would appreciate your feedback in this regard.
>
>     Thanks
>     Syed Zaidi
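
For reference, the two memory suggestions at the top of this thread could be assembled roughly as below. This is only a sketch: the helper name `build_tuning_settings`, the `8g` executor memory and the `-XX:+UseG1GC` flag are illustrative assumptions, not values taken from the thread or the tuning guide. Only `hoodie.memory.merge.max.size=2147483648` comes from the message above.

```python
# Sketch only: executor memory and GC flag are assumed example values;
# pick real ones from the Hudi tuning guide and your instance size.

# 2 GB of merge memory, as suggested in the thread.
MERGE_MEMORY_BYTES = 2 * 1024 ** 3


def build_tuning_settings(executor_memory="8g", gc_options="-XX:+UseG1GC"):
    """Return (spark_submit_args, hudi_conf) for the suggested tuning.

    build_tuning_settings is a hypothetical helper, not part of Hudi or Spark.
    """
    spark_conf = {
        # Larger executor heap (assumed value).
        "spark.executor.memory": executor_memory,
        # JVM GC options for executors (assumed value).
        "spark.executor.extraJavaOptions": gc_options,
    }
    # Flatten into the "--conf key=value" form that spark-submit accepts.
    spark_submit_args = []
    for key, value in spark_conf.items():
        spark_submit_args += ["--conf", f"{key}={value}"]

    # Hudi write config from the thread, kept separate from Spark settings.
    hudi_conf = {"hoodie.memory.merge.max.size": str(MERGE_MEMORY_BYTES)}
    return spark_submit_args, hudi_conf


if __name__ == "__main__":
    args, hudi_conf = build_tuning_settings()
    print(" ".join(args))
    print(hudi_conf)
```

The Hudi setting is kept apart from the Spark ones because it is a Hudi write config rather than a Spark property. It would typically go in as a writer option or in the properties handed to DeltaStreamer; check the docs for your Hudi version for the exact mechanism.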
