sg! As with any database-like system, performance depends on key design and
configuration. Happy to share more tips on tuning if you can give more details on:

- use case: which operation are you using?
- % of the 25 billion records updated in each run (e.g. if you are upserting
  the entire dataset, it will of course be slower than just bulk_inserting)
- can you make the key prefixed by some increasing/ordered value like a
  timestamp?

A lot of this is also covered in the two links I sent.

On Thu, Jul 18, 2019 at 10:37 PM Amarnath Venkataswamy <[email protected]> wrote:

> After I set the shuffle parallelism I was able to complete the job without
> failure, but there is one more challenge: reducing the GC time. Currently it
> is taking 20 to 30% per task of the overall run time.
>
> I have to test GC with extra Java options by tomorrow.
>
> My goal is to do the update on 25 billion rows spanning 100 days of
> partitions, with 240 million records (2GB size) in each partition, with 50%
> of updates on the previous day's partition and the rest spread across the
> remaining 99 days.
>
> Currently it is taking 30 to 40 mins just to write into 1 partition. Out of
> this, 20 to 30% of the time goes to GC.
>
> If we can do this in less than one to 2 hours (incremental update: 240
> million daily) after tuning all the memory and other parameters I would be
> very happy.
>
> On Fri, Jul 19, 2019 at 12:19 AM Amarnath Venkataswamy <[email protected]> wrote:
>
> > yes. I am looking for the same thing only.
> >
> > On Thu, Jul 18, 2019 at 9:20 PM Vinoth Chandar <[email protected]> wrote:
> >
> >> No real reason. If you notice a sample configuration is presented under
> >> “gc tuning” section and asks the user to add it to extraJavaOptions. Its
> >> separate coz its for cms and someone else may want to do g1
> >>
> >> On Thu, Jul 18, 2019 at 5:26 PM Gary Li <[email protected]> wrote:
> >>
> >> > One related question. The GC tuning part says [must] use G1/CMS collector,
> >> > but the recommended production config doesn’t specify any GC. Is there a
> >> > reason behind this?
> >> >
> >> > On Thu, Jul 18, 2019 at 9:37 AM Vinoth Chandar <[email protected]> wrote:
> >> >
> >> > > https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
> >> > > https://hudi.apache.org/performance.html
> >> > > are good resources for what you need.
> >> > >
> >> > > On Thu, Jul 18, 2019 at 7:37 AM Amarnath Venkataswamy <[email protected]> wrote:
> >> > >
> >> > > > Hi
> >> > > >
> >> > > > Can you anyone of you share the Spark configuration used at UBER I didn't
> >> > > > save that link to my favorites.
> >> > > >
> >> > > > I am currently doing some performance test against 240million records and
> >> > > > job is failing for one or other reasons with memory.
> >> > > >
> >> > > > Regards
> >> > > > Amarnath
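
For readers of this thread, the suggestion in the top message to prefix the key with an increasing/ordered value such as a timestamp could look roughly like the sketch below. It is illustrative only and not taken from the thread or from Hudi: the helper name `make_record_key`, the `_` separator, and the `order-42` style ids are all hypothetical, and it assumes each record carries a timezone-aware event time.

```python
from datetime import datetime, timezone


def make_record_key(event_time: datetime, record_id: str) -> str:
    """Hypothetical helper: build a key whose string order follows event time."""
    # A fixed-width UTC timestamp keeps lexicographic order equal to
    # chronological order (assumption: event_time is timezone-aware).
    prefix = event_time.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{prefix}_{record_id}"


# Two made-up records; the ids are placeholders, not real data.
earlier = make_record_key(datetime(2019, 7, 18, 7, 37, tzinfo=timezone.utc), "order-42")
later = make_record_key(datetime(2019, 7, 19, 0, 19, tzinfo=timezone.utc), "order-07")

print(earlier)
print(later)
print(sorted([later, earlier]) == [earlier, later])
```

Keys built this way sort by time first and by the original id second, which is the "increasing/ordered" property the message asks about. Whether it helps a given workload still depends on the operation and update pattern raised in the other two questions.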
