Re: Spark Configuration

Vinoth Chandar Fri, 19 Jul 2019 05:51:49 -0700

sg!

As with any database-like system, performance depends on key design
and configuration.
Happy to share more tips on tuning if you can give more details on:

- the use case: which operation are you using?
- the % of the 25 billion records updated in each run (e.g. if you are
upserting the entire dataset, it will of course be slower than just
bulk_inserting)
- whether you can prefix the key with some increasing/ordered value, like a
timestamp

A lot of this is also covered in the two links I sent.
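
To make the last bullet concrete, here is a minimal sketch of a key prefixed by an ordered value. The helper name `make_record_key`, the `_` separator, and the fixed-width UTC format are all assumptions for illustration, not something taken from the Hudi docs. The idea is only that keys created around the same time sort next to each other.

```python
from datetime import datetime, timezone


def make_record_key(event_time: datetime, entity_id: str) -> str:
    """Hypothetical helper: build a key whose lexical order follows event time."""
    # A fixed-width UTC timestamp (yyyyMMddHHmmss) sorts lexically in time order.
    prefix = event_time.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # The entity id keeps keys unique within the same second (assumed unique per entity).
    return f"{prefix}_{entity_id}"


if __name__ == "__main__":
    keys = [
        make_record_key(datetime(2019, 7, 18, 22, 37, tzinfo=timezone.utc), "order-42"),
        make_record_key(datetime(2019, 7, 17, 9, 0, tzinfo=timezone.utc), "order-7"),
    ]
    # Sorting the strings gives the same order as sorting by event time.
    print(sorted(keys))
```

Whether this helps in your case depends on whether the record's timestamp is stable across updates; if it can change, it is not a good key component.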


On Thu, Jul 18, 2019 at 10:37 PM Amarnath Venkataswamy <
[email protected]> wrote:

> After I set the shuffle parallelism I am able to complete the job without
> failure, but there is one more challenge: reducing the GC time. Currently GC
> takes 20 to 30% of the overall run time per task.
>
> I have to test GC with extra Java options by tomorrow.
>
> My goal is to update 25 billion rows spanning 100 daily partitions,
> with 240 million records (2GB) in each partition; 50% of the updates
> go to the previous day's partition and the rest are spread across the
> remaining 99 days.
>
> Currently it takes 30 to 40 minutes just to write into 1
> partition. Out of this, 20 to 30% of the time goes to GC.
>
> If we can do this in less than one to two hours (incremental update: 240
> million daily) after tuning all the memory and other parameters, I would be
> very happy.
>
>
>
>
> On Fri, Jul 19, 2019 at 12:19 AM Amarnath Venkataswamy <
> [email protected]> wrote:
>
> > Yes, that is exactly what I am looking for.
> >
> > On Thu, Jul 18, 2019 at 9:20 PM Vinoth Chandar <[email protected]> wrote:
> >
> >> No real reason. If you notice, a sample configuration is presented under
> >> the “gc tuning” section, which asks the user to add it to extraJavaOptions.
> >> It's separate because it's for CMS, and someone else may want to use G1.
> >>
> >> On Thu, Jul 18, 2019 at 5:26 PM Gary Li <[email protected]> wrote:
> >>
> >> > One related question. The GC tuning part says [must] use G1/CMS collector,
> >> > but the recommended production config doesn’t specify any GC. Is there a
> >> > reason behind this?
> >> >
> >> > On Thu, Jul 18, 2019 at 9:37 AM Vinoth Chandar <[email protected]> wrote:
> >> >
> >> > > https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
> >> > > https://hudi.apache.org/performance.html
> >> > > are good resources for what you need.
> >> > >
> >> > > On Thu, Jul 18, 2019 at 7:37 AM Amarnath Venkataswamy <
> >> > > [email protected]> wrote:
> >> > >
> >> > > > Hi
> >> > > >
> >> > > > Can any of you share the Spark configuration used at Uber? I didn't
> >> > > > save that link to my favorites.
> >> > > >
> >> > > > I am currently running a performance test against 240 million records,
> >> > > > and the job keeps failing for one memory-related reason or another.
> >> > > >
> >> > > > Regards
> >> > > > Amarnath
> >> > > >
> >> > >
> >> >
> >>
> >
>
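
For reference, the GC exchange quoted above (collector flags go into extraJavaOptions, kept separate so each user can pick CMS or G1) can be sketched as follows. This is only an illustration: the helper names `build_gc_conf` and `to_submit_args` are made up here, and the flag set is a bare minimum rather than the sample configuration from the tuning guide. `spark.executor.extraJavaOptions` and `spark.driver.extraJavaOptions` are standard Spark properties; `-XX:+UseG1GC` and `-XX:+UseConcMarkSweepGC` are standard HotSpot flags (CMS is assumed to be available, which holds on Java 8).

```python
from typing import Dict, List

# Collector choice is kept separate from the rest of the config, as in the thread.
GC_FLAGS = {
    "g1": "-XX:+UseG1GC",
    "cms": "-XX:+UseConcMarkSweepGC",
}


def build_gc_conf(collector: str) -> Dict[str, str]:
    """Hypothetical helper: Spark properties that select a GC and log GC activity."""
    if collector not in GC_FLAGS:
        raise ValueError(f"unknown collector: {collector!r}")
    # -verbose:gc prints a line per collection, which helps measure GC time per task.
    java_opts = f"{GC_FLAGS[collector]} -verbose:gc"
    return {
        "spark.driver.extraJavaOptions": java_opts,
        "spark.executor.extraJavaOptions": java_opts,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a property dict into spark-submit style `--conf key=value` arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(to_submit_args(build_gc_conf("g1")))
```

Swapping `"g1"` for `"cms"` changes only the collector flag, which makes it easy to compare GC time between the two on the same job.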
