In general, please see
https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide for more tips
on tuning this in a real setting.

What lamber-ken mentioned should alleviate the issue of serialization, if
that's the bottleneck. Since Hudi uses Spark caching during the upsert
operation, also ensure you have sufficient Spark executor memory.
In general, it may make more sense to run the benchmark on a real
cluster and observe the bottlenecks. Some tradeoffs we make (e.g., caching)
may seem like overhead when running with a small amount of data, but really
come in handy when scaling up.
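As a starting point for the executor-memory suggestion above, here is an
illustrative spark-submit sketch. All sizes and counts below are assumptions
for illustration only, not recommendations; tune them against your own
workload and cluster.

```shell
# Illustrative flags only -- the memory/core/executor values are
# placeholders, not tuned recommendations. "your-hudi-job.jar" is a
# hypothetical application jar name.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.memory.fraction=0.6 \
  your-hudi-job.jar
```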

I have some parallel efforts going on to try to make the out-of-the-box
single-node benchmark better, but until then, if we can engage on a GitHub
issue where you can paste code snippets, the Spark UI, etc., I'm happy to
work with you to get that time down.

thanks
vinoth

On Wed, Mar 11, 2020 at 11:25 AM lamberken <[email protected]> wrote:

>
>
> Hi,
>
>
> The unit is byte, it is an example, you need to modify it according to
> your own env.
>
>
> Best,
> Lamber-Ken
>
>
>
> At 2020-03-12 01:51:20, "selvaraj periyasamy" <
> [email protected]> wrote:
> >Thanks . What is this number 2004857600000? is it in bits or bytes?
> >
> >Thanks,
> >Selva
> >
> >On Tue, Mar 10, 2020 at 2:57 AM lamberken <[email protected]> wrote:
> >
> >>
> >>
> >> hi,
> >>
> >>
> >> IMO, when upserting 150K records with 100 columns, these records need
> >> to be serialized to disk and deserialized from disk.
> >> You can try adding < option("hoodie.memory.merge.max.size",
> "2004857600000") >
> >>
> >>
> >> best,
> >> lamber-ken
> >>
> >>
> >>
> >>
> >>
> >> At 2020-03-10 17:07:58, "selvaraj periyasamy" <
> >> [email protected]> wrote:
> >>
> >> Sorry for the partial emails. My company portal doesn't allow me to add
> >> test code. I am using the 0.5.0 version of Hudi jars built from my
> >> local. While running upsert, it takes more than 6 or 7 mins to process
> >> 150K records.
> >>
> >>
> >>
> >> Is there any tuning that could reduce the processing time from 6 or 7
> >> mins? Overwrite takes less than a min. Each row has 100 columns.
> >>
> >>
> >>
> >> Thanks,
> >> Selva
> >>
> >>
> >> On Tue, Mar 10, 2020 at 1:51 AM selvaraj periyasamy <
> >> [email protected]> wrote:
> >>
> >> Team,
> >>
> >>
> >> I am using the 0.5.0 version of Hudi jars built from my local. While
> >> running upsert, it takes more than 6 or 7 mins to process 150K records.
> >> Below are the code and logs.
> >>
> >>
> >> 20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer
> >> records
> >> 20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
> >> 20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering
> >> records
> >> 20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is
> >> done; notifying producer threads
> >>
> >>
> >> 20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer
> >> records
> >> 20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
> >> 20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering
> >> records
> >> 20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is
> >> done; notifying producer threads
> >>
> >>
> >> While running insert
> >>
> >>
> >> On Tue, Mar 10, 2020 at 1:45 AM selvaraj periyasamy <
> >> [email protected]> wrote:
> >>
> >> Team,
> >>
> >>
> >> I am using the 0.5.0 version of Hudi jars built from my local. While
> >> running upsert
> >>
> >>
> >> 20/03/10 07:26:09 INFO IteratorBasedQueueProducer: starting to buffer
> >> records
> >> 20/03/10 07:26:09 INFO BoundedInMemoryExecutor: starting consumer thread
> >> 20/03/10 07:33:59 INFO IteratorBasedQueueProducer: finished buffering
> >> records
> >> 20/03/10 07:34:00 INFO BoundedInMemoryExecutor: Queue Consumption is
> >> done; notifying producer threads
> >>
> >>
> >> 20/03/10 07:26:08 INFO IteratorBasedQueueProducer: starting to buffer
> >> records
> >> 20/03/10 07:26:08 INFO BoundedInMemoryExecutor: starting consumer thread
> >> 20/03/10 07:33:31 INFO IteratorBasedQueueProducer: finished buffering
> >> records
> >> 20/03/10 07:33:31 INFO BoundedInMemoryExecutor: Queue Consumption is
> >> done; notifying producer threads
> >>
> >>
> >>
> >>
>
