Yes, that’s also on the cards, and for developers that’s fine, but we need to provide an interface for our ops people to execute queries from Presto. So I need to figure out how to calculate the hash when they fire a query on the primary key. They can fire a query that includes the primary key along with other fields. That is the only problem I see with hash partitions, and to get it to work I believe I need to go deeper into the Presto Hudi plugin.
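As a rough sketch of what I mean (Spark/Scala, with hypothetical table and column names, and assuming the partition value is a simple reproducible bucket such as a modulo rather than an opaque hash), the ops-facing query layer could compute the bucket itself and append the partition predicate before handing the SQL to Presto, without any change to the Presto Hudi plugin:

// Minimal sketch, not the Presto Hudi plugin itself. The table name
// (hive.analytics.orders), column names (order_id, id_bucket) and the
// bucket count are hypothetical and must match whatever was used at write time.
object BucketedQueryHelper {
  val NumBuckets = 1024L

  // Same bucketing rule that was applied when writing the Hudi dataset.
  def bucketOf(primaryKey: Long): Long = Math.floorMod(primaryKey, NumBuckets)

  // Add the partition predicate so Presto scans only one bucket.
  def pruneByKey(primaryKey: Long): String =
    s"""SELECT *
       |FROM hive.analytics.orders
       |WHERE order_id = $primaryKey
       |  AND id_bucket = ${bucketOf(primaryKey)}""".stripMargin
}

For example, BucketedQueryHelper.pruneByKey(4211L) adds "AND id_bucket = 115", so only the id_bucket=115 partition is scanned. With a plain modulo, analysts could also just write the predicate by hand as order_id % 1024.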
On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah <[email protected]> wrote:

> Hi Tanu,
>
> If your primary key is an integer you can add one more field as a hash of
> the integer and partition based on the hash field. It will add some
> complexity to read and write because the hash has to be computed prior to
> each read or write. Not sure whether the overhead of doing this exceeds the
> performance gains due to fewer partitions. I wonder why HUDI doesn't
> directly support hash based partitions?
>
> Thanks
> Jaimin
>
> On Wed, 3 Jun 2020 at 10:07, tanu dua <[email protected]> wrote:
>
> > Thanks Vinoth for the detailed explanation. I was thinking on the same
> > lines and I will relook. We can reduce the 2nd and 3rd partitions, but it's
> > very difficult to reduce the 1st partition as that is the basic primary key
> > of our domain model on which analysts and developers need to query almost
> > 90% of the time, and it's an integer primary key that can't be decomposed
> > further.
> >
> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi tanu,
> > >
> > > For good query performance, it's recommended to write optimally sized
> > > files. Hudi already ensures that.
> > >
> > > Generally speaking, if you have too many partitions, then it also means
> > > too many files. Most people limit their datasets to 1000s of partitions,
> > > since queries typically crunch data based on time or a business domain
> > > (e.g. city for Uber). Partitioning too granularly - say based on user_id -
> > > is not very useful unless your queries only crunch per user. If you are
> > > using the Hive metastore, then 25M partitions mean 25M rows in your
> > > backing MySQL metastore db as well - not very scalable.
> > >
> > > What I am trying to say is: even outside of Hudi, if analytics is your
> > > use case, it might be worth partitioning at a lower granularity and
> > > increasing the rows per parquet file.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <[email protected]> wrote:
> > >
> > > > Hi,
> > > > We have a requirement to ingest 30M records into S3 backed by HUDI. I
> > > > am figuring out the partition strategy and ending up with a lot of
> > > > partitions - 25M partitions (primary partition) --> 2.5M (secondary
> > > > partition) --> 2.5M (third partition) - and each parquet file will have
> > > > less than 10 rows of data.
> > > >
> > > > Our dataset will be ingested at once in full and then it will be
> > > > incremental daily with less than 1k updates. So it's more read heavy
> > > > than write heavy.
> > > >
> > > > So what is the suggestion in terms of HUDI performance - go ahead with
> > > > the above partition strategy, or shall I reduce my partitions and
> > > > increase the number of rows in each parquet file?
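For reference, here is a minimal sketch of the write side of the hash/bucket idea from the quoted thread above (Spark datasource with the usual Hudi write options; the input path, column names, precombine field and bucket count of 1024 are assumptions): derive a bucket column from the integer primary key and use it as the Hudi partition path, so the 25M distinct keys collapse into roughly 1024 partitions while a query on the key can still prune to a single partition.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, lit, pmod}

// Sketch of the suggestion: partition on a derived bucket of the
// integer primary key instead of on the raw key itself.
val spark = SparkSession.builder().appName("hudi-bucketed-write").getOrCreate()

val numBuckets = 1024 // hypothetical; must match whatever the query side uses

val input = spark.read.parquet("s3://my-bucket/staging/orders") // hypothetical input
  .withColumn("id_bucket", pmod(col("order_id"), lit(numBuckets)))

input.write.format("hudi")
  .option("hoodie.table.name", "orders")
  .option("hoodie.datasource.write.recordkey.field", "order_id")
  .option("hoodie.datasource.write.partitionpath.field", "id_bucket")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/orders")

Using pmod (a plain modulo) rather than a real hash function keeps the rule trivially reproducible in Presto SQL (order_id % 1024), which sidesteps the problem of matching an opaque hash on the query side.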
