This is a good conversation. The ask for bucketed table support has not
actually come up much, since if you are looking up things at that
granularity, it almost feels like you are doing OLTP/database-style
queries?

Assuming you hash the primary key into a value that denotes the partition,
a simple workaround is to always add a where clause using a UDF in Presto,
i.e. where key = 123 and partition = hash_udf(123).
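
For example, something like the following (just a sketch; my_hudi_table and
partition_path are hypothetical names, and hash_udf stands in for whatever
function mirrors your partitioning hash):

    select *
    from my_hudi_table
    where key = 123
      and partition_path = hash_udf(123);

Without the second predicate, the engine cannot prune partitions and falls
back to scanning all of them.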

But of course the downside is that your ops team needs to remember to add
the second partition clause (which is not very different from querying
large time-partitioned tables today).

Our mid-term plan is to build out column indexes (RFC-15 has the details,
if you are interested).

On Wed, Jun 3, 2020 at 2:54 AM tanu dua <[email protected]> wrote:

> If I need to plug in this hashing algorithm to resolve the partitions in
> Presto and Hive, what is the code I should look into?
>
> On Wed, Jun 3, 2020, 12:04 PM tanu dua <[email protected]> wrote:
>
> > Yes, that’s also on the cards, and for developers that’s ok, but we
> > need to provide an interface for our ops people to execute queries from
> > Presto. So I need to find out, if they fire a query on the primary key,
> > how I can calculate the hash. They can fire a query including the
> > primary key along with other fields. That is the only problem I see with
> > hash partitions, and to get it to work I believe I need to go deeper
> > into the Presto Hudi plugin.
> >
> > On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah <[email protected]>
> > wrote:
> >
> >> Hi Tanu,
> >>
> >> If your primary key is an integer, you can add one more field as a
> >> hash of the integer and partition based on that hash field. It will add
> >> some complexity to reads and writes because the hash has to be computed
> >> prior to each read or write. Not sure whether the overhead of doing
> >> this exceeds the performance gains from having fewer partitions. I
> >> wonder why Hudi doesn't directly support hash-based partitions?
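> >>
> >> For instance (just a sketch; the id column and the bucket count of 128
> >> are placeholders), the hash field could be derived in Spark SQL before
> >> the Hudi write, with the same expression applied to the key at query
> >> time:
> >>
> >>     select t.*, pmod(hash(t.id), 128) as hash_bucket
> >>     from source_table t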
> >>
> >> Thanks
> >> Jaimin
> >>
> >> On Wed, 3 Jun 2020 at 10:07, tanu dua <[email protected]> wrote:
> >>
> >> > Thanks Vinoth for the detailed explanation. I was thinking along
> >> > the same lines and will take another look. We can reduce the 2nd and
> >> > 3rd partitions, but it’s very difficult to reduce the 1st partition,
> >> > as that is the basic primary key of our domain model, which analysts
> >> > and developers need to query almost 90% of the time; it’s an integer
> >> > primary key and can’t be decomposed further.
> >> >
> >> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <[email protected]>
> >> wrote:
> >> >
> >> > > Hi Tanu,
> >> > >
> >> > > For good query performance, it's recommended to write optimally
> >> > > sized files. Hudi already ensures that.
> >> > >
> >> > > Generally speaking, if you have too many partitions, then you
> >> > > also have too many files. Most people limit themselves to 1000s of
> >> > > partitions in their datasets, since queries typically crunch data
> >> > > based on time or a business domain (e.g. city for Uber).
> >> > > Partitioning too granularly - say, based on user_id - is not very
> >> > > useful unless your queries only crunch per user. If you are using
> >> > > the Hive metastore, then 25M partitions mean 25M rows in your
> >> > > backing MySQL metastore db as well - not very scalable.
> >> > >
> >> > > What I am trying to say is: even outside of Hudi, if analytics is
> >> > > your use case, it might be worth partitioning at a coarser
> >> > > granularity and increasing the number of rows per parquet file.
> >> > >
> >> > > Thanks
> >> > > Vinoth
> >> > >
> >> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <[email protected]> wrote:
> >> > >
> >> > > > Hi,
> >> > > > We have a requirement to ingest 30M records into S3 backed by
> >> > > > HUDI. I am figuring out the partition strategy and ending up with
> >> > > > a lot of partitions, like 25M partitions (primary partition) -->
> >> > > > 2.5M (secondary partition) --> 2.5M (third partition), and each
> >> > > > parquet file will have fewer than 10 rows of data.
> >> > > >
> >> > > > Our dataset will be ingested in full at once and will then be
> >> > > > incremental daily, with fewer than 1k updates. So it's more read
> >> > > > heavy than write heavy.
> >> > > >
> >> > > > So what would be the suggestion in terms of HUDI performance -
> >> > > > go ahead with the above partition strategy, or shall I reduce my
> >> > > > partitions and increase the number of rows in each parquet file?
> >> > > >
> >> > >
> >> >
> >>
> >
>
