If I need to plug in this hashing algorithm to resolve the partitions in
Presto and Hive, what code should I look into?

On Wed, Jun 3, 2020, 12:04 PM tanu dua <tanu.dua...@gmail.com> wrote:

> Yes, that's also on the cards, and for developers that's ok, but we need to
> provide an interface for our ops people to execute queries from Presto, so I
> need to find out how they can calculate the hash when they fire a query on
> the primary key. They can fire a query including the primary key along with
> other fields. That is the only problem I see with hash partitions, and to
> get it to work I believe I need to go deeper into the Presto Hudi plugin.
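>
> Something along these lines could let them look up the partition for a given
> key (an untested sketch, assuming the write side bucketed the integer key
> with Spark's hash() into 1000 buckets as in the suggestion further down;
> 12345 is just an example key):
>
>   // Compute the bucket the writer would have assigned to primary key 12345.
>   spark.sql("SELECT pmod(hash(12345), 1000) AS bucket").show()
>   // Ops can then add "bucket = <that value>" alongside the primary-key
>   // predicate in their Presto query so only one partition is scanned.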
>
> On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah <shahjaimin0...@gmail.com>
> wrote:
>
>> Hi Tanu,
>>
>> If your primary key is an integer, you can add one more field as a hash of
>> that integer and partition based on the hash field. It will add some
>> complexity to reads and writes because the hash has to be computed prior to
>> each read or write. I am not sure whether the overhead of doing this
>> exceeds the performance gains from having fewer partitions. I wonder why
>> Hudi doesn't directly support hash-based partitions?
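>>
>> A rough sketch of what I mean (assuming Spark + the Hudi datasource and an
>> existing DataFrame df with an integer primary key column "pk" and an
>> ordering column "ts"; all names and the bucket count are just illustrative):
>>
>>   import org.apache.spark.sql.SaveMode
>>   import org.apache.spark.sql.functions._
>>
>>   // Derive a bounded bucket from the integer key and partition on the
>>   // bucket instead of on the raw key itself.
>>   val numBuckets = 1000
>>   val bucketed = df.withColumn("bucket", pmod(hash(col("pk")), lit(numBuckets)))
>>
>>   bucketed.write.format("hudi").
>>     option("hoodie.datasource.write.recordkey.field", "pk").
>>     option("hoodie.datasource.write.partitionpath.field", "bucket").
>>     option("hoodie.datasource.write.precombine.field", "ts").
>>     option("hoodie.table.name", "my_table").
>>     mode(SaveMode.Append).
>>     save("s3://bucket/path/my_table")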
>>
>> Thanks
>> Jaimin
>>
>> On Wed, 3 Jun 2020 at 10:07, tanu dua <tanu.dua...@gmail.com> wrote:
>>
>> > Thanks Vinoth for the detailed explanation. I was thinking along the same
>> > lines and I will take another look. We can reduce the 2nd and 3rd
>> > partitions, but it's very difficult to reduce the 1st partition, as that
>> > is the basic primary key of our domain model, which analysts and
>> > developers need to query against almost 90% of the time, and it's an
>> > integer primary key that can't be decomposed further.
>> >
>> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <vin...@apache.org> wrote:
>> >
>> > > Hi Tanu,
>> > >
>> > > For good query performance, it's recommended to write optimally sized
>> > > files. Hudi already ensures that.
>> > >
>> > > Generally speaking, if you have too many partitions, then it also means
>> > > too many files. Most people limit their datasets to thousands of
>> > > partitions, since queries typically crunch data based on time or a
>> > > business domain (e.g. city for Uber). Partitioning too granularly, say
>> > > based on user_id, is not very useful unless your queries only crunch
>> > > data per user. If you are using the Hive metastore, then 25M partitions
>> > > mean 25M rows in your backing MySQL metastore db as well, which is not
>> > > very scalable.
>> > >
>> > > What I am trying to say is: even outside of Hudi, if analytics is your
>> > > use case, it might be worth partitioning at a lower granularity and
>> > > increasing the number of rows per parquet file.
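>> > >
>> > > As a rough sketch of that (assuming Spark + the Hudi datasource and an
>> > > existing DataFrame df with an integer key "pk", an ordering column "ts"
>> > > and some coarse business field "region"; all names and size values here
>> > > are illustrative, not recommendations):
>> > >
>> > >   import org.apache.spark.sql.SaveMode
>> > >
>> > >   // Partition on a coarse, bounded column and let Hudi's file-sizing
>> > >   // configs pack many rows into each parquet file.
>> > >   df.write.format("hudi").
>> > >     option("hoodie.datasource.write.recordkey.field", "pk").
>> > >     option("hoodie.datasource.write.partitionpath.field", "region").
>> > >     option("hoodie.datasource.write.precombine.field", "ts").
>> > >     option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).
>> > >     option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString).
>> > >     option("hoodie.table.name", "my_table").
>> > >     mode(SaveMode.Append).
>> > >     save("s3://bucket/path/my_table")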
>> > >
>> > > Thanks
>> > > Vinoth
>> > >
>> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <tanu.dua...@gmail.com> wrote:
>> > >
>> > > > Hi,
>> > > > We have a requirement to ingest 30M records into S3 backed by Hudi. I
>> > > > am figuring out the partition strategy and ending up with a lot of
>> > > > partitions, like 25M partitions (primary partition) --> 2.5M
>> > > > (secondary partition) --> 2.5M (third partition), where each parquet
>> > > > file will have fewer than 10 rows of data.
>> > > >
>> > > > Our dataset will be ingested in full at once and then incrementally
>> > > > daily with fewer than 1k updates, so it is more read heavy than write
>> > > > heavy.
>> > > >
>> > > > So what is the suggestion in terms of Hudi performance: go ahead with
>> > > > the above partition strategy, or shall I reduce my partitions and
>> > > > increase the number of rows in each parquet file?
>> > > >
>> > >
>> >
>>
>
