If I need to plug in this hashing algorithm to resolve the partitions in Presto and Hive, what code should I look into?
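For what it's worth, the core of the problem is that the same deterministic bucket function has to be applied when building the partition path and again when writing the query, so the partition predicate can be added explicitly instead of changing the Presto/Hive connectors. Below is a minimal sketch of that idea in Java, with assumed names (NUM_BUCKETS, a "bucket" partition column, an integer "primary_key"); it is not Hudi or Presto connector code.

    // Illustrative only: one shared, deterministic bucket function. The query
    // side (or a thin wrapper used by ops) recomputes the bucket and adds the
    // partition predicate by hand, so the engine's normal partition pruning applies.
    public final class HashBucket {
        static final int NUM_BUCKETS = 1024;  // assumed bucket count

        // Map an integer primary key to a stable bucket id in [0, NUM_BUCKETS).
        static int bucketOf(long primaryKey) {
            return (int) Math.floorMod(primaryKey, (long) NUM_BUCKETS);
        }

        public static void main(String[] args) {
            long pk = 123_456_789L;
            int bucket = bucketOf(pk);  // 277 for these example values
            // An ops query would then look like (hypothetical Presto/Hive table):
            //   SELECT * FROM my_table WHERE bucket = 277 AND primary_key = 123456789;
            System.out.printf("primary_key=%d -> bucket=%d%n", pk, bucket);
        }
    }

The trade-off, as discussed in the thread below, is that whoever writes the query has to know and recompute the bucket function; the connector itself only prunes on whatever partition-column predicate it is given.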
On Wed, Jun 3, 2020, 12:04 PM tanu dua <tanu.dua...@gmail.com> wrote:

> Yes, that's also on the cards, and for developers that's OK, but we need to
> provide an interface for our ops people to execute queries from Presto, so I
> need to find out how they can calculate the hash if they fire a query on the
> primary key. They can fire a query including the primary key along with other
> fields. That is the only problem I see with hash partitions, and to get it to
> work I believe I need to go deeper into the Presto Hudi plugin.
>
> On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah <shahjaimin0...@gmail.com> wrote:
>
>> Hi Tanu,
>>
>> If your primary key is an integer, you can add one more field as a hash of
>> that integer and partition based on the hash field. It will add some
>> complexity to reads and writes because the hash has to be computed prior to
>> each read or write.
>> Not sure whether the overhead of doing this exceeds the performance gains
>> from having fewer partitions. I wonder why Hudi doesn't directly support
>> hash-based partitions?
>>
>> Thanks
>> Jaimin
>>
>> On Wed, 3 Jun 2020 at 10:07, tanu dua <tanu.dua...@gmail.com> wrote:
>>
>> > Thanks Vinoth for the detailed explanation. Even I was thinking along the
>> > same lines and I will relook. We can reduce the 2nd and 3rd partitions,
>> > but it's very difficult to reduce the 1st partition, as that is the basic
>> > primary key of our domain model, on which analysts and developers need to
>> > query almost 90% of the time, and it is an integer primary key that can't
>> > be decomposed further.
>> >
>> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <vin...@apache.org> wrote:
>> >
>> > > Hi Tanu,
>> > >
>> > > For good query performance, it's recommended to write optimally sized
>> > > files. Hudi already ensures that.
>> > >
>> > > Generally speaking, if you have too many partitions, then you also have
>> > > too many files. Most people limit their datasets to thousands of
>> > > partitions, since queries typically crunch data based on time or a
>> > > business domain (e.g. city for Uber). Partitioning too granularly - say,
>> > > based on user_id - is not very useful unless your queries only crunch
>> > > data per user. If you are using the Hive metastore, then 25M partitions
>> > > also mean 25M rows in your backing MySQL metastore DB - not very
>> > > scalable.
>> > >
>> > > What I am trying to say is: even outside of Hudi, if analytics is your
>> > > use case, it might be worth partitioning at a lower granularity and
>> > > increasing the number of rows per parquet file.
>> > >
>> > > Thanks
>> > > Vinoth
>> > >
>> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <tanu.dua...@gmail.com> wrote:
>> > >
>> > > > Hi,
>> > > > We have a requirement to ingest 30M records into S3 backed by Hudi. I
>> > > > am figuring out the partition strategy and am ending up with a lot of
>> > > > partitions: 25M partitions (primary partition) --> 2.5M (secondary
>> > > > partition) --> 2.5M (third partition), and each parquet file will have
>> > > > fewer than 10 rows of data.
>> > > >
>> > > > Our dataset will be ingested in full at once and will then be updated
>> > > > incrementally with fewer than 1K updates daily. So it is more read
>> > > > heavy than write heavy.
>> > > >
>> > > > So what is the suggestion in terms of Hudi performance - go ahead with
>> > > > the above partition strategy, or should I reduce the number of
>> > > > partitions and increase the number of rows in each parquet file?
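To make Jaimin's suggestion above concrete, here is a rough sketch of the write side: deriving a bounded hash-bucket column from the integer primary key before handing the frame to Hudi, so the table ends up with on the order of a thousand partitions instead of one per key. The column names, bucket count, table name, and paths are made up for illustration; the Hudi option keys are the standard datasource write configs.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.*;

    public class HashPartitionedWrite {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hash-partition-demo")
                    .getOrCreate();

            // Hypothetical source data with an integer primary_key and a ts column.
            Dataset<Row> df = spark.read().parquet("s3://my-bucket/source/");

            // Derive a bounded partition column from the primary key (1024 buckets).
            Dataset<Row> withBucket =
                    df.withColumn("bucket", pmod(hash(col("primary_key")), lit(1024)));

            withBucket.write()
                    .format("org.apache.hudi")
                    .option("hoodie.table.name", "my_table")
                    .option("hoodie.datasource.write.recordkey.field", "primary_key")
                    .option("hoodie.datasource.write.partitionpath.field", "bucket")
                    .option("hoodie.datasource.write.precombine.field", "ts")
                    .mode(SaveMode.Append)
                    .save("s3://my-bucket/hudi/my_table/");
        }
    }

The catch, as noted in the thread, is on the read side: Spark's hash() is a Murmur3 hash, and a Presto or Hive query has no identical built-in, so any query that wants partition pruning has to recompute the same bucket value with the same function - exactly the coupling Tanu is worried about. Using a simpler, engine-agnostic function such as primary_key % 1024 makes that recomputation trivial, at the cost of potentially less uniform buckets for skewed key distributions.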