Yes, that’s also on the cards, and for developers that’s fine, but we need to provide an interface for our ops people to execute queries from Presto. So I need to figure out how to calculate the hash when they fire a query on the primary key. They can fire a query that includes the primary key along with other fields. That is the only problem I see with hash partitions, and to get it to work I believe I need to go deeper into the Presto Hudi plugin.
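As a rough sketch of what I mean (Spark/Scala, with hypothetical table and column names, and assuming the partition value is a simple reproducible bucket such as a modulo rather than an opaque hash), the ops-facing query layer could compute the bucket itself and append the partition predicate before handing the SQL to Presto, without any change to the Presto Hudi plugin:

// Minimal sketch, not the Presto Hudi plugin itself. The table name
// (hive.analytics.orders), column names (order_id, id_bucket) and the
// bucket count are hypothetical and must match whatever was used at write time.
object BucketedQueryHelper {
  val NumBuckets = 1024L

  // Same bucketing rule that was applied when writing the Hudi dataset.
  def bucketOf(primaryKey: Long): Long = Math.floorMod(primaryKey, NumBuckets)

  // Add the partition predicate so Presto scans only one bucket.
  def pruneByKey(primaryKey: Long): String =
    s"""SELECT *
       |FROM hive.analytics.orders
       |WHERE order_id = $primaryKey
       |  AND id_bucket = ${bucketOf(primaryKey)}""".stripMargin
}

For example, BucketedQueryHelper.pruneByKey(4211L) adds "AND id_bucket = 115", so only the id_bucket=115 partition is scanned. With a plain modulo, analysts could also just write the predicate by hand as order_id % 1024.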
On Wed, 3 Jun 2020 at 11:48 AM, Jaimin Shah <[email protected]> wrote:

> Hi Tanu,
>
> If your primary key is an integer you can add one more field as a hash of
> the integer and partition based on the hash field. It will add some
> complexity to read and write because the hash has to be computed prior to
> each read or write. Not sure whether the overhead of doing this exceeds the
> performance gains due to fewer partitions. I wonder why HUDI doesn't
> directly support hash based partitions?
>
> Thanks
> Jaimin
>
> On Wed, 3 Jun 2020 at 10:07, tanu dua <[email protected]> wrote:
>
> > Thanks Vinoth for the detailed explanation. I was thinking on the same
> > lines and I will relook. We can reduce the 2nd and 3rd partitions, but it's
> > very difficult to reduce the 1st partition as that is the basic primary key
> > of our domain model on which analysts and developers need to query almost
> > 90% of the time, and it's an integer primary key that can't be decomposed
> > further.
> >
> > On Wed, 3 Jun 2020 at 9:23 AM, Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi tanu,
> > >
> > > For good query performance, it's recommended to write optimally sized
> > > files. Hudi already ensures that.
> > >
> > > Generally speaking, if you have too many partitions, then it also means
> > > too many files. Most people limit their datasets to 1000s of partitions,
> > > since queries typically crunch data based on time or a business domain
> > > (e.g. city for Uber). Partitioning too granularly - say based on user_id -
> > > is not very useful unless your queries only crunch per user. If you are
> > > using the Hive metastore, then 25M partitions mean 25M rows in your
> > > backing MySQL metastore db as well - not very scalable.
> > >
> > > What I am trying to say is: even outside of Hudi, if analytics is your
> > > use case, it might be worth partitioning at a lower granularity and
> > > increasing the rows per parquet file.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Jun 2, 2020 at 3:18 AM Tanuj <[email protected]> wrote:
> > >
> > > > Hi,
> > > > We have a requirement to ingest 30M records into S3 backed by HUDI. I
> > > > am figuring out the partition strategy and ending up with a lot of
> > > > partitions - 25M partitions (primary partition) --> 2.5M (secondary
> > > > partition) --> 2.5M (third partition) - and each parquet file will have
> > > > less than 10 rows of data.
> > > >
> > > > Our dataset will be ingested at once in full and then it will be
> > > > incremental daily with less than 1k updates. So it's more read heavy
> > > > than write heavy.
> > > >
> > > > So what is the suggestion in terms of HUDI performance - go ahead with
> > > > the above partition strategy, or shall I reduce my partitions and
> > > > increase the number of rows in each parquet file?
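For reference, here is a minimal sketch of the write side of the hash/bucket idea from the quoted thread above (Spark datasource with the usual Hudi write options; the input path, column names, precombine field and bucket count of 1024 are assumptions): derive a bucket column from the integer primary key and use it as the Hudi partition path, so the 25M distinct keys collapse into roughly 1024 partitions while a query on the key can still prune to a single partition.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, lit, pmod}

// Sketch of the suggestion: partition on a derived bucket of the
// integer primary key instead of on the raw key itself.
val spark = SparkSession.builder().appName("hudi-bucketed-write").getOrCreate()

val numBuckets = 1024 // hypothetical; must match whatever the query side uses

val input = spark.read.parquet("s3://my-bucket/staging/orders") // hypothetical input
  .withColumn("id_bucket", pmod(col("order_id"), lit(numBuckets)))

input.write.format("hudi")
  .option("hoodie.table.name", "orders")
  .option("hoodie.datasource.write.recordkey.field", "order_id")
  .option("hoodie.datasource.write.partitionpath.field", "id_bucket")
  .option("hoodie.datasource.write.precombine.field", "updated_at")
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/orders")

Using pmod (a plain modulo) rather than a real hash function keeps the rule trivially reproducible in Presto SQL (order_id % 1024), which sidesteps the problem of matching an opaque hash on the query side.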
