Hi,

We already support a user-defined custom partitioner for bulk insert. So you
can actually control it whichever way you like, for the initial load.

Thanks
Vinoth
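(For reference, a minimal sketch of what such a user-defined partitioner
might look like. It assumes Hudi's BulkInsertPartitioner interface and the
hoodie.bulkinsert.user.defined.partitioner.class write config, whose exact
names and signatures can vary across releases; the choice of sort key is
purely illustrative.)

    import org.apache.hudi.common.model.HoodieRecord;
    import org.apache.hudi.table.BulkInsertPartitioner;
    import org.apache.spark.api.java.JavaRDD;

    // Illustrative partitioner: globally sort incoming records before the
    // initial bulk insert, so the first write already has good locality.
    public class SortedBulkInsertPartitioner
        implements BulkInsertPartitioner<JavaRDD<HoodieRecord>> {

      @Override
      public JavaRDD<HoodieRecord> repartitionRecords(
          JavaRDD<HoodieRecord> records, int outputSparkPartitions) {
        // Sorting by record key here for simplicity; a real implementation
        // could sort by a date or id column extracted from the payload.
        return records.sortBy(r -> r.getRecordKey(), true, outputSparkPartitions);
      }

      @Override
      public boolean arePartitionRecordsSorted() {
        // Signals to the writer that records arrive sorted within each Spark
        // partition, so it can roll to new file groups as size limits are hit.
        return true;
      }
    }

Wiring it in is then just a matter of passing the class name through the
write options on the initial load.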
On Tue, Feb 9, 2021 at 5:36 PM Rubens Rodrigues <rubenssoto2...@gmail.com>
wrote:

> Hi guys,
>
> Talking about my use case...
>
> I have datasets where ordering data by date makes a lot of sense, or
> ordering by some id to have fewer touched files on merge operations.
> In my use of Delta Lake I used to bootstrap tables always ordering by one
> of these fields, and it helps a lot with file pruning.
>
> Hudi clustering does this job, but I think it is an unnecessary extra
> step after bulk insert, because all the data will need to be rewritten
> again.
>
> On Tue, Feb 9, 2021 at 9:53 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Hi Satish,
> >
> > Been meaning to respond to this. I think I like the idea overall.
> >
> > Here's (hopefully) my understanding of it; let me know if I am getting
> > this right.
> >
> > Predominantly, we are just talking about the problem of where we send
> > the "inserts".
> >
> > Today the upsert partitioner does the file sizing/bin-packing etc. for
> > inserts, and then sends some inserts over to existing file groups to
> > maintain file size. We can abstract all of this into strategies and
> > some kind of pipeline abstraction, and have it also consider "affinity"
> > to an existing file group based on, say, information stored in the
> > metadata table?
> >
> > I think this is complementary to what we do today and can be helpful.
> > The first thing may be to abstract the existing write pipeline as a
> > series of "optimization" stages and bring things like file sizing under
> > that.
> >
> > On bucketing, I am not against Hive bucketing or anything. But with
> > record level indexes and the granular/micro partitions that we can
> > achieve using clustering, is it still the most efficient design? That's
> > a question I would love to find answers for. I never liked the
> > static/hash partitioning based schemes in bucketing; they introduce a
> > lot of manual data munging if things change.
> >
> > Thanks
> > Vinoth
> >
> > On Wed, Feb 3, 2021 at 5:17 PM Satish Kotha
> > <satishko...@uber.com.invalid> wrote:
> >
> > > I got some feedback that this thread may be a bit complex to
> > > understand. So I tried to simplify the proposal to the below:
> > >
> > > Users can already specify 'partitionpath' using this config
> > > <https://hudi.apache.org/docs/configurations.html#PARTITIONPATH_FIELD_OPT_KEY>
> > > when writing data. My proposal is that we also give users the ability
> > > to identify (or hint at) the 'fileId' while writing the data. For
> > > example, users can say 'locality.columns: session_id'. We
> > > deterministically map every session_id to a specific fileGroup in
> > > hudi (using hash-modulo or range-partitioning etc.). So all records
> > > for a given session_id are co-located in the same data/log file.
> > >
> > > Hopefully, this explains the idea better. Appreciate any feedback.
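(As an aside, the mapping Satish describes above reduces to a few lines.
This sketch uses hypothetical names, LocalityFileGroupMapper and fileIdFor,
and plain hashCode as a stand-in for whatever stable hash an implementation
would actually use.)

    import java.util.List;

    // Illustrative only: route every record with the same session_id to
    // the same file group, so its data/log files are co-located.
    public class LocalityFileGroupMapper {

      private final List<String> fileIdList; // the partition's file group ids

      public LocalityFileGroupMapper(List<String> fileIdList) {
        this.fileIdList = fileIdList;
      }

      public String fileIdFor(String sessionId) {
        // floorMod keeps the index non-negative even for negative hash codes.
        int bucket = Math.floorMod(sessionId.hashCode(), fileIdList.size());
        return fileIdList.get(bucket);
      }
    }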
> > > On Mon, Feb 1, 2021 at 3:43 PM Satish Kotha <satishko...@uber.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > Clustering <https://hudi.apache.org/blog/hudi-clustering-intro/> is
> > > > a great feature for improving data locality. But it has a
> > > > (relatively big) cost to rewrite the data after ingestion. I think
> > > > there are other ways to improve data locality during ingestion. For
> > > > example, we can add a new Index (or partitioner) that reads values
> > > > for columns that are important from a data locality perspective. We
> > > > could then compute a hash modulo on the value and use that to
> > > > deterministically identify the file group that the record has to be
> > > > written into.
> > > >
> > > > More detailed example:
> > > > Assume we introduce 2 new configs:
> > > > hoodie.datasource.write.num.file.groups: "N" # Controls the total
> > > > number of fileIds allowed (per partition).
> > > >
> > > > hoodie.datasource.write.locality.columns: "session_id,timestamp"
> > > > # Identifies columns that are important for data locality.
> > > >
> > > > (I can come up with better names for the configs if the general
> > > > idea sounds good.)
> > > >
> > > > During ingestion, we generate 'N' fileIds for each partition (if
> > > > that partition already has K fileIds, we generate N-K new fileIds).
> > > > Let's say these fileIds are stored in a fileIdList data structure.
> > > > For each row, we compute
> > > > 'hash(row.get(session_id) + row.get(timestamp)) % N'. This value is
> > > > used as the index into the fileIdList data structure to
> > > > deterministically identify the file group for the row.
> > > >
> > > > This improves data locality by ensuring that rows with a given
> > > > column value are stored in the same file. This hashing could be
> > > > done in two places:
> > > > 1) A custom index that tags the location of each row based on
> > > > values for 'session_id+timestamp'.
> > > > 2) A new partitioner that assigns buckets to each row based on
> > > > values for 'session_id+timestamp'.
> > > >
> > > > *Advantages:*
> > > > 1) No need to rewrite data to improve data locality.
> > > > 2) Integrates well with Hive bucketing (Spark is also adding
> > > > support for Hive bucketing
> > > > <https://issues.apache.org/jira/browse/SPARK-19256>).
> > > > 3) This reduces scan cycles to find a particular key, because it
> > > > ensures that the key is present in a certain fileId. Similarly,
> > > > joins across multiple tables would be efficient if they both choose
> > > > the same 'locality.columns'.
> > > >
> > > > *Disadvantages:*
> > > > 1) Users need to know the total number of file groups to generate
> > > > per partition. This value is assumed to be static for all
> > > > partitions. So if significant changes are expected in traffic
> > > > volume, this may not partition the data well. (We could also
> > > > consider making this configurable per partition, which adds
> > > > complexity but is feasible to do.)
> > > > 2) This may not be as efficient as clustering. For example, data
> > > > for a given column value is guaranteed to be co-located in the same
> > > > file, but it may not be in the same block (row group in parquet).
> > > > So more blocks need to be read by query engines.
> > > >
> > > > Clustering can still be useful for other use cases such as
> > > > stitching files, transforming data for efficiency, etc. Clustering
> > > > can also be useful for a few sorting scenarios - e.g., if users
> > > > cannot predict a good value for the number of file groups needed.
> > > >
> > > > Appreciate any feedback. Let me know if you have other ideas on
> > > > improving data locality. If you are interested in this idea and
> > > > want to collaborate, please reach out.
> > > >
> > > > Thanks
> > > > Satish
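(To make the detailed example above concrete, here is a small sketch of the
two pieces it describes: topping a partition up to N file groups, and the
'hash(...) % N' bucket computation. Class and method names are hypothetical,
and Objects.hash stands in for the unspecified hash function.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Objects;
    import java.util.UUID;

    public class LocalityBucketing {

      // If the partition already has K fileIds, generate N-K new ones so
      // it ends up with exactly N.
      static List<String> ensureFileIds(List<String> existing, int n) {
        List<String> fileIds = new ArrayList<>(existing);
        while (fileIds.size() < n) {
          fileIds.add(UUID.randomUUID().toString());
        }
        return fileIds;
      }

      // 'hash(row.get(session_id) + row.get(timestamp)) % N' from the
      // example; the result indexes into fileIdList to pick the row's
      // file group.
      static int bucketFor(Object sessionId, Object timestamp, int n) {
        return Math.floorMod(Objects.hash(sessionId, timestamp), n);
      }
    }

Either the custom index or the new partitioner would evaluate bucketFor per
row; the determinism is what guarantees co-location across commits.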