Hi folks, I haven't heard back from you in a while. I suppose something more important has come up, am I right? : )
At 2023-03-31 21:40:46, "吕虎" <lvh...@163.com> wrote:
> Hi Vinoth, I'm glad to receive your letter. Here are some of my thoughts.
>
> At 2023-03-31 10:17:52, "Vinoth Chandar" <vin...@apache.org> wrote:
>> I think we can focus more on validating the hash index + bloom filter vs consistent hash index first. Have you looked at RFC-08, which is a kind of hash index as well, except it stores the key => file group mapping externally.
>
> The idea of the RFC-08 index (rowKey -> partitionPath, fileID) is very similar to the HBase index, but the index is implemented inside Hudi, so there is no need to worry about consistency issues. The index can be written to HFiles quickly, but a read may need to consult multiple HFiles, so index-read performance can be a problem. That is why the RFC's author naturally turned to hash buckets to partially solve this problem. HBase's answer to having many HFiles is to add a min/max index and a Bloom filter index; in Hudi, min/max and Bloom filter indexes could be built directly on File Groups, eliminating the need to store the index in HFiles. Another option is to compact the HFiles, but that also adds a burden to Hudi. We need to consider HFile read performance carefully when using RFC-08.
>
> Therefore, I believe that hash partitioning + Bloom filters is still the simplest and most effective solution for predictable data growth within a small range.
>
>> On Fri, Mar 24, 2023 at 2:14 AM 吕虎 <lvh...@163.com> wrote:
>>
>>> Hi Vinoth, I am very happy to receive your reply. Here are some of my thoughts.
>>>
>>> At 2023-03-21 23:32:44, "Vinoth Chandar" <vin...@apache.org> wrote:
>>>> but when it is used for data expansion, it still involves the need to redistribute the data records of some data files, thus affecting the performance.
>>>> but expansion of the consistent hash index is an optional operation right? Sorry, not still fully understanding the differences here,
>>>
>>> I'm sorry I didn't make myself clear. The expansion I mentioned last time refers to the growth of data records in a Hudi table.
>>> The difference between the consistent hash index and hash partitioning with Bloom filter indexes lies in how they handle data growth:
>>> The consistent hash index handles growth by splitting files. Splitting files affects performance, but it keeps working indefinitely, so the consistent hash index suits scenarios where data growth cannot be estimated or the data will grow very large.
>>> Hash partitions with Bloom filter indexes handle growth by creating new files. Adding new files does not affect performance, but if there are too many files, the probability of false positives in the Bloom filters increases. So hash partitioning with Bloom filter indexes suits scenarios where data growth can be estimated within a relatively small range.
>>>
>>>> Because the hash partition field values under the parquet file in a columnar storage format are all equal, the added column field hardly occupies storage space after compression.
>>>> Any new meta field added adds other overhead in terms of evolving the schema, and so forth. Are you suggesting this is not possible to do without a new meta field?
>>>
>>> An implementation without a new meta field would be more elegant, but for me, not yet familiar with the Hudi source code, it is somewhat difficult to implement; for experts it should not be a problem.
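The Bloom-filter tradeoff discussed above can be made concrete with a small back-of-the-envelope sketch (plain Python, not Hudi code; the function name is illustrative): if each file's Bloom filter has an independent false-positive probability p, a lookup for an absent key consults every candidate file's filter, so the chance of at least one wasted file read grows as 1 - (1 - p)^n.

```python
# Illustrative sketch: how Bloom filter false positives compound as more
# files accumulate under one partition. Assumes each file's filter has an
# independent per-filter false-positive probability p.

def lookup_false_positive_rate(p: float, num_files: int) -> float:
    """Probability that a point lookup for an absent key triggers at
    least one spurious file read: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** num_files

if __name__ == "__main__":
    # With p = 1% per file, the effective rate degrades quickly:
    for n in (1, 10, 100, 1000):
        print(f"{n:>4} files -> {lookup_false_positive_rate(0.01, n):.4f}")
```

At p = 1% per file, roughly 63% of absent-key lookups read at least one useless file once a partition holds 100 files, which is why bounding the file count per partition matters.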
>>> If you want to implement it without adding new meta fields, I hope I can participate in some simple development, and I can also learn how the experts do it.
>>>
>>>> On Thu, Mar 16, 2023 at 2:22 AM 吕虎 <lvh...@163.com> wrote:
>>>>
>>>>> Hello,
>>>>> I feel very honored that you are interested in my views.
>>>>>
>>>>> Here are some of my thoughts, marked with blue font.
>>>>>
>>>>> At 2023-03-16 13:18:08, "Vinoth Chandar" <vin...@apache.org> wrote:
>>>>>
>>>>>> Thanks for the proposal! Some first set of questions here.
>>>>>>
>>>>>>> You need to pre-select the number of buckets and use the hash function to determine which bucket a record belongs to.
>>>>>>> when building the table according to the estimated amount of data, and it cannot be changed after building the table
>>>>>>> When the amount of data in a hash partition is too large, the data in that partition will be split into multiple files in the way of the Bloom index.
>>>>>>
>>>>>> All these issues related to bucket sizing could be alleviated by the consistent hashing index in 0.13? Have you checked it out? Love to hear your thoughts on this.
>>>>>
>>>>> Hash partitioning is applicable to tables whose exact data volume cannot be given but whose rough range can be estimated. For example, if a company currently has 300 million customers in the United States, it will have at most 7 billion customers worldwide. In this scenario, using hash partitioning to cope with data growth within the known range, by directly adding files and building Bloom filters, can still guarantee performance.
>>>>> The consistent hash bucket index is also very valuable, but when it is used for data expansion it still involves redistributing the data records of some data files, thus affecting performance.
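The redistribution point above can be sketched in a few lines (illustrative Python, not Hudi's consistent hashing implementation; names like `assign` and the bucket ranges are made up): when one overloaded bucket is split, only that bucket's records are rewritten, and every other bucket's records keep their file group.

```python
# Illustrative sketch of a consistent-hash bucket split: only the split
# bucket's records move; all other assignments are unchanged.
import hashlib

HASH_SPACE = 1 << 16

def h(key):
    """Stable hash of a record key into a fixed hash space."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % HASH_SPACE

def assign(key, ranges):
    """ranges: list of (start, end, file_group_id) covering the hash space."""
    v = h(key)
    for start, end, fg in ranges:
        if start <= v < end:
            return fg
    raise ValueError("hash space not fully covered")

# Bucket fg-0 covering [0, 2^15) is overloaded and is split in two;
# fg-1's range is untouched.
before = [(0, 1 << 15, "fg-0"), (1 << 15, HASH_SPACE, "fg-1")]
after = [(0, 1 << 14, "fg-0a"), (1 << 14, 1 << 15, "fg-0b"),
         (1 << 15, HASH_SPACE, "fg-1")]

keys = [f"key-{i}" for i in range(1000)]
moved_out_of_fg1 = sum(1 for k in keys
                       if assign(k, before) == "fg-1" and assign(k, after) != "fg-1")
print("fg-1 records rewritten by the split:", moved_out_of_fg1)  # always 0
```

The split still rewrites fg-0's data, which is the performance cost being discussed; the benefit is that the blast radius is limited to the one bucket.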
>>>>> When it is completely impossible to estimate the range of the data volume, consistent hashing is very suitable.
>>>>>
>>>>>>> you can directly search the data under the partition, which greatly reduces the scope of the Bloom filter to search for files and reduces the false positives of the Bloom filter.
>>>>>>
>>>>>> the bloom index is already partition aware and, unless you use the global version, can achieve this. Am I missing something?
>>>>>>
>>>>>> Another key thing is - if we can avoid adding a new meta field, that would be great. Is it possible to implement this similar to bucket index, based on just table properties?
>>>>>
>>>>> Adding a hash partition field to the table to implement hash partitioning reuses the existing partitioning machinery well and involves very few code changes. Because the hash partition field values within a parquet file (a columnar storage format) are all equal, the added column hardly occupies any storage space after compression.
>>>>> Of course, instead of adding a hash partition field to the table, the hash partition value could be stored in the corresponding metadata to achieve the same function, but then it would be hard to reuse the existing functionality; building hash partitions and doing partition pruning at query time would require more development and testing effort.
>>>>>
>>>>>> On Sat, Feb 18, 2023 at 8:18 PM 吕虎 <lvh...@163.com> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> Here is my proposal. Thank you very much for reading it. I am looking forward to your agreement to create an RFC for it.
>>>>>>> Background
>>>>>>>
>>>>>>> To deal with the problem that modifying a small amount of local data requires rewriting an entire partition's data, Hudi divides each partition into multiple File Groups, each identified by a File ID. This way, when a small amount of local data is modified, only the data of the corresponding File Group needs to be rewritten. Hudi consistently maps a given Hudi record to a File ID through its index mechanism: the mapping between a Record Key and a File Group/File ID does not change once the first version of the record is written.
>>>>>>>
>>>>>>> At present, Hudi's indexes mainly include the Bloom filter index, the HBase index, and the bucket index. The Bloom filter index has a false-positive problem: when a large amount of data results in a large number of File Groups, the false positives multiply and performance suffers. The HBase index depends on an external HBase database and may become inconsistent, which ultimately increases operations and maintenance costs. With the bucket index, each bucket corresponds to one File Group; you pre-select the number of buckets and use a hash function to determine which bucket a record belongs to, so the mapping between Record Key and File Group/File ID follows directly from the hash function. Using the bucket index, you must fix the number of buckets when creating the table, based on the estimated amount of data, and it cannot be changed after the table is built.
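The bucket-index mechanism just described can be sketched in a few lines (illustrative Python; Hudi's actual bucket index lives in its Java/Scala code, and `NUM_BUCKETS` stands in for the table's configured bucket count):

```python
# Minimal sketch of the bucket-index idea: with a fixed bucket count,
# the record key alone determines the file group, so no external
# key-to-file-group lookup structure is needed.
import zlib

NUM_BUCKETS = 8  # chosen at table creation time; cannot change afterwards

def file_group_of(record_key):
    """Record Key -> File Group follows directly from the hash function."""
    return "fg-%d" % (zlib.crc32(record_key.encode("utf-8")) % NUM_BUCKETS)

# The mapping is stable: the same key always lands in the same file
# group, but every future record must live with this bucket count.
print(file_group_of("order-2023-0001"))
```

The strength (no index lookup at all) and the weakness (the bucket count is baked in forever) both fall directly out of this one modulo operation.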
>>>>>>> An unreasonable number of buckets will seriously affect performance, and unfortunately the amount of data is often unpredictable and keeps growing.
>>>>>>>
>>>>>>> Hash partition feasibility
>>>>>>>
>>>>>>> In this context, I put forward the idea of hash partitioning. The principle is similar to the bucket index, but in addition to the advantages of the bucket index, hash partitioning can retain the Bloom index. When the amount of data in a hash partition grows too large, the data in that partition is split into multiple files in the way of the Bloom index, so the bucket index's heavy dependence on the bucket count does not exist for hash partitions. Compared with the Bloom index, when searching for a record you can go directly to the data under its partition, which greatly narrows the set of files the Bloom filters must cover and so reduces Bloom filter false positives.
>>>>>>>
>>>>>>> Design of a simple hash partition implementation
>>>>>>>
>>>>>>> The idea is to use the capabilities of the ComplexKeyGenerator to implement hash partitioning: the hash partition field is one of the partition fields of the ComplexKeyGenerator.
>>>>>>>
>>>>>>> When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition will be added to the table as one of the partition keys.
>>>>>>>
>>>>>>> If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate will be automatically added to the query statement for partition pruning.
>>>>>>>
>>>>>>> Advantages of this design: simple implementation with no modification of core functions, so low risk.
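The write/read flow just described can be sketched as follows (plain Python; `hash_partition`, `rewrite_query`, and `NUM_HASH_PARTITIONS` are made-up illustrative names, and in the proposal the real work is done by the ComplexKeyGenerator and the query planner): on write, a `_hoodie_hash_partition` value is derived from the configured field; on read, an equality filter on that field lets the engine add the matching partition predicate for pruning.

```python
# Hedged sketch of hash-partition derivation plus automatic predicate
# injection for partition pruning; not Hudi's actual implementation.
import zlib

NUM_HASH_PARTITIONS = 16  # illustrative; a real count would be a table property

def hash_partition(field_value):
    """Value written into the _hoodie_hash_partition column on ingest."""
    return zlib.crc32(field_value.encode("utf-8")) % NUM_HASH_PARTITIONS

def rewrite_query(user_id):
    """Simulate the automatic predicate injection: a filter on the
    hash.partition.fields column gains a _hoodie_hash_partition = X
    conjunct, so only one partition's files are scanned."""
    x = hash_partition(user_id)
    return ("SELECT * FROM t WHERE user_id = '%s' "
            "AND _hoodie_hash_partition = %d" % (user_id, x))

print(rewrite_query("customer-42"))
```

Because the injected predicate is an ordinary partition filter, the existing partition-pruning path does the rest, which is what keeps the implementation low risk.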
>>>>>>> The above design has been implemented in PR 7984:
>>>>>>>
>>>>>>> https://github.com/apache/hudi/pull/7984
>>>>>>>
>>>>>>> How to use hash partitioning with the Spark data source is shown in
>>>>>>>
>>>>>>> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>>>>>>>
>>>>>>> #testHashPartition
>>>>>>>
>>>>>>> Perhaps for experts the implementation in the PR is not elegant enough. I also look forward to a more elegant implementation, from which I can learn more.
>>>>>>>
>>>>>>> Problems with hash partitioning:
>>>>>>>
>>>>>>> Because SparkSQL does not know about the hash partitions, when two hash-partitioned tables are joined it may not produce the optimal execution plan. However, because Hudi can still prune each table individually, join performance can still be greatly improved compared with no hash partitioning.
>>>>>>>
>>>>>>> The same problem exists for GROUP BY: SparkSQL does not know that aggregation has already been completed within each hash partition, so it may aggregate the per-partition results again. However, since the data volume after per-partition aggregation is very small, the redundant aggregation has little impact on overall performance.
>>>>>>>
>>>>>>> I'm not sure whether bucket indexes have the same problem as hash partitions.
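The GROUP BY observation above can be sketched as follows (illustrative Python, not Spark internals): because the table is hash-partitioned on the grouping key, each key lives in exactly one partition, so the engine's redundant merge pass touches only one small partial result per key.

```python
# Illustrative two-phase aggregation over a table hash-partitioned on
# the grouping key. Phase 1 is already final per key; phase 2 is the
# engine's redundant merge, cheap because partials are tiny.
import zlib
from collections import Counter

NUM_PARTITIONS = 4

def part(key):
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

rows = [("blue", 1), ("red", 1), ("blue", 1), ("green", 1), ("red", 1)]

# Phase 1: aggregate inside each hash partition; no key spans partitions.
partials = [Counter() for _ in range(NUM_PARTITIONS)]
for key, val in rows:
    partials[part(key)][key] += val

# Phase 2: the redundant merge SparkSQL performs anyway; each key
# appears in exactly one partial, so this only renames the result.
merged = Counter()
for p in partials:
    merged.update(p)

assert merged == Counter({"blue": 2, "red": 2, "green": 1})
```

This is why the email argues the wasted work is tolerable: the second pass operates on per-key partials, not on the raw rows.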
>>>>>>> At 2023-02-17 05:18:19, "Y Ethan Guo" <yi...@apache.org> wrote:
>>>>>>>> +1 Thanks Lvhu for bringing up the idea. As Alexey suggested, it would be good for you to write down the proposal with design details for discussion in the community.
>>>>>>>>
>>>>>>>> On Thu, Feb 16, 2023 at 11:28 AM Alexey Kudinkin <ale...@onehouse.ai> wrote:
>>>>>>>>
>>>>>>>>> Thanks for your contribution, Lvhu!
>>>>>>>>>
>>>>>>>>> I think we should actually kick-start this effort with a small RFC outlining the proposed changes first, as this is modifying the core read flow for all Hudi tables and we want to make sure our approach there is rock-solid.
>>>>>>>>>
>>>>>>>>> On Thu, Feb 16, 2023 at 6:34 AM 吕虎 <lvh...@163.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi folks,
>>>>>>>>>> PR 7984 [ https://github.com/apache/hudi/pull/7984 ] implements hash partitioning.
>>>>>>>>>> As you know, it is often difficult to find an appropriate partition key in existing big data. Hash partitioning easily solves this problem and can greatly improve the performance of Hudi's big-data processing.
>>>>>>>>>> The idea is to use the hash partition field as one of the partition fields of the ComplexKeyGenerator, so this PR does not involve modifying the logic of core code.
>>>>>>>>>> The code is easy to review, but I think hash partitioning is very useful. We really need it.
>>>>>>>>>> How to use hash partitioning with the Spark data source is shown in
>>>>>>>>>> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>>>>>>>>>> #testHashPartition
>>>>>>>>>>
>>>>>>>>>> There is no public API or user-facing feature change, and no performance impact, if the hash partition parameters are not specified.
>>>>>>>>>>
>>>>>>>>>> When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition will be added to the table as one of the partition keys.
>>>>>>>>>>
>>>>>>>>>> If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate will be automatically added to the query statement for partition pruning.
>>>>>>>>>>
>>>>>>>>>> Hope folks can help and review!
>>>>>>>>>> Thanks!
>>>>>>>>>> Lvhu