Hi folks, I haven't heard back from you in a while. I suppose something more important has come up, am I right? : )
At 2023-03-31 21:40:46, "吕虎" <lvh...@163.com> wrote:
> Hi Vinoth, I'm glad to receive your letter. Here are some of my thoughts.
>
> At 2023-03-31 10:17:52, "Vinoth Chandar" <vin...@apache.org> wrote:
>> I think we can focus more on validating the hash index + bloom filter vs consistent hash index first. Have you looked at RFC-08, which is a kind of hash index as well, except it stores the key => file group mapping externally.
>
> The idea of the RFC-08 index (rowKey -> partitionPath, fileID) is very similar to the HBase index, but the index is implemented inside Hudi, so there is no need to worry about consistency issues. The index can be written to HFiles quickly, but a read may need to consult multiple HFiles, so index-read performance can be a problem. That is why the RFC's author naturally turned to hash buckets to partially solve this problem. HBase's answer to having many HFiles is to add a min/max index and a Bloom filter index; in Hudi, min/max and Bloom filter indexes could be built directly on File Groups, eliminating the need to store the index in HFiles. Another option is to compact the HFiles, but that also adds a burden to Hudi. We need to consider HFile read performance carefully when using RFC-08.
>
> Therefore, I believe that hash partitioning + Bloom filters is still the simplest and most effective solution for predictable data growth within a small range.
>
>> On Fri, Mar 24, 2023 at 2:14 AM 吕虎 <lvh...@163.com> wrote:
>>
>>> Hi Vinoth, I am very happy to receive your reply. Here are some of my thoughts.
>>>
>>> At 2023-03-21 23:32:44, "Vinoth Chandar" <vin...@apache.org> wrote:
>>>> but when it is used for data expansion, it still involves the need to redistribute the data records of some data files, thus affecting the performance.
>>>> but expansion of the consistent hash index is an optional operation right? Sorry, not still fully understanding the differences here,
>>>
>>> I'm sorry I didn't make myself clear. The expansion I mentioned last time refers to the growth of data records in a Hudi table.
>>> The difference between the consistent hash index and hash partitioning with Bloom filter indexes lies in how they handle data growth:
>>> The consistent hash index handles growth by splitting files. Splitting files affects performance, but it keeps working indefinitely, so the consistent hash index suits scenarios where data growth cannot be estimated or the data will grow very large.
>>> Hash partitions with Bloom filter indexes handle growth by creating new files. Adding new files does not affect performance, but if there are too many files, the probability of false positives in the Bloom filters increases. So hash partitioning with Bloom filter indexes suits scenarios where data growth can be estimated within a relatively small range.
>>>
>>>> Because the hash partition field values under the parquet file in a columnar storage format are all equal, the added column field hardly occupies storage space after compression.
>>>> Any new meta field added adds other overhead in terms of evolving the schema, and so forth. Are you suggesting this is not possible to do without a new meta field?
>>>
>>> An implementation without a new meta field would be more elegant, but for me, not yet familiar with the Hudi source code, it is somewhat difficult to implement; for experts it should not be a problem.
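The Bloom-filter tradeoff discussed above can be made concrete with a small back-of-the-envelope sketch (plain Python, not Hudi code; the function name is illustrative): if each file's Bloom filter has an independent false-positive probability p, a lookup for an absent key consults every candidate file's filter, so the chance of at least one wasted file read grows as 1 - (1 - p)^n.

```python
# Illustrative sketch: how Bloom filter false positives compound as more
# files accumulate under one partition. Assumes each file's filter has an
# independent per-filter false-positive probability p.

def lookup_false_positive_rate(p: float, num_files: int) -> float:
    """Probability that a point lookup for an absent key triggers at
    least one spurious file read: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** num_files

if __name__ == "__main__":
    # With p = 1% per file, the effective rate degrades quickly:
    for n in (1, 10, 100, 1000):
        print(f"{n:>4} files -> {lookup_false_positive_rate(0.01, n):.4f}")
```

At p = 1% per file, roughly 63% of absent-key lookups read at least one useless file once a partition holds 100 files, which is why bounding the file count per partition matters.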
>>> If you want to implement it without adding new meta fields, I hope I can participate in some simple development, and I can also learn how the experts do it.
>>>
>>>> On Thu, Mar 16, 2023 at 2:22 AM 吕虎 <lvh...@163.com> wrote:
>>>>
>>>>> Hello,
>>>>> I feel very honored that you are interested in my views.
>>>>>
>>>>> Here are some of my thoughts, marked with blue font.
>>>>>
>>>>> At 2023-03-16 13:18:08, "Vinoth Chandar" <vin...@apache.org> wrote:
>>>>>
>>>>>> Thanks for the proposal! Some first set of questions here.
>>>>>>
>>>>>>> You need to pre-select the number of buckets and use the hash function to determine which bucket a record belongs to.
>>>>>>> when building the table according to the estimated amount of data, and it cannot be changed after building the table
>>>>>>> When the amount of data in a hash partition is too large, the data in that partition will be split into multiple files in the way of the Bloom index.
>>>>>>
>>>>>> All these issues related to bucket sizing could be alleviated by the consistent hashing index in 0.13? Have you checked it out? Love to hear your thoughts on this.
>>>>>
>>>>> Hash partitioning is applicable to tables whose exact data volume cannot be given but whose rough range can be estimated. For example, if a company currently has 300 million customers in the United States, it will have at most 7 billion customers worldwide. In this scenario, using hash partitioning to cope with data growth within the known range, by directly adding files and building Bloom filters, can still guarantee performance.
>>>>> The consistent hash bucket index is also very valuable, but when it is used for data expansion it still involves redistributing the data records of some data files, thus affecting performance.
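The redistribution point above can be sketched in a few lines (illustrative Python, not Hudi's consistent hashing implementation; names like `assign` and the bucket ranges are made up): when one overloaded bucket is split, only that bucket's records are rewritten, and every other bucket's records keep their file group.

```python
# Illustrative sketch of a consistent-hash bucket split: only the split
# bucket's records move; all other assignments are unchanged.
import hashlib

HASH_SPACE = 1 << 16

def h(key):
    """Stable hash of a record key into a fixed hash space."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big") % HASH_SPACE

def assign(key, ranges):
    """ranges: list of (start, end, file_group_id) covering the hash space."""
    v = h(key)
    for start, end, fg in ranges:
        if start <= v < end:
            return fg
    raise ValueError("hash space not fully covered")

# Bucket fg-0 covering [0, 2^15) is overloaded and is split in two;
# fg-1's range is untouched.
before = [(0, 1 << 15, "fg-0"), (1 << 15, HASH_SPACE, "fg-1")]
after = [(0, 1 << 14, "fg-0a"), (1 << 14, 1 << 15, "fg-0b"),
         (1 << 15, HASH_SPACE, "fg-1")]

keys = [f"key-{i}" for i in range(1000)]
moved_out_of_fg1 = sum(1 for k in keys
                       if assign(k, before) == "fg-1" and assign(k, after) != "fg-1")
print("fg-1 records rewritten by the split:", moved_out_of_fg1)  # always 0
```

The split still rewrites fg-0's data, which is the performance cost being discussed; the benefit is that the blast radius is limited to the one bucket.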
>>>>> When it is completely impossible to estimate the range of the data volume, consistent hashing is very suitable.
>>>>>
>>>>>>> you can directly search the data under the partition, which greatly reduces the scope of the Bloom filter to search for files and reduces the false positives of the Bloom filter.
>>>>>>
>>>>>> the bloom index is already partition aware and, unless you use the global version, can achieve this. Am I missing something?
>>>>>>
>>>>>> Another key thing is - if we can avoid adding a new meta field, that would be great. Is it possible to implement this similar to bucket index, based on just table properties?
>>>>>
>>>>> Adding a hash partition field to the table to implement hash partitioning reuses the existing partitioning machinery well and involves very few code changes. Because the hash partition field values within a parquet file (a columnar storage format) are all equal, the added column hardly occupies any storage space after compression.
>>>>> Of course, instead of adding a hash partition field to the table, the hash partition value could be stored in the corresponding metadata to achieve the same function, but then it would be hard to reuse the existing functionality; building hash partitions and doing partition pruning at query time would require more development and testing effort.
>>>>>
>>>>>> On Sat, Feb 18, 2023 at 8:18 PM 吕虎 <lvh...@163.com> wrote:
>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> Here is my proposal. Thank you very much for reading it. I am looking forward to your agreement to create an RFC for it.
>>>>>>> Background
>>>>>>>
>>>>>>> To deal with the problem that modifying a small amount of local data requires rewriting an entire partition's data, Hudi divides each partition into multiple File Groups, each identified by a File ID. This way, when a small amount of local data is modified, only the data of the corresponding File Group needs to be rewritten. Hudi consistently maps a given Hudi record to a File ID through its index mechanism: the mapping between a Record Key and a File Group/File ID does not change once the first version of the record is written.
>>>>>>>
>>>>>>> At present, Hudi's indexes mainly include the Bloom filter index, the HBase index, and the bucket index. The Bloom filter index has a false-positive problem: when a large amount of data results in a large number of File Groups, the false positives multiply and performance suffers. The HBase index depends on an external HBase database and may become inconsistent, which ultimately increases operations and maintenance costs. With the bucket index, each bucket corresponds to one File Group; you pre-select the number of buckets and use a hash function to determine which bucket a record belongs to, so the mapping between Record Key and File Group/File ID follows directly from the hash function. Using the bucket index, you must fix the number of buckets when creating the table, based on the estimated amount of data, and it cannot be changed after the table is built.
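The bucket-index mechanism just described can be sketched in a few lines (illustrative Python; Hudi's actual bucket index lives in its Java/Scala code, and `NUM_BUCKETS` stands in for the table's configured bucket count):

```python
# Minimal sketch of the bucket-index idea: with a fixed bucket count,
# the record key alone determines the file group, so no external
# key-to-file-group lookup structure is needed.
import zlib

NUM_BUCKETS = 8  # chosen at table creation time; cannot change afterwards

def file_group_of(record_key):
    """Record Key -> File Group follows directly from the hash function."""
    return "fg-%d" % (zlib.crc32(record_key.encode("utf-8")) % NUM_BUCKETS)

# The mapping is stable: the same key always lands in the same file
# group, but every future record must live with this bucket count.
print(file_group_of("order-2023-0001"))
```

The strength (no index lookup at all) and the weakness (the bucket count is baked in forever) both fall directly out of this one modulo operation.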
>>>>>>> An unreasonable number of buckets will seriously affect performance, and unfortunately the amount of data is often unpredictable and keeps growing.
>>>>>>>
>>>>>>> Hash partition feasibility
>>>>>>>
>>>>>>> In this context, I put forward the idea of hash partitioning. The principle is similar to the bucket index, but in addition to the advantages of the bucket index, hash partitioning can retain the Bloom index. When the amount of data in a hash partition grows too large, the data in that partition is split into multiple files in the way of the Bloom index, so the bucket index's heavy dependence on the bucket count does not exist for hash partitions. Compared with the Bloom index, when searching for a record you can go directly to the data under its partition, which greatly narrows the set of files the Bloom filters must cover and so reduces Bloom filter false positives.
>>>>>>>
>>>>>>> Design of a simple hash partition implementation
>>>>>>>
>>>>>>> The idea is to use the capabilities of the ComplexKeyGenerator to implement hash partitioning: the hash partition field is one of the partition fields of the ComplexKeyGenerator.
>>>>>>>
>>>>>>> When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition will be added to the table as one of the partition keys.
>>>>>>>
>>>>>>> If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate will be automatically added to the query statement for partition pruning.
>>>>>>>
>>>>>>> Advantages of this design: simple implementation with no modification of core functions, so low risk.
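The write/read flow just described can be sketched as follows (plain Python; `hash_partition`, `rewrite_query`, and `NUM_HASH_PARTITIONS` are made-up illustrative names, and in the proposal the real work is done by the ComplexKeyGenerator and the query planner): on write, a `_hoodie_hash_partition` value is derived from the configured field; on read, an equality filter on that field lets the engine add the matching partition predicate for pruning.

```python
# Hedged sketch of hash-partition derivation plus automatic predicate
# injection for partition pruning; not Hudi's actual implementation.
import zlib

NUM_HASH_PARTITIONS = 16  # illustrative; a real count would be a table property

def hash_partition(field_value):
    """Value written into the _hoodie_hash_partition column on ingest."""
    return zlib.crc32(field_value.encode("utf-8")) % NUM_HASH_PARTITIONS

def rewrite_query(user_id):
    """Simulate the automatic predicate injection: a filter on the
    hash.partition.fields column gains a _hoodie_hash_partition = X
    conjunct, so only one partition's files are scanned."""
    x = hash_partition(user_id)
    return ("SELECT * FROM t WHERE user_id = '%s' "
            "AND _hoodie_hash_partition = %d" % (user_id, x))

print(rewrite_query("customer-42"))
```

Because the injected predicate is an ordinary partition filter, the existing partition-pruning path does the rest, which is what keeps the implementation low risk.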
>>>>>>> The above design has been implemented in PR 7984:
>>>>>>>
>>>>>>> https://github.com/apache/hudi/pull/7984
>>>>>>>
>>>>>>> How to use hash partitioning with the Spark data source is shown in
>>>>>>>
>>>>>>> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>>>>>>>
>>>>>>> #testHashPartition
>>>>>>>
>>>>>>> Perhaps for experts the implementation in the PR is not elegant enough. I also look forward to a more elegant implementation, from which I can learn more.
>>>>>>>
>>>>>>> Problems with hash partitioning:
>>>>>>>
>>>>>>> Because SparkSQL does not know about the hash partitions, when two hash-partitioned tables are joined it may not produce the optimal execution plan. However, because Hudi can still prune each table individually, join performance can still be greatly improved compared with no hash partitioning.
>>>>>>>
>>>>>>> The same problem exists for GROUP BY: SparkSQL does not know that aggregation has already been completed within each hash partition, so it may aggregate the per-partition results again. However, since the data volume after per-partition aggregation is very small, the redundant aggregation has little impact on overall performance.
>>>>>>>
>>>>>>> I'm not sure whether bucket indexes have the same problem as hash partitions.
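The GROUP BY observation above can be sketched as follows (illustrative Python, not Spark internals): because the table is hash-partitioned on the grouping key, each key lives in exactly one partition, so the engine's redundant merge pass touches only one small partial result per key.

```python
# Illustrative two-phase aggregation over a table hash-partitioned on
# the grouping key. Phase 1 is already final per key; phase 2 is the
# engine's redundant merge, cheap because partials are tiny.
import zlib
from collections import Counter

NUM_PARTITIONS = 4

def part(key):
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

rows = [("blue", 1), ("red", 1), ("blue", 1), ("green", 1), ("red", 1)]

# Phase 1: aggregate inside each hash partition; no key spans partitions.
partials = [Counter() for _ in range(NUM_PARTITIONS)]
for key, val in rows:
    partials[part(key)][key] += val

# Phase 2: the redundant merge SparkSQL performs anyway; each key
# appears in exactly one partial, so this only renames the result.
merged = Counter()
for p in partials:
    merged.update(p)

assert merged == Counter({"blue": 2, "red": 2, "green": 1})
```

This is why the email argues the wasted work is tolerable: the second pass operates on per-key partials, not on the raw rows.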
>>>>>>> At 2023-02-17 05:18:19, "Y Ethan Guo" <yi...@apache.org> wrote:
>>>>>>>> +1 Thanks Lvhu for bringing up the idea. As Alexey suggested, it would be good for you to write down the proposal with design details for discussion in the community.
>>>>>>>>
>>>>>>>> On Thu, Feb 16, 2023 at 11:28 AM Alexey Kudinkin <ale...@onehouse.ai> wrote:
>>>>>>>>
>>>>>>>>> Thanks for your contribution, Lvhu!
>>>>>>>>>
>>>>>>>>> I think we should actually kick-start this effort with a small RFC outlining the proposed changes first, as this is modifying the core read flow for all Hudi tables and we want to make sure our approach there is rock-solid.
>>>>>>>>>
>>>>>>>>> On Thu, Feb 16, 2023 at 6:34 AM 吕虎 <lvh...@163.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi folks,
>>>>>>>>>> PR 7984 [ https://github.com/apache/hudi/pull/7984 ] implements hash partitioning.
>>>>>>>>>> As you know, it is often difficult to find an appropriate partition key in existing big data. Hash partitioning easily solves this problem and can greatly improve the performance of Hudi's big-data processing.
>>>>>>>>>> The idea is to use the hash partition field as one of the partition fields of the ComplexKeyGenerator, so this PR does not involve modifying the logic of core code.
>>>>>>>>>> The code is easy to review, but I think hash partitioning is very useful. We really need it.
>>>>>>>>>> How to use hash partitioning with the Spark data source is shown in
>>>>>>>>>> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>>>>>>>>>> #testHashPartition
>>>>>>>>>>
>>>>>>>>>> There is no public API or user-facing feature change, and no performance impact, if the hash partition parameters are not specified.
>>>>>>>>>>
>>>>>>>>>> When hash.partition.fields is specified and partition.fields contains _hoodie_hash_partition, a column named _hoodie_hash_partition will be added to the table as one of the partition keys.
>>>>>>>>>>
>>>>>>>>>> If predicates on hash.partition.fields appear in the query statement, the _hoodie_hash_partition = X predicate will be automatically added to the query statement for partition pruning.
>>>>>>>>>>
>>>>>>>>>> Hope folks can help and review!
>>>>>>>>>> Thanks!
>>>>>>>>>> Lvhu