Re: Re: DISCUSS

2023-03-21 Thread Vinoth Chandar
>but when it is used for data expansion, it still involves the need to
redistribute the data records of some data files, thus affecting the
performance.
But expansion of the consistent hash index is an optional operation, right?
Sorry, still not fully understanding the differences here.
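
To make the trade-off in this exchange concrete, here is a minimal Python sketch, not Hudi's implementation. It compares fixed modulo bucketing with a toy consistent-hash ring when one bucket is added. The names `stable_hash`, `modulo_bucket`, and `HashRing`, the MD5-based hash, and the ring points are all assumptions made up for illustration.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 32-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


def modulo_bucket(key: str, num_buckets: int) -> int:
    """Fixed-size bucketing: bucket = hash(key) mod num_buckets."""
    return stable_hash(key) % num_buckets


class HashRing:
    """Toy consistent-hash ring: each bucket owns the hash range ending at its point."""

    def __init__(self, points):
        self.points = sorted(points)

    def bucket_for(self, key: str) -> int:
        # First ring point >= hash; wrap around to the first point past the end.
        idx = bisect.bisect_left(self.points, stable_hash(key))
        return self.points[idx % len(self.points)]

    def split(self, new_point: int) -> None:
        # Adding a point splits exactly one existing range in two.
        bisect.insort(self.points, new_point)


keys = [f"record-{i}" for i in range(10_000)]

# Modulo bucketing: growing from 4 to 5 buckets.
moved_modulo = sum(modulo_bucket(k, 4) != modulo_bucket(k, 5) for k in keys)

# Consistent hashing: four evenly spaced points, then one split.
ring = HashRing([2**30, 2**31, 3 * 2**30, 2**32 - 1])
before = {k: ring.bucket_for(k) for k in keys}
ring.split(2**29)
moved_ring = sum(before[k] != ring.bucket_for(k) for k in keys)
owners_of_moved = {before[k] for k in keys if before[k] != ring.bucket_for(k)}

print("moved with modulo bucketing:", moved_modulo)
print("moved with ring split:", moved_ring, "from buckets", owners_of_moved)
```

With uniformly distributed hashes, growing from 4 to 5 modulo buckets relocates about four in five keys, while the single split on the ring relocates only keys from the one range being split (about one in eight here). So consistent hashing still redistributes some records on expansion, but only those in the affected bucket.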

>Because the hash partition field values under the parquet file in a
columnar storage format are all equal, the added column field hardly
occupies storage space after compression.
Any new meta field adds other overhead in terms of evolving the schema, and
so forth. Are you suggesting this is not possible to do without a new meta
field?
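
The storage claim quoted above can be illustrated with a small Python sketch of run-length encoding, one of the encodings columnar formats such as Parquet rely on. This is a simplified illustration under that assumption, not Parquet's actual encoder; `run_length_encode` and the column contents are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Hypothetical hash-partition column: every row in one file has the same value.
partition_column = [17] * 100_000
# A column whose value changes on every row, for contrast.
record_key_column = list(range(100_000))

constant_runs = run_length_encode(partition_column)
varying_runs = run_length_encode(record_key_column)

print("runs in constant column:", len(constant_runs))
print("runs in varying column:", len(varying_runs))
```

The constant column collapses to a single (value, count) pair, whereas the varying column keeps one entry per row. This speaks only to storage size, not to the schema-evolution overhead raised above.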

On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:

> Hello,
>  I feel very honored that you are interested in my views.
>
>  Here are some of my thoughts marked with blue font.
>
> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>
> >Thanks for the proposal! Some first set of questions here.
> >
> >>You need to pre-select the number of buckets and use the hash function to
> >determine which bucket a record belongs to.
> >>when building the table according to the estimated amount of data, and it
> >cannot be changed after building the table
> >>When the amount of data in a hash partition is too large, the data in that
> >partition will be split into multiple files in the way of Bloom index.
> >
> >All these issues related to bucket sizing could be alleviated by the
> >consistent hashing index in 0.13? Have you checked it out? Love to hear
> >your thoughts on this.
>
> Hash partitioning is applicable to data tables that cannot give the exact
> capacity of data, but can estimate a rough range. For example, if a company
> currently has 300 million customers in the United States, the company will
> have 7 billion customers in the world at most. In this scenario, using hash
> partitioning to cope with data growth within the known range by directly
> adding files and establishing bloom filters can still guarantee
> performance.
> The consistent hash bucket index is also very valuable, but when it is
> used for data expansion, it still involves the need to redistribute the
> data records of some data files, thus affecting the performance. When it is
> completely impossible to estimate the range of data capacity, it is very
> suitable to use consistent hashing.
> >> you can directly search the data under the partition, which greatly
> >reduces the scope of the Bloom filter to search for files and reduces the
> >false positive of the Bloom filter.
> >The bloom index is already partition-aware and, unless you use the global
> >version, can achieve this. Am I missing something?
> >
> >Another key thing is - if we can avoid adding a new meta field, that would
> >be great. Is it possible to implement this similar to bucket index, based
> >on just table properties?
> Add a hash partition field in the table to implement the hash partition
> function, which can well reuse the existing partition function, and
> involves very few code changes. Because the hash partition field values
> under the parquet file in a columnar storage format are all equal, the
> added column field hardly occupies storage space after compression.
> Of course, it is also possible not to add a hash partition field to the
> table and instead to store it in the corresponding metadata, but then it
> will be difficult to reuse the existing functions. Building the hash
> partitions and partition pruning during queries would need more time to
> develop and test.
> >On Sat, Feb 18, 2023 at 8:18 PM 吕虎  wrote:
> >
> >> Hi folks,
> >>
> >> Here is my proposal. Thank you very much for reading it. I am looking
> >> forward to your agreement to create an RFC for it.
> >>
> >> Background
> >>
> >> In order to deal with the problem that the modification of a small amount
> >> of local data needs to rewrite the entire partition data, Hudi divided the
> >> partition into multiple File Groups, and each File Group is identified by
> >> the File ID. In this way, when a small amount of local data is modified,
> >> only the data of the corresponding File Group needs to be rewritten. Hudi
> >> consistently maps the given Hudi record to the File ID through the index
> >> mechanism. The mapping relationship between Record Key and File Group/File
> >> ID will not change once the first version of Record is determined.
> >>
> >> At present, Hudi's indexes mainly include Bloom filter index, HBase
> >> index and bucket index. The Bloom filter index has a false positive
> >> problem. When a large amount of data results in a large number of File
> >> Groups, the false positive problem is magnified and leads to poor
> >> performance. The HBase index depends on the external HBase database, and
> >> may be inconsistent, which will ultimately increase the operation and
> >> maintenance costs. Bucket index makes each bucket of the bucket index
> >> correspond to a File Group. You need to pre-select the number of 

Re: About for 0.12.3 Release Timeline

2023-03-21 Thread Vinoth Chandar
Hi,

Given there are some critical regressions set to go, I would prefer to
scope down 0.12.3 to just the few PRs and get something out asap. Once
everyone returns, we can drive a 0.12.4 on top? We can then take even till
end of April.
Others, thoughts?

On Mon, Mar 20, 2023 at 23:39 Forward Xu  wrote:

> Hi folks,
>
> How about April 10th as our release date for 0.12.3? The period from now
> to April 10th includes the traditional Chinese Qingming Festival and the
> full testing schedule.
>
> ForwardXu
> Best
>


About for 0.12.3 Release Timeline

2023-03-21 Thread Forward Xu
Hi folks,

How about April 10th as our release date for 0.12.3? The period from now
to April 10th includes the traditional Chinese Qingming Festival and the
full testing schedule.

ForwardXu
Best