Re: Regular minor/patch releases

2021-12-15 Thread Vinoth Chandar
Hi all,

Thanks for chiming in with the feedback. Looks like there is broad support
for this.

Responding to a few of the views below.

>With the rush in features without enough tests, I'm afraid the major
release version is never ready for production,
While I agree with you, I don't want to be too idealistic here either. 0.10,
for example, had a lot of testing on RCs and bug fixes afterwards as well. Some
of the features were hardened in places at Uber before we released, but
open source major releases are generally rough (you can see how rough
newer Spark versions are, for example), and the community puts in the effort to
make them more and more stable going forward. Hudi's problem IMO has been
that we have done only major releases from 0.6 to 0.10 (given our resource
crunch during the pandemic). Now is a good time to revisit this.

>when fixing bugs against the master branch, the contributors/committers
should also open a new PR
We can try this and always encourage it. I am just worried that this adds
more burden on contributors and things may get missed. IMO we can pick two
RMs (release managers) at any time, one for the next major release and one
for the next minor release, and have them shepherd the bug fixes through?
We would mark JIRAs with two fix versions.

>And minor releases should include only bug fixes, no breaking changes and no
features; it should not be hard work, I think.
+100 on this. Otherwise it defeats the purpose of the minor release.

Thanks
Vinoth

On Wed, Dec 15, 2021 at 7:22 AM leesf  wrote:

> +1
>
> We could create release branches such as release-0.10 to serve as the base
> for the 0.10.0, 0.10.1, etc. releases, and when fixing bugs against the
> master branch, the contributors/committers should also open a new PR
> against the release-0.10 branch if needed. That would avoid cherry-picking
> all bug fixes from master to release-0.10 at one time, which causes many
> conflicts. The Spark [1] and Flink [2] communities also
> maintain multiple long-lived branches this way.
>
> [1] https://github.com/apache/spark/tree/branch-3.1
> https://github.com/apache/spark/tree/branch-3.2
> [2] https://github.com/apache/flink/tree/release-1.12
> https://github.com/apache/flink/tree/release-1.13
>
> vino yang wrote on Wed, Dec 15, 2021 at 18:12:
>
> > +1
> >
> > Agree that minor releases should mostly be for bug fixes.
> >
> > Best,
> > Vino
> >
> > Danny Chan wrote on Wed, Dec 15, 2021 at 10:35:
> >
> > > I guess we must do that given the current rapid development and
> > > iteration. As for release 0.10.0, only a few days after the
> > > announcement we have already received a bunch of bug reports via
> > > GitHub issues, such as:
> > >
> > > - the empty meta file: https://github.com/apache/hudi/issues/4249
> > > - and the timeline based marker files:
> > > https://github.com/apache/hudi/issues/4230
> > >
> > > With the rush in features without enough tests, I'm afraid the major
> > > release version is never ready for production, unless there is production
> > > validation like what is done internally at Uber.
> > >
> > > And minor releases should include only bug fixes, no breaking changes and no
> > > features; it should not be hard work, I think.
> > >
> > > Best,
> > > Danny
> > >
> > > Sivabalan wrote on Tue, Dec 14, 2021 at 4:06 AM:
> > >
> > > > +1 in general. But yeah, not sure if we have the resources to do this for
> > > > every major release.
> > > >
> > > > On Mon, Dec 13, 2021 at 10:01 AM Vinoth Chandar wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > In the past we had plans for minor releases [1], but invariably we end up
> > > > > doing major ones, which also deliver the bug fixes.
> > > > >
> > > > > The reason was the cost involved in doing a release. We have made some good
> > > > > progress on regression/integration testing, which prompts me to revive
> > > > > this.
> > > > >
> > > > > What does everyone think about a monthly bugfix release on the last
> > > > > major/minor version? (Not on every major release; we still don't have
> > > > > enough contributors to pull that off IMO.) So we would be trying to do a
> > > > > 0.10.1 in early Jan, for example, in this model?
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/HUDI/Release+Management
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
>


Re: Regular minor/patch releases

2021-12-15 Thread leesf
+1

We could create release branches such as release-0.10 to serve as the base
for the 0.10.0, 0.10.1, etc. releases, and when fixing bugs against the
master branch, the contributors/committers should also open a new PR
against the release-0.10 branch if needed. That would avoid cherry-picking
all bug fixes from master to release-0.10 at one time, which causes many
conflicts. The Spark [1] and Flink [2] communities also
maintain multiple long-lived branches this way.

[1] https://github.com/apache/spark/tree/branch-3.1
https://github.com/apache/spark/tree/branch-3.2
[2] https://github.com/apache/flink/tree/release-1.12
https://github.com/apache/flink/tree/release-1.13



Re: Regular minor/patch releases

2021-12-15 Thread vino yang
+1

Agree that minor releases should mostly be for bug fixes.

Best,
Vino



Re: [DISCUSS] Propose Consistent Hashing Indexing for Dynamic Bucket Number

2021-12-15 Thread Yuwei Xiao
The RFC PR link :)
https://github.com/apache/hudi/pull/4326

I am personally inclined to add a new index option (DYNAMIC_BUCKET_INDEX),
to keep a clean & performant hash index (BUCKET_INDEX) option for
experienced users.
We could also save potential migration trouble by ensuring consistent
behavior of the hash index.

The impact of resizing (split/merge) has been described in the RFC. In
short, the split/merge runs asynchronously, embedded in the clustering
process, and doesn't block concurrent readers & writers. By controlling its
processing granularity, we can further reduce the performance impact on
concurrent operations.

I agree that merge (shrink table) may be a very infrequent operation. But I
guess we still need to implement it for completeness :)
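
To make the resizing argument concrete, here is a minimal Python sketch of the
"key->hash_value->(range mapping)->bucket" idea from the original proposal
quoted below. It is not the RFC's implementation: the class name, the bucket
naming, the CRC32 hash and the 2^32 hash space are all assumptions made for
illustration. It only tries to show that a split or merge changes one entry of
the range mapping and leaves the keys of every other bucket where they were.

```python
import bisect
import zlib
from typing import Tuple

HASH_SPACE = 2 ** 32  # assumption: hash values live in [0, 2^32)


def hash_key(key: str) -> int:
    """Map a record key to a point in the hash space (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8"))


class ConsistentBucketIndex:
    """Hypothetical range mapping: bucket i owns hashes in (ends[i-1], ends[i]]."""

    def __init__(self, num_buckets: int) -> None:
        step = HASH_SPACE // num_buckets
        # Sorted inclusive upper bounds; the last bucket always ends at the top.
        self.ends = [step * (i + 1) - 1 for i in range(num_buckets - 1)]
        self.ends.append(HASH_SPACE - 1)
        self.bucket_ids = [f"bucket-{i}" for i in range(num_buckets)]
        self._next_id = num_buckets

    def lookup(self, key: str) -> str:
        """key -> hash_value -> (range mapping) -> bucket."""
        pos = bisect.bisect_left(self.ends, hash_key(key))
        return self.bucket_ids[pos]

    def split(self, bucket_id: str) -> Tuple[str, str]:
        """Split one bucket in the middle of its range; no other bucket is touched."""
        pos = self.bucket_ids.index(bucket_id)
        start = self.ends[pos - 1] + 1 if pos > 0 else 0
        end = self.ends[pos]
        if start == end:
            raise ValueError("range too small to split")
        mid = (start + end) // 2
        new_id = f"bucket-{self._next_id}"
        self._next_id += 1
        # The new bucket takes [start, mid]; the old one keeps (mid, end].
        self.ends.insert(pos, mid)
        self.bucket_ids.insert(pos, new_id)
        return new_id, bucket_id

    def merge(self, bucket_id: str) -> str:
        """Fold a bucket's range into its right-hand neighbour and return that neighbour."""
        pos = self.bucket_ids.index(bucket_id)
        if pos == len(self.bucket_ids) - 1:
            raise ValueError("last bucket has no right-hand neighbour")
        del self.ends[pos]
        del self.bucket_ids[pos]
        return self.bucket_ids[pos]


if __name__ == "__main__":
    index = ConsistentBucketIndex(4)
    print(index.lookup("uuid-0001"))
    lower, upper = index.split(index.lookup("uuid-0001"))
    print(lower, upper, index.lookup("uuid-0001"))
```

In this sketch only the records of the bucket being split need to be
rewritten, which is why the work can be done one bucket at a time inside a
table service instead of re-shuffling the whole table.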

On Tue, Dec 14, 2021 at 1:52 AM Vinoth Chandar  wrote:

> +1 on the overall idea.
>
> I am wondering if we can layer this on top of Hash Index as a way for just
> expanding the number of buckets.
>
> While Split/Merge sounds great, IMO there is significant operational
> overhead to it. Most practical scenarios can be met with the ability to
> expand with zero impact, as you describe it?
> In fact, back when I worked on Voldemort (LinkedIn's Dynamo implementation),
> we never shrunk the tables for this reason as well.
>
> In any case, looking forward to the RFC. Please grab an RFC number!
>
> On Mon, Dec 13, 2021 at 6:24 AM Gary Li  wrote:
>
> > +1, looking forward to the RFC.
> >
> > Best,
> > Gary
> >
> > On Sun, Dec 12, 2021 at 7:12 PM leesf  wrote:
> >
> > > +1 for the improvement to make bucket index more comprehensive and looking
> > > forward to the RFC for more details.
> > >
> > > Yuwei Xiao wrote on Fri, Dec 10, 2021 at 16:22:
> > >
> > > > Dear Hudi Community,
> > > >
> > > > > I would like to propose Consistent Hashing Indexing to enable a dynamic
> > > > > bucket number, saving Hudi users from hyper-parameter tuning.
> > > > >
> > > > > Currently, we have the Bucket Index landing [1]. It is an effective index
> > > > > approach to address the performance issue during Upsert. I observed ~3x
> > > > > throughput improvement for Upsert in my local setup compared to the Bloom
> > > > > Filter approach. However, it requires pre-configuring a bucket number when
> > > > > creating the table. As described in [1], this imposes two limitations:
> > > >
> > > > > - Due to the one-to-one mapping between buckets and file groups, the size
> > > > > of a single file group may grow without bound. Services like compaction
> > > > > will take longer because of the larger read/write amplification.
> > > > >
> > > > > - There may be data skew because of imbalanced data distribution,
> > > > > resulting in long-tail reads/writes.
> > > >
> > > > > Based on the above observations, supporting a dynamic bucket number is
> > > > > necessary, especially for rapidly changing Hudi tables. Looking at the
> > > > > market, Consistent Hashing has been adopted in DB systems [2][3]. Its main
> > > > > idea is to turn the "key->bucket" mapping into
> > > > > "key->hash_value->(range mapping)->bucket", constraining the re-hashing
> > > > > process to touch only a few local buckets (e.g., only large file groups)
> > > > > rather than shuffling the whole hash table.
> > > >
> > > > > In order to introduce Consistent Hashing to Hudi, we need to consider the
> > > > > following issues:
> > > > >
> > > > > - Storing hashing metadata, such as range mapping info. Metadata size and
> > > > > concurrent updates to metadata should also be considered.
> > > > >
> > > > > - Splitting & merging criteria. We need to design one (or several) policies
> > > > > to manage when and how to split & merge buckets. A simple policy would be
> > > > > splitting in the middle when the file group reaches the size threshold.
> > > > >
> > > > > - Supporting concurrent writes & reads. The splitting or merging must not
> > > > > block concurrent writers & readers, and the whole process should be fast
> > > > > enough (e.g., one bucket at a time) to minimize the impact on other
> > > > > operations.
> > > > >
> > > > > - Integrating the splitting & merging process into existing Hudi table
> > > > > service pipelines.
> > > >
> > > > I have sketched a prototype design to address the above problems:
> > > >
> > > > > - Maintain hashing metadata for each partition (persisted as files), and
> > > > > use instants to manage multiple versions of it and concurrent updates to it.
> > > > >
> > > > > - A flexible framework will be implemented for different pluggable
> > > > > policies. The splitting plan, specifying which buckets to split (merge)
> > > > > and how, will be generated during scheduling (just like how compaction
> > > > > does).
> > > > >
> > > > > - Dual-write will be activated once the writer observes the splitting (or
> > > > > merging) process, upserting records as log files into both old and new
> > > > > buckets (file groups). Readers can see records once the writer completes,
> > > > > regardless of the splitting process.
> > > >
> > > > - The