Re: Support for configuring the number of remaining snapshots

2024-09-03 Thread Xavier Bai
I pushed a PR for this feature: https://github.com/apache/amoro/pull/3164

On Mon, Sep 2, 2024 at 19:18, Jinsong Zhou wrote:

> Hi,
>
> Amoro may need its own configuration definitions that apply across
> multiple table formats.
> However, the default value can be taken from the native table format
> configuration, such as `history.expire.min-snapshots-to-keep`.
>
> Best,
> Jinsong
>
> On Mon, Sep 2, 2024 at 6:02 PM Paul Lam  wrote:
>
> > +1 for supporting the snapshots to keep.
> >
> > However, Iceberg natively supports
> > `history.expire.min-snapshots-to-keep`. Should we directly reuse its
> > value?
> >
> > Best,
> > Paul Lam
> >
> > > 2024年9月2日 16:10,Xavier Bai  写道:
> > >
> > > Hi developers,
> > >
> > > > Currently, when we expire snapshots, the TTL is the only factor we
> > > > consider. However, Iceberg supports setting a minimum number of
> > > > snapshots to retain. I believe we should also make this configuration
> > > > option available (e.g. `snapshot.base.keep.min-count`). For developers
> > > > and table users, these snapshots serve as the table's update log,
> > > > making it easier to review the history of updates. If there are no
> > > > updates for an extended period, Amoro may retain only a single
> > > > snapshot after cleanup, which could result in a loss of information
> > > > for users.
> > > >
> > > > In addition, the Iceberg community is also working on retaining more
> > > > historical snapshot information in an additional folder. I believe
> > > > this is a significant requirement.
> > >
> > > Best regards,
> > > Xu Bai
> >
> >
>
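
The fallback discussed in this thread (an Amoro-level setting whose default comes from the native Iceberg property) could look roughly like the sketch below. It is only an illustration: the class name `MinSnapshotsResolver` is hypothetical, `snapshot.base.keep.min-count` is the example key floated in the proposal rather than a released property, and clamping the result to at least 1 is an assumption of this sketch. The fallback default of 1 matches Iceberg's documented default for `history.expire.min-snapshots-to-keep`.

```java
import java.util.Map;

// Hypothetical helper, not part of Amoro: resolves how many snapshots to keep.
final class MinSnapshotsResolver {
    // Amoro-level key as floated in this thread (an example, not a released name).
    static final String AMORO_KEY = "snapshot.base.keep.min-count";
    // Iceberg's native table property.
    static final String ICEBERG_KEY = "history.expire.min-snapshots-to-keep";
    // Iceberg's documented default for the native property.
    static final int ICEBERG_DEFAULT = 1;

    static int resolve(Map<String, String> tableProperties) {
        // Prefer the Amoro-level setting when the user has set it.
        String value = tableProperties.get(AMORO_KEY);
        if (value == null) {
            // Otherwise reuse the value configured for the native table format.
            value = tableProperties.get(ICEBERG_KEY);
        }
        if (value == null) {
            return ICEBERG_DEFAULT;
        }
        // Assumption of this sketch: never allow fewer than one snapshot.
        return Math.max(Integer.parseInt(value.trim()), 1);
    }
}
```

With this ordering, a table that only sets the Iceberg property keeps its current behaviour, and the Amoro key acts as an override.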


Support for configuring the number of remaining snapshots

2024-09-02 Thread Xavier Bai
Hi developers,

Currently, when we expire snapshots, the TTL is the only factor we
consider. However, Iceberg supports setting a minimum number of snapshots
to retain. I believe we should also make this configuration option
available (e.g. `snapshot.base.keep.min-count`). For developers and table
users, these snapshots serve as the table's update log, making it easier
to review the history of updates. If there are no updates for an extended
period, Amoro may retain only a single snapshot after cleanup, which could
result in a loss of information for users.

In addition, the Iceberg community is also working on retaining more
historical snapshot information in an additional folder. I believe this is
a significant requirement.

Best regards,
Xu Bai
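
To make the proposed rule concrete, here is a minimal sketch of an expiration check that considers both the TTL and a minimum snapshot count. It is not Amoro's or Iceberg's actual implementation: the class `SnapshotRetention`, the `Snap` type and the method `expirableIds` are hypothetical names, and the sketch treats the snapshot history as a simple list ordered by commit time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a snapshot is expired only if it is older than the
// TTL AND it is not among the newest `minCount` snapshots.
final class SnapshotRetention {
    // Minimal stand-in for a table snapshot.
    static final class Snap {
        final long id;
        final long timestampMillis;

        Snap(long id, long timestampMillis) {
            this.id = id;
            this.timestampMillis = timestampMillis;
        }
    }

    static List<Long> expirableIds(List<Snap> snapshots, long nowMillis, long ttlMillis, int minCount) {
        // Sort a copy so that the newest snapshot comes first.
        List<Snap> newestFirst = new ArrayList<>(snapshots);
        newestFirst.sort((a, b) -> Long.compare(b.timestampMillis, a.timestampMillis));

        List<Long> expirable = new ArrayList<>();
        for (int i = 0; i < newestFirst.size(); i++) {
            Snap s = newestFirst.get(i);
            boolean protectedByCount = i < minCount;
            boolean olderThanTtl = nowMillis - s.timestampMillis > ttlMillis;
            if (!protectedByCount && olderThanTtl) {
                expirable.add(s.id);
            }
        }
        return expirable;
    }
}
```

With `minCount` set to 0 the check degenerates to the TTL-only behaviour described above, so a table without recent updates would lose every snapshot older than the TTL; a larger `minCount` keeps that many of the most recent snapshots regardless of age.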


Re: [VOTE] Release Apache Amoro(incubating) 0.7.0-incubating rc2

2024-07-22 Thread Xavier Bai
+1, I deployed it in our environment and it looks good.

On Mon, Jul 22, 2024 at 14:39, Jinsong Zhou wrote:

> Hello Incubator PMC,
>
> The Apache Amoro community has voted and approved the release of Apache
> Amoro(incubating) 0.7.0-incubating rc2.
> We now kindly request the IPMC members review and vote for this release.
>
> Apache Amoro (incubating) is a Lakehouse management system built on open
> data lake formats.
>
> Amoro community vote thread:
> https://lists.apache.org/thread/cnjq4vll060qx98r2rzcjk6jn25fhv38
>
> Vote result thread:
> https://lists.apache.org/thread/9b8n0p2n19ovrdlcqxfhy83wxgg9pwyv
>
> The official Apache source release to be deployed to dist.apache.org:
>
> https://dist.apache.org/repos/dist/dev/incubator/amoro/0.7.0-incubating-RC2/
>
> This source release has been signed with a PGP key available here (Apache ID:
> jinsongzhou):
> https://downloads.apache.org/incubator/amoro/KEYS
>
> All artifacts to be deployed to the Maven Central Repository:
> https://repository.apache.org/content/repositories/orgapacheamoro-1065
> https://repository.apache.org/content/repositories/orgapacheamoro-1064
>
> Git branch for the release:
> https://github.com/apache/amoro/tree/v0.7.0-rc2
> https://github.com/apache/amoro-shade/tree/v0.7.0-rc2
>
> Please download, verify, and test.
>
> The VOTE will pass after 3 binding approvals.
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> To learn more about Apache Amoro, please see https://amoro.apache.org/
>
> Checklist for reference:
>
> [ ] Download links are valid.
> [ ] Checksums and signatures.
> [ ] LICENSE/NOTICE files exist
> [ ] No unexpected binary files
> [ ] All source files have ASF headers
> [ ] Can compile from source
>
> Best,
> Jinsong
>


Re: [VOTE] Release Apache Amoro(incubating) 0.7.0-incubating rc2

2024-07-16 Thread Xavier Bai
+1, Thanks!

On Tue, Jul 16, 2024 at 21:03, Jinsong Zhou wrote:

> Hello Amoro devs,
>
> Kindly request the devs review and vote for releasing Amoro
> 0.7.0-incubating.
>
> Apache Amoro(incubating) is a Lakehouse management system built on open
> data lake formats.
>
> The official Apache source release to be deployed to dist.apache.org:
>
>
> https://dist.apache.org/repos/dist/dev/incubator/amoro/0.7.0-incubating-RC2/
>
> This source release has been signed with a PGP key available here (Apache ID:
> jinsongzhou):
>
> https://downloads.apache.org/incubator/amoro/KEYS
>
> All artifacts to be deployed to the Maven Central Repository:
> https://repository.apache.org/content/repositories/orgapacheamoro-1065
> https://repository.apache.org/content/repositories/orgapacheamoro-1064
>
> Git branch for the release:
>
> https://github.com/apache/amoro/tree/v0.7.0-rc2
> https://github.com/apache/amoro-shade/tree/v0.7.0-rc2
>
> Please download, verify, and test.
>
> The VOTE will pass after 3 binding approvals.
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> To learn more about Apache Amoro, please see https://amoro.apache.org/
>
> Checklist for reference:
>
> [ ] Download links are valid.
> [ ] Checksums and signatures.
> [ ] LICENSE/NOTICE files exist
> [ ] No unexpected binary files
> [ ] All source files have ASF headers
> [ ] Can compile from source
>
> Best,
> Jinsong
>


Re: [VOTE] Release Apache Amoro(incubating) 0.7.0-incubating rc1

2024-07-14 Thread Xavier Bai
+1, I tested in our dev environment and it works fine!

On Fri, Jul 12, 2024 at 15:58, Jinsong Zhou wrote:

> Hello Amoro devs,
>
> Kindly request the devs review and vote for releasing Amoro
> 0.7.0-incubating.
>
> Apache Amoro(incubating) is a Lakehouse management system built on open
> data lake formats.
>
> The official Apache source release to be deployed to dist.apache.org:
>
>
> https://dist.apache.org/repos/dist/dev/incubator/amoro/0.7.0-incubating-RC1/
>
> This source release has been signed with a PGP key available here (Apache ID:
> jinsongzhou):
>
> https://downloads.apache.org/incubator/amoro/KEYS
>
> All artifacts to be deployed to the Maven Central Repository:
> https://repository.apache.org/content/repositories/orgapacheamoro-1032
> https://repository.apache.org/content/repositories/orgapacheamoro-1063
>
>
> Git branch for the release:
>
> https://github.com/apache/amoro/tree/v0.7.0-rc1
> https://github.com/apache/amoro-shade/tree/v0.7.0-rc1
>
> Please download, verify, and test.
>
> The VOTE will pass after 3 binding approvals.
>
> [ ] +1 approve
> [ ] +0 no opinion
> [ ] -1 disapprove with the reason
>
> To learn more about Apache Amoro, please see https://amoro.apache.org/
> 
>
> Checklist for reference:
>
> [ ] Download links are valid.
> [ ] Checksums and signatures.
> [ ] LICENSE/NOTICE files exist
> [ ] No unexpected binary files
> [ ] All source files have ASF headers
> [ ] Can compile from source
>
> Best,
> Jinsong
>


Re: Optimizing the efficiency of some Rest API

2024-07-11 Thread Xavier Bai
Thank you for posting this proposal. Some queries are indeed slow, and we
can start by optimising the database query overhead.

On Thu, Jul 11, 2024 at 18:40, Congxian Qiu wrote:

> Hi devs,
> While using Amoro recently, we ran into some REST API calls that respond
> slowly. We collected the cases and suggested some possible solutions in
> the doc [1]. Please let me know what you think, thanks.
>
> The problems are summarised below:
> 1. Amoro reads too many rows (some of which we do not need) each time it
> accesses the DB, which results in slow access.
> 2. Amoro needs to access the external catalog (e.g. HiveMetaStore),
> sometimes multiple times, which results in slow access.
>
> [1] https://docs.qq.com/doc/DQU9sZ2RsdmRYSE1V
>
> Best,
> Congxian
>


Re: [DISCUSS] Remove rocksdb dependencies from project

2024-07-08 Thread Xavier Bai
+1 for option 1

On Tue, Jul 9, 2024 at 14:25, Jinsong Zhou wrote:

> Hi,
>
> Thanks for the input from xuba. Yes, indeed, at this stage we may still
> need some way to let users add RocksDB support when needed.
> However, we can consider removing it from the default installation package.
>
> In my opinion, there are two possible methods:
> 1. The first one is to add a Maven profile related to RocksDB, allowing
> users to manually use this profile to build the project and enable this
> feature when needed.
> 2. The second method is to provide a bundled package for RocksDB, allowing
> it to be dynamically added at runtime.
>
> Method one is much easier to implement, so we should implement it first
> and implement method two later if we need it.
>
> What do you think?
>
> Best,
> Jinsong
>
> On Tue, Jul 9, 2024 at 11:09 AM Xavier Bai  wrote:
>
> > There are still many optimisers in PROD environments that have RocksDB
> > storage enabled. Removing the dependency from the project is acceptable,
> > but we should also document what users need to do if they want to keep
> > using the feature. For example, we could support users adding the
> > dependency themselves.
> >
> > On Mon, Jul 8, 2024 at 17:36, Jinsong Zhou wrote:
> >
> > > Hi devs,
> > >
> > > Recently, I have been working on reducing the size of the Amoro
> > > installation package. Considering that the package is almost 1GB in
> > > size, this task really should be done ASAP.
> > >
> > > I found that the largest dependency of Amoro is the rocksdb lib (more
> > > than 50MB). It is used to cache some data to disk when memory is not
> > > enough. It was originally used to cache Iceberg delete records in
> > > optimizers. But now that we have improved delete-record caching with a
> > > bloom filter, this feature is really not needed anymore.
> > >
> > > So I am considering removing the rocksdb dependencies from the project
> > > to reduce the installation package size.
> > >
> > > I am looking forward to hearing anyone's thoughts on this issue.
> > >
> > > Best regards,
> > > Jinsong
> > >
> >
>
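
Whichever method is chosen, once RocksDB is no longer in the default package the code that uses it has to cope with the library being absent. The sketch below shows one possible way to do that, assuming nothing about Amoro's real classes: `SpillStoreSelector` and `chooseBackend` are hypothetical names, and the only real identifier is `org.rocksdb.RocksDB`, the main class of the RocksDB Java binding. The check fails with an explanatory message instead of a bare `ClassNotFoundException`, which addresses the concern about telling users what to do.

```java
// Hypothetical sketch: pick a cache backend depending on whether the
// optional RocksDB library was added to the classpath by the user.
final class SpillStoreSelector {
    // Main class of the RocksDB Java binding.
    static final String ROCKSDB_CLASS = "org.rocksdb.RocksDB";

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only check presence, do not run static initializers.
            Class.forName(className, false, SpillStoreSelector.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    static String chooseBackend(boolean rocksDbRequested) {
        if (!rocksDbRequested) {
            return "memory";
        }
        if (isOnClasspath(ROCKSDB_CLASS)) {
            return "rocksdb";
        }
        // Tell the user how to restore the feature instead of failing obscurely.
        throw new IllegalStateException(
            "RocksDB storage is enabled but " + ROCKSDB_CLASS + " was not found. "
                + "Build with the RocksDB profile or add the RocksDB jar to the classpath.");
    }
}
```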


Re: [DISCUSS] Remove rocksdb dependencies from project

2024-07-08 Thread Xavier Bai
There are still many optimisers in PROD environments that have RocksDB
storage enabled. Removing the dependency from the project is acceptable,
but we should also document what users need to do if they want to keep
using the feature. For example, we could support users adding the
dependency themselves.

On Mon, Jul 8, 2024 at 17:36, Jinsong Zhou wrote:

> Hi devs,
>
> Recently, I have been working on reducing the size of the Amoro
> installation package. Considering that the package is almost 1GB in size,
> this task really should be done ASAP.
>
> I found that the largest dependency of Amoro is the rocksdb lib (more than
> 50MB). It is used to cache some data to disk when memory is not enough. It
> was originally used to cache Iceberg delete records in optimizers. But now
> that we have improved delete-record caching with a bloom filter, this
> feature is really not needed anymore.
>
> So I am considering removing the rocksdb dependencies from the project to
> reduce the installation package size.
>
> I am looking forward to hearing anyone's thoughts on this issue.
>
> Best regards,
> Jinsong
>


Re: [DISCUSS] Plan to release 0.7.0

2024-05-05 Thread Xavier Bai
+1

Thanks,
Xavier

On Mon, May 6, 2024 at 11:09, yuanfeng hu wrote:

> +1, Thanks for driving this!
>
> Best,
> Yuanfeng
>
>


Implement JdbcReporter to collect metrics on table maintenance

2024-04-27 Thread Xavier Bai
Hi developers,
I have an initial idea: collect various metrics during table maintenance
processes, such as compaction, snapshot expiration, and data expiration,
and store them in a database via JDBC. The purpose is twofold. First,
persisting these maintenance records makes it easier to track optimization
effects and estimate costs in the future. Second, we can also calculate
snapshot and other information when refreshing tables, so that historical
snapshot information and the corresponding data change trends can be
displayed to users. By querying the database rather than reading metadata
files from the data lake via IO requests, we can reduce service response
time and IO pressure.
Please let me know your thoughts.

Thank you,
Xavier Bai
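
As a rough illustration of the idea, the sketch below writes one maintenance record through plain JDBC. Everything specific in it is an assumption made for the example: the class `MaintenanceMetricsJdbcWriter`, the `MaintenanceRecord` fields, and the `table_maintenance_metrics` table with its columns are hypothetical and would need to follow whatever schema the community agrees on. Only the standard `java.sql` API is used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical sketch of persisting one maintenance run via JDBC.
final class MaintenanceMetricsJdbcWriter {
    // Hypothetical table and column names, chosen only for this example.
    static final String INSERT_SQL =
        "INSERT INTO table_maintenance_metrics "
            + "(table_name, process_type, started_at_ms, duration_ms, rewritten_bytes) "
            + "VALUES (?, ?, ?, ?, ?)";

    // One row per maintenance run (compaction, snapshot expiration, ...).
    static final class MaintenanceRecord {
        final String tableName;
        final String processType;
        final long startedAtMillis;
        final long durationMillis;
        final long rewrittenBytes;

        MaintenanceRecord(String tableName, String processType, long startedAtMillis,
                          long durationMillis, long rewrittenBytes) {
            this.tableName = tableName;
            this.processType = processType;
            this.startedAtMillis = startedAtMillis;
            this.durationMillis = durationMillis;
            this.rewrittenBytes = rewrittenBytes;
        }
    }

    // The caller owns the connection; the statement is closed automatically.
    static void write(Connection connection, MaintenanceRecord record) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            statement.setString(1, record.tableName);
            statement.setString(2, record.processType);
            statement.setLong(3, record.startedAtMillis);
            statement.setLong(4, record.durationMillis);
            statement.setLong(5, record.rewrittenBytes);
            statement.executeUpdate();
        }
    }
}
```

Keeping one row per run means trends (for example rewritten bytes per table per day) can be answered with an ordinary SQL aggregate instead of reading metadata files from the data lake.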


Re: subscribe to amoro

2024-03-31 Thread Xavier Bai
Hi Congxian,
To subscribe, please send a brief email to dev-subscr...@amoro.apache.org

Thanks,
Xu


Re: Amoro repositories migration completed

2024-03-30 Thread Xavier Bai
Thanks a lot for your contribution!

Warm regards
Xu

On Sat, Mar 30, 2024 at 22:03, Jinsong Zhou wrote:

> Hi Amoro Devs,
>
> Amoro has been transferred to the Apache repositories:
> https://github.com/apache/incubator-amoro
> https://github.com/apache/incubator-amoro-site
>
> The original git addresses will be redirected to the new ones.
> You may need to update the remote of your local git repositories to the new
> addresses.
> If you encounter any issues during the migration process or have any
> questions, please let me know.
>
> Best,
> Jinsong
>


Re: [DISCUSS] Apache Amoro proposal

2024-02-25 Thread Xavier Bai
+1, I was also one of the early developers on the project, focusing on
solving optimization and compaction issues with the company's Iceberg
tables. I believe that many teams using data lakes need a system like Amoro
for effective data lake management and to reduce the complexity of data
lake maintenance. Therefore, contributing it to the ASF can enrich the usage
scenarios and enhance data lake management capabilities.

Thanks,
Xu

On Mon, Feb 26, 2024 at 10:12, ConradJam wrote:

> +1, I'm one of the developers. At present, I think the community is
> developing well, and this project can help everyone better control the data
> lake. I suggest joining the ASF incubator to let more people know about
> this project and participate in it.
>
> On Fri, Feb 23, 2024 at 16:44, Justin Mclean wrote:
>
> > Hi,
> >
> > I would like to propose a new project to the ASF incubator - Apache
> > Amoro. I’m one of the mentors, but there are a lot of other people
> > involved who have done all of the hard work.
> >
> > Amoro is a Lakehouse management system built on open data lake formats
> > like Apache Iceberg and Apache Paimon (Incubating). Working with compute
> > engines including Apache Flink, Apache Spark, and Trino, Amoro brings
> > pluggable and self-managed features for Lakehouse to provide an
> > out-of-the-box data warehouse experience, and helps data platforms or
> > products easily build infra-decoupled, stream-and-batch-fused and
> > lake-native architecture.
> > You can find the proposal here. [1]
> >
> > We are looking forward to anyone's feedback or questions.
> >
> > Thanks,
> > Justin
> >
> > [1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal
> > -
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
>
> --
> Best
>
> ConradJam
>