Thanks, Vinoth. That's very helpful. Because some of my data consumers don't support the hoodie format, I have to use KEEP_LATEST_FILE_VERSIONS with CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" to keep only a single version of each parquet file, as discussed in https://github.com/apache/incubator-hudi/issues/715. When I use KEEP_LATEST_COMMITS with hoodie.cleaner.commits.retained = "1", I still end up with two versions of the parquet files.
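
To check my understanding, here is a toy Python model of the two policies. It is only a sketch, not Hudi's cleaner code: the function names and the file_groups layout are made up for illustration, and I am assuming that KEEP_LATEST_COMMITS also keeps the newest file slice written before the earliest retained commit, which would explain the second parquet version I see.

```python
def keep_latest_file_versions(file_groups, versions_retained):
    """Toy model: keep the newest N file slices of every file group.

    file_groups maps a (hypothetical) file group id to the commit times
    that wrote a slice of it. Returns the commit times that survive.
    """
    if versions_retained < 1:
        raise ValueError("versions_retained must be >= 1")
    return {
        fg: sorted(commits)[-versions_retained:]
        for fg, commits in file_groups.items()
    }


def keep_latest_commits(file_groups, timeline, commits_retained):
    """Toy model: keep every slice written in the last N commits.

    Assumption (not confirmed in this thread): the newest slice written
    before the earliest retained commit is kept as well.
    """
    if commits_retained < 1:
        raise ValueError("commits_retained must be >= 1")
    # Earliest commit that is still inside the retained window.
    earliest = sorted(timeline)[-commits_retained:][0]
    kept = {}
    for fg, commits in file_groups.items():
        ordered = sorted(commits)
        recent = [c for c in ordered if c >= earliest]
        older = [c for c in ordered if c < earliest]
        # older[-1:] is the newest slice before the window (or empty).
        kept[fg] = older[-1:] + recent
    return kept


if __name__ == "__main__":
    timeline = ["001", "002", "003"]
    file_groups = {"fg1": ["001", "002", "003"], "fg2": ["001"]}

    # One version per file group: fg1 keeps only 003, fg2 keeps 001.
    print(keep_latest_file_versions(file_groups, 1))

    # One retained commit: fg1 still keeps 002 and 003 (two versions).
    print(keep_latest_commits(file_groups, timeline, 1))
```

In this model, retaining one commit leaves two slices for any file group touched by the latest commit, while retaining one file version leaves exactly one slice, which is what my non-hoodie consumers need.
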
Compared with running batch jobs, KEEP_LATEST_FILE_VERSIONS actually makes my situation much better. So I'd recommend not retiring KEEP_LATEST_FILE_VERSIONS; some people might find it as useful as I do.

Thanks!
Gary

On Tue, Jun 11, 2019 at 9:20 AM Vinoth Chandar <[email protected]> wrote:

> Cool. So, cleaning policy determines how we clean up older versions of
> file groups (simplistically old parquet and log files), to bound storage
> growth.
>
> KEEP_LATEST_COMMITS (default) : Retains (does not delete) any file (slice)
> that was touched in the last X commits. The idea here is that you are able
> to pull the incremental changes worth up to X commits.
> KEEP_LATEST_FILE_VERSIONS : If you are not interested in incremental pull
> at all, you can choose to just retain X files (slices) per file group
> (i.e. files that share the same prefix) instead. This could result in
> fewer files in some cases.
>
> In practice, we always use KEEP_LATEST_COMMITS. I keep thinking about
> starting a discussion to retire KEEP_LATEST_FILE_VERSIONS, actually.
>
> Hope that helps.
>
> On Tue, Jun 11, 2019 at 9:05 AM Gary Li <[email protected]> wrote:
>
> > Hello Vinoth,
> >
> > Yes, that's what I mean.
> >
> > Thanks
> > Gary
> >
> > On Tue, Jun 11, 2019 at 9:03 AM Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi Gary,
> > >
> > > Do you mean cleaning policy? KEEP_LATEST_FILE_VERSIONS vs
> > > KEEP_LATEST_COMMITS?
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Mon, Jun 10, 2019 at 9:47 PM Gary Li <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am a little confused when I was looking at the compaction policy.
> > > > What is the difference between KEEP_LATEST_COMMIT vs
> > > > KEEP_LATEST_VERSION? What is the exact definition of "COMMIT" and
> > > > "VERSION"?
> > > >
> > > > Thanks,
> > > > Gary
