Hi, I am also in favour of retaining the KEEP_LATEST_FILE_VERSIONS policy.
I suspect many people use Hudi to manage Parquet files that are consumed by downstream tools. In my use case I don’t want to change any consumer logic in those tools, so KEEP_LATEST_FILE_VERSIONS with CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" works well. I can also control when downstream jobs start consuming data, so I don’t run into issues such as files being deleted while a query is running.

On Thursday, 13 June 2019, Vinoth Chandar <[email protected]> wrote:

> yes. we always keep atleast one version out, since deleting it could fail
> the queries..
> Thanks for the feedback. Will not remove it then.
>
> We can work towards Impala support for your use-case, as a long term
> solution. And revisit later may be
>
> On Tue, Jun 11, 2019 at 9:54 PM Gary Li <[email protected]> wrote:
>
> > Thanks, Vinoth. That's very helpful.
> >
> > When I was using data consumers that don't support hoodie format, I have to
> > use KEEP_LATEST_FILE_VERSIONS and CLEANER_FILE_VERSIONS_RETAINED_PROP = "1"
> > to keep the parquet files clean, as discussed in
> > https://github.com/apache/incubator-hudi/issues/715 . When I use
> > KEEP_LATEST_COMMITS with hoodie.cleaner.commits.retained = "1", I will
> > still have two versions of parquet files.
> >
> > Comparing with running batch jobs, this way actually make my situation much
> > better. So I'd recommend not to retire KEEP_LATEST_FILE_VERSIONS and some
> > people might find it useful as I do.
> >
> > Thanks!
> > Gary
> >
> > On Tue, Jun 11, 2019 at 9:20 AM Vinoth Chandar <[email protected]> wrote:
> >
> > > Cool. So, cleaning policy determines how we clean up older versions of file
> > > groups (simplistically old parquet and log files), to bound storage growth,
> > >
> > > KEEP_LATEST_COMMITS (default) : Retains (does not delete) any file (slice)
> > > that was touched in the last X commits. The idea here is that you are able
> > > to pull the incremental changes worth upto X commits.
> > > KEEP_LATEST_FILE_VERSIONS : If you are not interested in incremental pull
> > > at all, you can choose to just retain X files (slices) per file group (i.e
> > > files that share same prefix) instead. This could result in fewer files in
> > > some cases.
> > >
> > > In practice, we always use KEEP_LATEST_COMMITS, I keep thinking about
> > > starting a discussion to retire LATEST_FILE_VERSIONS actually..
> > >
> > > Hope that helps.
> > >
> > > On Tue, Jun 11, 2019 at 9:05 AM Gary Li <[email protected]> wrote:
> > >
> > > > Hello Vinoth,
> > > >
> > > > Yes, that’s what I mean.
> > > >
> > > > Thanks
> > > > Gary
> > > >
> > > > On Tue, Jun 11, 2019 at 9:03 AM Vinoth Chandar <[email protected]> wrote:
> > > >
> > > > > Hi Gary,
> > > > >
> > > > > Do you mean cleaning policy? KEEP_LATEST_FILE_VERSIONS vs
> > > > > KEEP_LATEST_COMMITS ?
> > > > >
> > > > > Thanks
> > > > > VInoth
> > > > >
> > > > > On Mon, Jun 10, 2019 at 9:47 PM Gary Li <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am a little confused when I was looking at the compaction policy. What is
> > > > > > the difference between KEEP_LATEST_COMMIT vs KEEP_LATEST_VERSION? What is
> > > > > > the exact definition of "COMMIT" and "VERSION"?
> > > > > >
> > > > > > Thanks,
> > > > > > Gary
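
For anyone reading the thread later, here is a rough toy model of the two policies as they are described above. It is only a sketch, not Hudi's cleaner code: the function names, the list-of-commit-times layout, and the sample commit times are all made up for illustration. It assumes each version of a file group is identified by the commit time that wrote it, and that the retained count is at least 1 and no larger than the number of commits on the timeline. It shows why KEEP_LATEST_COMMITS with one retained commit can still leave two versions of a file, while KEEP_LATEST_FILE_VERSIONS with "1" leaves exactly one.

```python
# Toy model of the two cleaning policies discussed in this thread.
# NOT Hudi's implementation; all names and data below are hypothetical.


def keep_latest_file_versions(version_commits, versions_retained):
    """Keep only the newest `versions_retained` versions of one file group.

    Assumes versions_retained >= 1.
    """
    ordered = sorted(version_commits)
    return ordered[-versions_retained:]


def keep_latest_commits(version_commits, timeline, commits_retained):
    """Keep versions written in the last `commits_retained` commits, plus the
    newest version written before that window, since a query may still be
    reading it.

    Assumes 1 <= commits_retained <= len(timeline).
    """
    earliest_retained = sorted(timeline)[-commits_retained]
    ordered = sorted(version_commits)
    inside = [c for c in ordered if c >= earliest_retained]
    before = [c for c in ordered if c < earliest_retained]
    # before[-1:] is the single newest version older than the window (if any).
    return before[-1:] + inside


# Hypothetical commit times; one file group rewritten by every commit.
timeline = ["001", "002", "003"]
file_group = ["001", "002", "003"]

print(keep_latest_file_versions(file_group, 1))      # ['003']
print(keep_latest_commits(file_group, timeline, 1))  # ['002', '003']
```

In this model the extra "002" version under KEEP_LATEST_COMMITS corresponds to the "at least one version" that is kept so running queries do not fail, which matches the two Parquet versions Gary reported.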
