> I'm a big fan of the second approach ... I think it's fine to make those assumptions.
I was leaning towards approach (2) as well, so that all sounds good to me.

I'd have to think a little more about what defaults make most sense for the
non-incremental use case before I could weigh in intelligently there. I think
it probably ties into what the default will be for the "incremental" option
that you suggested. If the default is incremental=true, then I think it's safe
to assume that someone choosing non-incremental is fine blowing away any
existing files, doing compression at the end, etc. But if non-incremental is
the quiet default, I'm less sure.

In any case, thanks for responding - having a general sanity check on the
approach gives me enough to get started!
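For illustration, here's a very rough sketch of what approach (2) could look
like in the SolrBackup spec, written as a Go struct since that's the operator's
implementation language. Every name below is a hypothetical placeholder rather
than the operator's actual API; the only point is that 'incremental' can be the
single switch, with the other knobs mattering only in the non-incremental case:

    package v1beta1

    // Illustrative sketch only - field names are placeholders, not the real spec.
    type SolrBackupSpec struct {
        // Existing fields (collection, repository, persistence, etc.) elided.

        // Incremental selects SIP-12-style incremental backups (Solr 8.9+).
        // When true, the operator would reuse the existing backup location,
        // skip the pre-backup "rm -rf", and skip post-backup compression.
        Incremental *bool `json:"incremental,omitempty"`

        // The fields below would only be consulted when Incremental is false.

        // Compress controls whether the finished backup file tree is compressed.
        Compress *bool `json:"compress,omitempty"`

        // CleanLocation controls whether existing files at the backup location
        // are removed before a new backup starts.
        CleanLocation *bool `json:"cleanLocation,omitempty"`

        // Location optionally overrides where the backup is written, so that
        // repeated backups of the same collection can share a location.
        Location string `json:"location,omitempty"`
    }

If incremental=true ends up being the default, the non-incremental knobs could
presumably just default to today's behavior (clean the location, compress at
the end) without surprising anyone.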
Best,

Jason

On Mon, Aug 2, 2021 at 12:11 PM Houston Putman <houstonput...@gmail.com> wrote:
>
> Hey Jason, thanks for the thorough investigation here.
>
> I'm a big fan of the second approach, but in this case I think we'd really
> only need one option: incremental: true/false
>
> If the user specifies an incremental backup, we know that:
>
> - They do not want a unique name
> - The data already there should not be deleted
> - The data should not be compressed
>
> I think it's fine to make those assumptions.
>
> However, for the non-incremental use case, some of those options do come
> into play.
>
> - I think deleting the existing data is fine, but please correct me if
>   I'm wrong
> - Compressing data by default should be fine? I see no reason not to, but
>   we can always make this an option
> - The unique name thing is fair, but if we do enable cron-scheduled
>   backups, then we probably do want a unique name per-backup here.
>
> I think it's fine to change the default behavior going forward if it comes
> with a good reason, but for the incremental/non-incremental option I think
> a field in the CRD is by far the best option.
>
> - Houston
>
> On Fri, Jul 30, 2021 at 3:15 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:
>>
>> Hey all,
>>
>> I've been getting familiar in the last week or two with our new
>> operator, and noticed that the way its backups work will miss out on
>> the "incremental" efficiency improvements added recently as a part of
>> SIP-12. For backups to be done incrementally, an ongoing backup has
>> to be able to "see" the files stored by previous backups so that it
>> knows which index files to skip over. Our current operator support
>> does a few things that prevent this in practice:
>>
>> - the operator "rm -rf"s all files at the backup location before
>>   starting each new backup
>> - the operator requests each backup at a unique name/location
>> - the operator compresses the backup file tree after finishing each backup
>>
>> Everything will still work; the backups just won't be nearly as
>> efficient for many common use cases as they could be.
>>
>> There are a few ways we could address this.
>>
>> In one approach, we could leave 'solrbackup' mostly untouched. For
>> "incremental" situations, we would create a new resource type
>> ('solrbackupschedule'? 'solrbackuprepeating'?) that's explicitly
>> geared towards repeated backups of the same collections and knows to
>> store these all in the same location. Conceivably it could also have
>> other useful ops features like cron-job-like scheduling of backups.
>> 'solrbackupschedule' would then be our solution for users who want to
>> do recurring or repeated backups, and 'solrbackup' could be
>> repositioned in the docs as the solution for those doing an ad-hoc,
>> standalone backup.
>>
>> Another approach would be to focus instead on adding configuration
>> options to 'solrbackup' that would make it suitable for incremental
>> backups: enable/disable backup compression, cleaning/retaining the
>> "location" prior to doing a backup, an override for the backup
>> location, etc. 'solrbackup' would remain the option for anyone doing
>> any sort of backup. (Of course, we could also add a
>> 'solrbackupschedule' resource type as a layer on top of this if the
>> idea of cron-like backup triggering is appealing, but it could be
>> implemented in terms of managing 'solrbackup' sub-resources that
>> perform the actual "work".)
>>
>> There are tradeoffs for both approaches IMO.
>>
>> The first approach is simplest in terms of backcompat. It may also
>> prove simplest in handling discrepancies between Solr versions
>> (incremental backups are only supported in v8.9+). But it leaves a
>> potential use-case gap: users may take backups frequently enough to
>> benefit from "incrementality", but without any sort of defined
>> schedule or set periodicity like a 'solrbackupschedule' resource might
>> require. It also risks duplicating code, as both 'solrbackup' and
>> 'solrbackupschedule' would involve similar actions.
>>
>> OTOH, the second approach is more flexible ('solrbackup' would become
>> suitable for any common backup use case), and 'solrbackupschedule', if
>> created, has a really nice conceptual separation, being implemented as
>> a layer on top of 'solrbackup'. But it pays for all of this by making
>> 'solrbackup' more complex and harder for a non-Solr SME to "get right"
>> out of the box, and by opening some backcompat questions/challenges.
>> Lastly, it'd require us to think carefully about how cleanup and
>> resource deletion work, since this approach would allow multiple
>> 'solrbackup' resources to share a backup "location".
>>
>> Anyone have any thoughts or preferences between those two options? Or
>> some third approach I missed? Or even general context around why our
>> operator backup support looks the way it does? Really appreciate any
>> input!
>>
>> Best,
>>
>> Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org