> I'm a big fan of the second approach ... I think it's fine to make those 
> assumptions.

I was leaning towards approach (2) as well, so that all sounds good to me.

I'd have to think a little more about what defaults make most sense
for the non-incremental use case before I could weigh in intelligently
there.  I think it probably ties into what the default will be for the
"incremental" option that you suggested.  If the default is
incremental=true, then I think it's safe to assume that someone
choosing non-incremental is fine blowing away any existing files,
doing compression at the end, etc.  But if non-incremental is the
quiet default, I'm less sure.
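
To make that concrete, the defaulting I have in mind is roughly the
sketch below (made-up names for illustration, not actual operator code):

    // Sketch only: a made-up helper showing how an unset "incremental"
    // field could fall back to whichever default we pick.
    package backup

    // incrementalOrDefault treats a nil (unset) flag as the chosen default.
    // If that default is true, the non-incremental behaviors (deleting
    // existing files, compressing at the end) only happen when a user
    // explicitly asks for them.
    func incrementalOrDefault(incremental *bool, defaultIncremental bool) bool {
        if incremental == nil {
            return defaultIncremental
        }
        return *incremental
    }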

In any case, thanks for responding - having a general sanity check on
the approach gives me enough to get started!

Best,

Jason

On Mon, Aug 2, 2021 at 12:11 PM Houston Putman <houstonput...@gmail.com> wrote:
>
> Hey Jason, thanks for the thorough investigation here.
>
> I'm a big fan of the second approach, but in this case I think we'd really
> only need a single option: incremental (true/false).
>
> If the user specifies an incremental backup, we know that:
>
> - They do not want a unique name
> - The data already there should not be deleted
> - The data should not be compressed
>
> I think it's fine to make those assumptions.
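>
> Just to make that concrete, I'm picturing something along the lines of
> the sketch below (field name and type are only illustrative, not the
> operator's actual API):
>
>     // Sketch only: a hypothetical "incremental" flag on the SolrBackup
>     // spec; the existing fields are omitted and the name is made up.
>     package v1beta1
>
>     type SolrBackupSpecSketch struct {
>         // Incremental, when true, reuses the existing backup location,
>         // skips the pre-backup cleanup, and skips compression. A pointer
>         // leaves room for whatever default we settle on when it's unset.
>         Incremental *bool `json:"incremental,omitempty"`
>     }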
>
> However, for the non-incremental use case, some of those options do come
> into play:
>
> - I think deleting the existing data is fine, but please correct me if I'm
>   wrong.
> - Compressing data by default should be fine? I see no reason not to, but we
>   can always make this an option.
> - The unique name thing is fair, but if we do enable cron-scheduled backups,
>   then we probably do want a unique name per-backup here.
>
> I think it's fine to change the default behavior going forward if it comes 
> with a good reason, but for the incremental/non-incremental option
> I think a field in the CRD is by far the best option.
>
> - Houston
>
> On Fri, Jul 30, 2021 at 3:15 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:
>>
>> Hey all,
>>
>> I've been getting familiar in the last week or two with our new
>> operator, and noticed that the way its backups work will miss out on
>> the "incremental" efficiency improvements added recently as a part of
>> SIP-12.  For backups to be done incrementally, an ongoing backup has
>> to be able to "see" the files stored by previous backups so that it
>> knows which index files to skip over.  Our current operator support
>> does a few things that prevent this in practice:
>>
>> - the operator "rm -rf"s all files at the backup location before
>>   starting each new backup
>> - the operator requests each backup at a unique name/location
>> - the operator compresses the backup file tree after finishing each backup
>>
>> Everything will still work; the backups just won't be nearly as
>> efficient for many common use cases as they could be.
>>
>> There are a few ways we could address this.
>>
>> In one approach, we could leave 'solrbackup' mostly untouched. For
>> "incremental" situations, we would create a new resource-type
>> ('solrbackupschedule'? 'solrbackuprepeating'?) that's explicitly
>> geared towards repeated backups of the same collections and knows to
>> store these all in the same location.  Conceivably it could also have
>> other useful ops features like cron-job-like scheduling of backups.
>> 'solrbackupschedule' would then be our solution for users who want to
>> do recurring or repeated backups, and 'solrbackup' could be
>> repositioned in the docs as the solution for those doing an ad-hoc,
>> standalone backup.
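>>
>> For the sake of discussion, a 'solrbackupschedule' spec might look
>> something like the sketch below (resource and field names are just
>> placeholders, not a concrete proposal):
>>
>>     // Sketch only: a hypothetical spec for a recurring-backup resource.
>>     // Names are placeholders, not real operator types.
>>     package v1beta1
>>
>>     type SolrBackupScheduleSketch struct {
>>         // Schedule is a cron expression controlling when backups run.
>>         Schedule string `json:"schedule"`
>>
>>         // Collections to include in each backup run.
>>         Collections []string `json:"collections,omitempty"`
>>
>>         // Location is the shared backup location reused across runs, so
>>         // each new backup can see prior files and copy only new segments.
>>         Location string `json:"location,omitempty"`
>>     }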
>>
>> Another approach would be to focus instead on adding configuration
>> options to 'solrbackup' that would make it suitable for incremental
>> backups: enable/disable backup compression, cleaning/retaining the
>> "location" prior to doing a backup, an override for the backup
>> location, etc.  'solrbackup' would remain the option for anyone doing
>> any sort of backup.  (Of course, we could also add a
>> 'solrbackupschedule' resource-type as a layer on top of this if the idea
>> of cron-like backup triggering is appealing; it could be implemented in
>> terms of managing 'solrbackup' sub-resources that perform the actual
>> "work".)
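>>
>> Concretely, the new knobs on 'solrbackup' might look something like the
>> sketch below (again, just to show the shape; these names aren't a
>> proposal):
>>
>>     // Sketch only: possible additions to the existing SolrBackup spec to
>>     // make it incremental-friendly. Field names are illustrative.
>>     package v1beta1
>>
>>     type SolrBackupOptionsSketch struct {
>>         // Compression controls whether the finished backup tree is
>>         // compressed into an archive once the backup completes.
>>         Compression *bool `json:"compression,omitempty"`
>>
>>         // CleanLocation controls whether files already at the backup
>>         // location are deleted before the backup starts.
>>         CleanLocation *bool `json:"cleanLocation,omitempty"`
>>
>>         // Location lets multiple 'solrbackup' resources share a
>>         // directory, so incremental backups can reuse prior files.
>>         Location string `json:"location,omitempty"`
>>     }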
>>
>> There are tradeoffs for both approaches IMO.
>>
>> The first approach is simplest in terms of backcompat.  It may also
>> prove simplest in handling discrepancies between Solr versions
>> (incremental backups are only supported in Solr 8.9+).  But it leaves a
>> potential use-case gap: users may take backups frequently enough to
>> benefit from "incrementality", but without any sort of defined
>> schedule or set periodicity like a 'solrbackupschedule' resource might
>> require.  It also risks duplicating code as both 'solrbackup' and
>> 'solrbackupschedule' would involve similar actions.
>>
>> OTOH, the second approach is more flexible ('solrbackup' would become
>> suitable for any common backup use case), and 'solrbackupschedule', if
>> created, would have a nice conceptual separation, being implemented as
>> a layer on top of 'solrbackup'.  But it pays for all of this by making
>> 'solrbackup' more complex and harder for a non-Solr SME to "get right"
>> out of the box, and by opening some backcompat questions/challenges.
>> Lastly, it'd require us to think carefully about how cleanup and
>> resource-deletion works, since this approach will allow multiple
>> 'solrbackup' resources to share a backup "location".
>>
>> Anyone have any thoughts or preferences between those two options?  Or
>> some third approach I missed?  Or even general context around why our
>> operator backup support looks the way it does?  Really appreciate any
>> input!
>>
>> Best,
>>
>> Jason
>>

