[jira] [Commented] (SOLR-15371) Backups randomly fail sometimes

Jason Gerlowski (Jira) Tue, 15 Jun 2021 12:10:07 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-15371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363861#comment-17363861
 ]


Jason Gerlowski commented on SOLR-15371:
----------------------------------------

bq. I may have forgot to mention, but the directory does not get created for 
the failing shard

This makes sense, I think.  The directory corresponding to "location+name" ( 
{{file:///mnt/solr-backups/search/search-06-14-2021}} in your case) is created 
by the overseer node 
[here|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/core/src/java/org/apache/solr/cloud/api/collections/BackupCmd.java#L97]
 - right before Solr sends any core-level requests to back up the individual 
cores.  So that location+name base dir should definitely exist: if the 
{{createDirectory}} call had failed, Solr wouldn't have made any of the 
per-shard requests.  And it also makes sense that the core backups that finish 
successfully are able to create subtrees of their own backed-up files within 
that directory.
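
To make the ordering concrete, here is a rough sketch of the sequence as I 
understand it.  This is not Solr's actual code: the class name, method names, 
core names, and per-core subdirectory layout are all hypothetical, and it 
uses a temp directory in place of the NFS mount.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical model of the ordering described above; not Solr's real classes.
class BackupOrderSketch {

    // Stand-in for the collection-level step: create the location+name base dir first.
    // Throws if creation fails, so no per-core work would start afterwards.
    static Path createBaseDir(Path location, String backupName) throws IOException {
        return Files.createDirectory(location.resolve(backupName));
    }

    // Stand-in for a per-core backup: refuse to run if the base dir is not visible,
    // otherwise create a subdirectory for this core's files (layout is assumed).
    static Path backupCore(Path baseDir, String coreName) throws IOException {
        if (!Files.isDirectory(baseDir)) {
            throw new IOException(
                "Directory to contain snapshots doesn't exist: " + baseDir.toUri());
        }
        return Files.createDirectory(baseDir.resolve(coreName));
    }

    public static void main(String[] args) throws IOException {
        // A temp dir stands in for the shared backup location.
        Path location = Files.createTempDirectory("solr-backups");
        Path baseDir = createBaseDir(location, "search-06-14-2021");
        for (String core : List.of("shard1_replica_n1", "shard2_replica_n2")) {
            System.out.println("backed up into " + backupCore(baseDir, core));
        }
    }
}
```

In a single process on a local filesystem the per-core check can never fail 
after {{createBaseDir}} returns.  The interesting case is when the check runs 
on a different node that sees the same path through NFS.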

I can only think of two things that'd cause the behavior you're seeing:
# There is some network blip with your NFS that you don't know about.  Or maybe 
your NFS isn't set up to guarantee absolutely up-to-date information in terms 
of file existence/contents.  But you said you've done your due diligence in 
checking for errors of that sort.
# If {{LocalFileSystemRepository.createDirectory}} is synchronous, then it's 
impossible for code triggered after that call to report the dir missing.  But 
if {{LocalFileSystemRepository.createDirectory}} is asynchronous in some way 
it's easier to imagine this happening.  
{{LocalFileSystemRepository.createDirectory}} is implemented as a passthrough to 
[java.nio.file.Files.createDirectory|https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/file/Files.html#createDirectory(java.nio.file.Path,java.nio.file.attribute.FileAttribute...)]
 - which is well documented and says nothing about asynchronicity.  I know the 
nio library does support some asynchronous operations so maybe this is one of 
those - but most of the information I'm finding online says that directory 
operations are synchronous.
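
One way to test both theories outside of Solr is a small standalone check 
like the one below.  It is only a diagnostic sketch: the class and method 
names are made up, and the timeout and poll interval are arbitrary.  It 
creates a directory with {{Files.createDirectory}}, checks whether it is 
visible right away, and then polls until it shows up.  Running {{main}} 
against the NFS mount on one node, and calling {{waitUntilVisible}} for the 
same path from another node, would show whether there is any visibility lag.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical diagnostic; not part of Solr.
class DirVisibilityCheck {

    // Polls until 'dir' is visible as a directory or the timeout elapses.
    // Returns the milliseconds waited, or -1 if it never became visible.
    static long waitUntilVisible(Path dir, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long start = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (Files.isDirectory(dir)) {
                return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return -1;
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Pass the backup location as the first argument; defaults to a temp dir.
        Path parent = args.length > 0
                ? Path.of(args[0])
                : Files.createTempDirectory("backup-check");
        Path dir = parent.resolve("visibility-test-" + System.nanoTime());

        Files.createDirectory(dir);
        // Check immediately after the call returns, in the same process.
        System.out.println("visible immediately: " + Files.isDirectory(dir));
        System.out.println("ms until visible: " + waitUntilVisible(dir, 5000, 50));

        Files.delete(dir);
    }
}
```

If the directory is always visible immediately in the creating process but 
takes a while to appear on another node, that would point at NFS caching 
rather than at anything asynchronous in {{createDirectory}}.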

Something to dig into I guess.

> Backups randomly fail sometimes
> -------------------------------
>
>                 Key: SOLR-15371
>                 URL: https://issues.apache.org/jira/browse/SOLR-15371
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore
>    Affects Versions: 8.5.2, 8.8.2
>            Reporter: Roy Perkins
>            Priority: Major
>
> Hi, we have an issue where sometimes one shard fails to back up due to what 
> might be a race condition in creating the folder/starting the backup.  When 
> this happens, we have to restart the first server in a shard to get the 
> backup to succeed again.  The cluster backs up to a shared NFS mount.  4/5 
> times the backup goes fine without issues (there is even another collection 
> that the backup will run for later in the morning that will succeed fine even 
> though it's all the same servers).  Below is the error I get.
> {code:java}
> "Response":"Failed to backup core=slprod_shard4_replica_n6 because 
> org.apache.solr.common.SolrException: Directory to contain snapshots doesn't 
> exist: file:///mnt/solr_backups/slprod/slprod-04-25-2021. Note that 
> Backup/Restore of a SolrCloud collection requires a shared file system 
> mounted at the same path on all nodes!"},
> {code}
> And below is the command I use to run the backup (obviously with bash 
> variables set earlier in the script)
> {code:java}
> curl -s 
> "http://localhost:8983/solr/admin/collections?action=BACKUP&name=${COLLECTION}-${DATE}&collection=${COLLECTION}&location=${BACKUP_PATH}&async=1000";
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

