I was able to reproduce with the commands in the script here https://paste.apache.org/o4uta. They should work from within any clean 8.9.0 checkout, if anyone else is interested in reproducing.
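For anyone who can't reach the paste, the sequence from the original report quoted further down (backup, SPLITSHARD, backup again) can be sketched roughly as follows. This is not the linked script. The base URL, collection name, backup name, shard name, and location are hypothetical placeholders, and the sketch only builds the Collections API URLs using the actions and parameters quoted in the report. It does not send any requests.

```python
from urllib.parse import urlencode

# Hypothetical placeholder values; adjust for your own cluster.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "reproCollection"
BACKUP_NAME = "reproCollectionBackup"
LOCATION = "/path/to/my/shared/drive"


def collections_url(action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


def repro_urls(shard="shard1"):
    """Return the three requests from the report, in the order they are issued."""
    backup = collections_url(
        "BACKUP", name=BACKUP_NAME, collection=COLLECTION, location=LOCATION
    )
    split = collections_url("SPLITSHARD", collection=COLLECTION, shard=shard)
    # The second backup reuses the same name and location as the first,
    # which is the situation in which the report says the failure occurs.
    return [backup, split, backup]


if __name__ == "__main__":
    for url in repro_urls():
        print(url)
```

Each printed URL could then be requested in order with curl or a browser against a test cluster, waiting for the split to finish before the final backup.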
After reproducing I noticed some errors in the logs that probably point to the real root cause here:

Caused by: java.lang.IllegalArgumentException: Unable to parse invalid ShardBackupId: md_shard2_0_0
        at org.apache.solr.core.backup.ShardBackupId.from(ShardBackupId.java:59)
        at org.apache.solr.handler.admin.BackupCoreOp.parseShardBackupId(BackupCoreOp.java:99)
        at org.apache.solr.handler.admin.BackupCoreOp.execute(BackupCoreOp.java:44)
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:367)
        ... 43 more

It looks like the "ShardBackupId" parsing code is too brittle to handle the shard name that results from a splitshard. I've filed SOLR-15696 for this since it's a clear bug.

(That said - Jordan - feel free to create and close a "test" JIRA if you'd still like to validate that you have the right permissions to do so.)

On Fri, Oct 15, 2021 at 1:49 PM Jason Gerlowski wrote:
>
> One last thought for now:
>
> One additional workaround option to try would be to change the
> 'location' path or 'name' parameters provided to Solr at backup time
> after any change in the number of shards. If the post-splitshard
> backups are stored in a subdirectory location or under a different
> name, I suspect you'd avoid whatever missing file weirdness you've run
> into so far.
>
> On Fri, Oct 15, 2021 at 1:45 PM Jason Gerlowski wrote:
> >
> > Hi Jordan,
> >
> > Sorry you're running into problems with incremental backups (and with
> > JIRA!).
> >
> > I suspect you just ran into a temporary problem with JIRA - as long as
> > you have an account and are logged in you should 100% be able to file
> > JIRA tickets the way you described. Please give it another shot and
> > let me know if the issue still persists. If you're still unable to
> > create the ticket I'm happy to do so on your behalf.
> >
> > In terms of the actual behavior you're seeing, I know that splitshards
> > can cause hiccups in backup/restore workflows, but I would expect
> > those to happen primarily at restore-time. (A change in the number of
> > shards effectively prevents any previously backed up data from being
> > restored to the original collection. An 'N' shard backup can't be
> > restored to a collection with 'N+1' shards.)
> >
> > So at first glance your report above sounds like a bug. That said,
> > I'm returning from an extended leave and am pretty rusty on some of
> > the specifics here. I'll work on reproducing this myself and try to
> > figure out if this is a real problem or some other "known" limitation
> > that I'd forgotten about.
> >
> > Best,
> >
> > Jason
> >
> > On Thu, Sep 30, 2021 at 12:34 PM Jordan Diehl wrote:
> > >
> > > Hello,
> > > I just tried opening a Jira ticket for an issue I was seeing, but after
> > > filling out all the info and hitting create it didn't work. Now any time
> > > I click create I get an error message saying "The Jira server could not
> > > be contacted. This may be a temporary glitch or the server may be down.".
> > > I also tried this on 3 other computers, but they all hit the same issue.
> > > Once they try to create the bug they are permanently blocked from bug
> > > creation. I'm not sure what to do at this point, so this email seems to
> > > be my last chance to submit this bug. I would really appreciate it if
> > > someone could either create this bug, or if there is a known issue with
> > > Jira right now, then let me know what that issue is and how I should
> > > proceed.
> > >
> > > Bug Info
> > > Summary: Incremental backup attempts fail after a shard split operation
> > > has completed
> > > Component: Backup/Restore
> > > Affects Version: 8.9
> > > Description:
> > > I have been attempting to use the incremental backup API on Solr 8.9.0,
> > > but while testing in our product we would occasionally get into a state
> > > where all subsequent backup attempts would fail. After some triage we
> > > found that it was happening to any collection which had undergone a shard
> > > split operation. If we did a backup, completed a shard split operation,
> > > then attempted another backup, the second backup would fail with a
> > > NoSuchFileException whose message referenced the backup id of the
> > > second backup.
> > >
> > > Steps to reproduce:
> > >
> > > 1. Create a new collection with no associated backups
> > > 2. Run a backup for this collection
> > >    /admin/collections?action=BACKUP&name=myBackupName&collection=myCollectionName&location=/path/to/my/shared/drive
> > > 3. Run a shard split operation
> > >    /admin/collections?action=SPLITSHARD&collection=name&shard=shardID
> > > 4. Attempt another backup
> > >
> > > Expected Outcome:
> > >
> > > * If this operation is being blocked intentionally, then I would expect
> > > an informative error message explaining why it failed. Otherwise I would
> > > expect the backup to complete successfully.
> > >
> > > Actual Outcome:
> > >
> > > * The backup operation fails with a NoSuchFileException.
> > >
> > > NOTE: In the below exception message the number in the file which isn’t
> > > found (in this case zk_backup_1) corresponds to the backup attempt
> > > currently in progress.
> > >
> > > {
> > >   "responseHeader":{
> > >     "status":500,
> > >     "QTime":54},
> > >   "failure":{
> > >     "MYIPADDRESS:31018_solr":"org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException:Error from server at null: Error handling 'BACKUPCORE' action"},
> > >   "Operation backup caused exception:":"java.nio.file.NoSuchFileException:java.nio.file.NoSuchFileException: /opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1",
> > >   "exception":{
> > >     "msg":"/opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1",
> > >     "rspCode":-1},
> > >   "error":{
> > >     "metadata":[
> > >       "error-class","org.apache.solr.common.SolrException",
> > >       "root-error-class","org.apache.solr.common.SolrException"],
> > >     "msg":"/opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1",
> > >     "trace":"org.apache.solr.common.SolrException: /opt/hci/solrBackups/reproCollectionBackup/reproCollection/zk_backup_1\n\tat
> > >       org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:65)\n\tat
> > >       org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:301)\n\tat
> > >       org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:257)\n\tat
> > >       org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216)\n\tat
> > >       org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:836)\n\tat
> > >       org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:800)\n\tat
> > >       org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:545)\n\tat
> > >       org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)\n\tat
> > >       org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357)\n\tat
> > >       org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201)\n\tat
> > >       org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)\n\tat
> > >       org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)\n\tat
> > >       org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
> > >       org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)\n\tat
> > >       org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
> > >       org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
> > >       org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)\n\tat
> > >       org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat
> > >       org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)\n\tat
> > >       org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n\tat
> > >       org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)\n\tat
> > >       org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)\n\tat
> > >       org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n\tat
> > >       org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)\n\tat
> > >       org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
> > >       org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)\n\tat
> > >       org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)\n\tat
> > >       org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n\tat
> > >       org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
> > >       org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)\n\tat
> > >       org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
> > >       org.eclipse.jetty.server.Server.handle(Server.java:516)\n\tat
> > >       org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)\n\tat
> > >       org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)\n\tat
> > >       org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)\n\tat
> > >       org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)\n\tat
> > >       org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat
> > >       org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)\n\tat
> > >       org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)\n\tat
> > >       org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)\n\tat
> > >       org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)\n\tat
> > >       org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)\n\tat
> > >       org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)\n\tat
> > >       org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)\n\tat
> > >       org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)\n\tat
> > >       org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)\n\tat
> > >       java.lang.Thread.run(Thread.java:748)\n",
> > >     "code":500}}
> > >
> > > I tried a few different workaround attempts, but after going through
> > > these steps I wasn’t able to run another backup for the collection.
> > >
> > > Workaround attempt 1:
> > >
> > > 1. Used the API to delete the backup
> > > 2. Used the API to purge unused backup files
> > > 3. Restarted Solr
> > > 4. Attempted another backup
> > > 5. Encountered the same failure
> > >
> > > Workaround attempt 2:
> > >
> > > 1. Deleted all files in my Solr backup mount location
> > > 2. Restarted Solr
> > > 3. Attempted another backup
> > > 4. Encountered the same failure
> > >
> > > Thanks for your time,
> > >
> > > Jordan Diehl
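
As a footnote to the "Unable to parse invalid ShardBackupId: md_shard2_0_0" error at the top of this thread, here is a hypothetical illustration of how an id parser can be too brittle for post-split shard names. It is not Solr's actual ShardBackupId code. It assumes, purely from the example in the error message, that an id looks like `md_<shardName>_<backupNumber>`, and it relies on split shards getting names with an extra underscore (for example `shard2_0`). Both function names are made up for this sketch.

```python
def parse_naive(shard_backup_id):
    """Hypothetical brittle parser: assumes exactly three '_'-separated parts."""
    parts = shard_backup_id.split("_")
    if len(parts) != 3 or parts[0] != "md":
        raise ValueError(f"Unable to parse invalid ShardBackupId: {shard_backup_id}")
    return parts[1], int(parts[2])


def parse_tolerant(shard_backup_id):
    """Hypothetical alternative: treat only the last '_' as the separator."""
    prefix = "md_"
    if not shard_backup_id.startswith(prefix):
        raise ValueError(f"Unable to parse invalid ShardBackupId: {shard_backup_id}")
    # rpartition splits on the final underscore, so underscores inside the
    # shard name (as produced by a shard split) stay part of the name.
    shard_name, sep, number = shard_backup_id[len(prefix):].rpartition("_")
    if not sep or not shard_name or not number.isdigit():
        raise ValueError(f"Unable to parse invalid ShardBackupId: {shard_backup_id}")
    return shard_name, int(number)


if __name__ == "__main__":
    print(parse_naive("md_shard2_0"))       # pre-split shard name parses fine
    print(parse_tolerant("md_shard2_0_0"))  # post-split shard name also parses
    try:
        parse_naive("md_shard2_0_0")
    except ValueError as err:
        print(err)
```

Under these assumptions the naive version accepts ids for shards that were never split but rejects `md_shard2_0_0`, while splitting on the last underscore handles both. Whether the real fix for SOLR-15696 takes this shape is not something this thread establishes.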
