[ https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822190#comment-15822190 ]
Timothy Potter commented on SOLR-9961:
--------------------------------------

The other thing I found here is that HdfsDirectory closes a shared FileSystem object, because HdfsBackupRepository uses try-with-resources:

{code}
@Override
public void copyFileTo(URI sourceRepo, String fileName, Directory dest) throws IOException {
  try (HdfsDirectory dir = new HdfsDirectory(new Path(sourceRepo), NoLockFactory.INSTANCE,
      hdfsConfig, HdfsDirectory.DEFAULT_BUFFER_SIZE * 10)) {
    dest.copyFrom(dir, fileName, fileName, DirectoryFactory.IOCONTEXT_NO_CACHE);
  }
}
{code}

This closes the FileSystem object that was retrieved with FileSystem.get. Because of this (I think), I'm seeing lots of errors like the following while doing the restore:

{code}
WARN  - 2017-01-13 14:09:44.249; [   ] org.apache.solr.handler.RestoreCore; Exception while restoring the backup index
java.lang.RuntimeException: Problem creating directory: gs://hd-fusion/aggr_solr/myAggr3/snapshot.shard1
        at org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:91)
        at org.apache.solr.core.backup.repository.HdfsBackupRepository.copyFileTo(HdfsBackupRepository.java:175)
        at org.apache.solr.handler.RestoreCore.downloadFile(RestoreCore.java:196)
        at org.apache.solr.handler.RestoreCore.access$000(RestoreCore.java:47)
        at org.apache.solr.handler.RestoreCore$1.call(RestoreCore.java:101)
        at org.apache.solr.handler.RestoreCore$1.call(RestoreCore.java:99)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: GoogleHadoopFileSystem has been closed or not initialized.
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.checkOpen(GoogleHadoopFileSystemBase.java:1802)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1284)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
        at org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:83)
        ... 9 more
{code}

There's a handy property that disables the FileSystem cache (add it to core-site.xml), which makes this error go away:

{code}
<property>
  <name>fs.gs.impl.disable.cache</name>
  <value>true</value>
</property>
{code}
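For illustration only, here's a minimal sketch (assuming nothing beyond the stock Hadoop FileSystem API, and not part of any patch here) of why closing the cached instance is destructive, and how an uncached instance from FileSystem.newInstance would sidestep it without the config workaround:

{code}
// Sketch: FileSystem.get() returns a process-wide cached instance, so closing
// it (effectively what HdfsDirectory's close path does above) breaks every
// other user of that cache entry. FileSystem.newInstance() bypasses the
// cache, so that instance is private and safe to close.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachedVsUncachedFs {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://namenode:8020/backups"); // illustrative URI

    FileSystem shared = FileSystem.get(uri, conf); // cached, shared JVM-wide

    // Uncached instance: try-with-resources can close it safely.
    try (FileSystem privateFs = FileSystem.newInstance(uri, conf)) {
      privateFs.exists(new Path("/backups/snapshot.shard1"));
    }

    shared.exists(new Path("/backups")); // shared instance still usable
  }
}
{code}

Something along those lines in HdfsBackupRepository might avoid the issue at the source; the sketch just shows the cache semantics.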
> RestoreCore needs the option to download files in parallel.
> -----------------------------------------------------------
>
>                 Key: SOLR-9961
>                 URL: https://issues.apache.org/jira/browse/SOLR-9961
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Backup/Restore
>    Affects Versions: 6.2.1
>            Reporter: Timothy Potter
>         Attachments: SOLR-9961.patch
>
>
> My backup to cloud storage (Google Cloud Storage in this case, but I think this is a general problem) takes 8 minutes... the restore of the same core takes hours. The restore loop in RestoreCore is serial and doesn't let me parallelize the expensive part of this operation (the IO from the remote cloud storage service). We need the option to parallelize the download (like distcp).
> Also, I tried downloading the same directory using gsutil and it was very fast, like 2 minutes. So I know it's not the pipe that's limiting perf here.
> Here's a very rough patch that does the parallelization. We may also want to consider a two-step approach: 1) download in parallel to a temp dir, 2) perform all of the checksum validation against the local temp dir. That will save round trips to the remote cloud storage.
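A rough sketch of the shape of that parallel loop (hypothetical names like copyOneFile and filesToRestore, not code from the attached patch): submit each per-file copy to a fixed-size pool and wait on the futures, instead of copying serially.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRestoreSketch {

  /** Stand-in for the per-file copy that RestoreCore performs today. */
  interface FileCopier {
    void copyOneFile(String fileName) throws Exception; // remote repo -> local index dir
  }

  static void restoreInParallel(List<String> filesToRestore, FileCopier copier, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Void>> results = new ArrayList<>();
      for (String fileName : filesToRestore) {
        results.add(pool.submit((Callable<Void>) () -> {
          copier.copyOneFile(fileName); // the expensive remote IO, now concurrent
          return null;
        }));
      }
      for (Future<Void> f : results) {
        f.get(); // a failed copy surfaces here as an ExecutionException
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}

The two-step variant would slot into the same structure: the pooled task writes to a local temp dir, and checksum validation runs against that dir after all futures complete, avoiding the extra round trips to the remote store.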