[ https://issues.apache.org/jira/browse/SOLR-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822190#comment-15822190 ]
Timothy Potter commented on SOLR-9961:
--------------------------------------

The other thing I found here is that HdfsDirectory closes a shared FileSystem object, because HdfsBackupRepository uses try-with-resources:

{code}
@Override
public void copyFileTo(URI sourceRepo, String fileName, Directory dest) throws IOException {
  try (HdfsDirectory dir = new HdfsDirectory(new Path(sourceRepo), NoLockFactory.INSTANCE,
      hdfsConfig, HdfsDirectory.DEFAULT_BUFFER_SIZE * 10)) {
    dest.copyFrom(dir, fileName, fileName, DirectoryFactory.IOCONTEXT_NO_CACHE);
  }
}
{code}

This closes the FileSystem object that was retrieved with FileSystem.get. Because of this (I think), I'm seeing lots of errors like the following while doing the restore:

{code}
WARN  - 2017-01-13 14:09:44.249; [   ] org.apache.solr.handler.RestoreCore; Exception while restoring the backup index
java.lang.RuntimeException: Problem creating directory: gs://hd-fusion/aggr_solr/myAggr3/snapshot.shard1
        at org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:91)
        at org.apache.solr.core.backup.repository.HdfsBackupRepository.copyFileTo(HdfsBackupRepository.java:175)
        at org.apache.solr.handler.RestoreCore.downloadFile(RestoreCore.java:196)
        at org.apache.solr.handler.RestoreCore.access$000(RestoreCore.java:47)
        at org.apache.solr.handler.RestoreCore$1.call(RestoreCore.java:101)
        at org.apache.solr.handler.RestoreCore$1.call(RestoreCore.java:99)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: GoogleHadoopFileSystem has been closed or not initialized.
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.checkOpen(GoogleHadoopFileSystemBase.java:1802)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1284)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
        at org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:83)
        ... 9 more
{code}

There's a handy property that disables the FileSystem cache (add it to core-site.xml), which makes this error go away:

{code}
<property>
  <name>fs.gs.impl.disable.cache</name>
  <value>true</value>
</property>
{code}
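For illustration only, here's a minimal sketch (assuming nothing beyond the stock Hadoop FileSystem API, and not part of any patch here) of why closing the cached instance is destructive, and how an uncached instance from FileSystem.newInstance would sidestep it without the config workaround:

{code}
// Sketch: FileSystem.get() returns a process-wide cached instance, so closing
// it (effectively what HdfsDirectory's close path does above) breaks every
// other user of that cache entry. FileSystem.newInstance() bypasses the
// cache, so that instance is private and safe to close.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachedVsUncachedFs {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://namenode:8020/backups"); // illustrative URI

    FileSystem shared = FileSystem.get(uri, conf); // cached, shared JVM-wide

    // Uncached instance: try-with-resources can close it safely.
    try (FileSystem privateFs = FileSystem.newInstance(uri, conf)) {
      privateFs.exists(new Path("/backups/snapshot.shard1"));
    }

    shared.exists(new Path("/backups")); // shared instance still usable
  }
}
{code}

Something along those lines in HdfsBackupRepository might avoid the issue at the source; the sketch just shows the cache semantics.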
> RestoreCore needs the option to download files in parallel.
> -----------------------------------------------------------
>
>                 Key: SOLR-9961
>                 URL: https://issues.apache.org/jira/browse/SOLR-9961
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Backup/Restore
>    Affects Versions: 6.2.1
>            Reporter: Timothy Potter
>         Attachments: SOLR-9961.patch
>
>
> My backup to cloud storage (Google Cloud Storage in this case, but I think this is a general problem) takes 8 minutes... the restore of the same core takes hours. The restore loop in RestoreCore is serial and doesn't let me parallelize the expensive part of this operation (the IO from the remote cloud storage service). We need the option to parallelize the download (like distcp).
> Also, I tried downloading the same directory using gsutil and it was very fast, like 2 minutes. So I know it's not the pipe that's limiting perf here.
> Here's a very rough patch that does the parallelization. We may also want to consider a two-step approach: 1) download in parallel to a temp dir, 2) perform all of the checksum validation against the local temp dir. That will save round trips to the remote cloud storage.
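A rough sketch of the shape of that parallel loop (hypothetical names like copyOneFile and filesToRestore, not code from the attached patch): submit each per-file copy to a fixed-size pool and wait on the futures, instead of copying serially.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRestoreSketch {

  /** Stand-in for the per-file copy that RestoreCore performs today. */
  interface FileCopier {
    void copyOneFile(String fileName) throws Exception; // remote repo -> local index dir
  }

  static void restoreInParallel(List<String> filesToRestore, FileCopier copier, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Void>> results = new ArrayList<>();
      for (String fileName : filesToRestore) {
        results.add(pool.submit((Callable<Void>) () -> {
          copier.copyOneFile(fileName); // the expensive remote IO, now concurrent
          return null;
        }));
      }
      for (Future<Void> f : results) {
        f.get(); // a failed copy surfaces here as an ExecutionException
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}

The two-step variant would slot into the same structure: the pooled task writes to a local temp dir, and checksum validation runs against that dir after all futures complete, avoiding the extra round trips to the remote store.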