[
https://issues.apache.org/jira/browse/IMPALA-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Armstrong updated IMPALA-3607:
----------------------------------
Issue Type: Improvement (was: Bug)
> Reduce test data loading time from snapshot
> -------------------------------------------
>
> Key: IMPALA-3607
> URL: https://issues.apache.org/jira/browse/IMPALA-3607
> Project: IMPALA
> Issue Type: Improvement
> Components: Infrastructure
> Affects Versions: Impala 2.5.0
> Reporter: Dimitris Tsirogiannis
> Priority: Minor
> Labels: test-infra
>
> Loading test data from a snapshot takes a significant amount of time
> (~20-30 min). Given the amount of data loaded (~4GB), loading test data
> into a local 3-node HDFS minicluster should be significantly faster.
> The process currently works as follows:
> 1. Download the latest snapshot
> 2. Unzip
> 3. Use hdfs dfs -put command to copy from local file system to hdfs
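> The three steps above can be sketched as follows (the snapshot URL and
> paths are hypothetical, and `run` only echoes each command so the flow can
> be traced without a live cluster; drop the echo to execute for real):

```shell
#!/bin/sh
# Dry-run sketch of the current snapshot load path.
# run() prints the command instead of executing it.
run() { echo "+ $*"; }

SNAPSHOT_URL="https://example.com/impala/testdata-snapshot.tar.gz"  # hypothetical
WORKDIR="/tmp/impala-testdata"

run curl -o "$WORKDIR/snapshot.tar.gz" "$SNAPSHOT_URL"   # 1. download latest snapshot
run tar -xzf "$WORKDIR/snapshot.tar.gz" -C "$WORKDIR"    # 2. unzip
run hdfs dfs -put "$WORKDIR/testdata" /test-warehouse    # 3. per-file copy, all through the NameNode
```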
> We believe the bulk of the time goes to step #3 and is attributable to
> NameNode overhead. Below are a few ideas we can try to improve this:
> 1. Use a backup-and-restore approach for HDFS metadata/data that doesn't go
> through the NameNode. For example, once data is loaded to an HDFS cluster
> using the old approach, create two snapshots, one for metadata and one for
> data. Loading the test data is then just a matter of unzipping the snapshots
> into the appropriate directories. A similar approach is used to back up and
> restore HDFS clusters
> (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hdfs_metadata_backup.html).
> A Jenkins job would still be responsible for checking for changes in the test
> data, doing the slow data load, and creating the new snapshots.
> 2. Other ideas include the use of EC2 AMIs, Docker, and/or HDFS checkpointing.
> 3. Use faster compression/decompression tools.
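> Idea #1 above might look like the following sketch (all paths are
> hypothetical, and it assumes the minicluster is stopped so the
> `dfs.namenode.name.dir` and `dfs.datanode.data.dir` trees are consistent;
> restore then bypasses the NameNode entirely):

```shell
#!/bin/sh
# Sketch: archive the NameNode metadata and DataNode block directories
# directly, then restore by unpacking them in place.
NN_DIR="/data/dfs/nn"      # dfs.namenode.name.dir (hypothetical path)
DN_DIR="/data/dfs/dn"      # dfs.datanode.data.dir (hypothetical path)

backup() {
  tar -czf nn-meta.tar.gz -C "$(dirname "$NN_DIR")" "$(basename "$NN_DIR")"
  tar -czf dn-data.tar.gz -C "$(dirname "$DN_DIR")" "$(basename "$DN_DIR")"
}

restore() {
  rm -rf "$NN_DIR" "$DN_DIR"
  tar -xzf nn-meta.tar.gz -C "$(dirname "$NN_DIR")"
  tar -xzf dn-data.tar.gz -C "$(dirname "$DN_DIR")"
}
```

> The Jenkins job would run `backup` after a slow full load; every later
> data load is just `restore` plus a minicluster restart.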
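> For idea #3, the trade-off is compression ratio versus (de)compression
> speed. As a stand-in sketch, gzip's own levels illustrate the knob; tools
> like lz4 or zstd push much further toward the fast end:

```shell
#!/bin/sh
# Compare archive sizes at the fastest and slowest gzip levels,
# then verify decompression restores the original bytes.
f=$(mktemp)
head -c 1000000 /dev/zero > "$f"       # 1 MB sample payload

for lvl in 1 9; do
  gzip -c -$lvl "$f" > "$f.$lvl.gz"
  printf 'level %s: %s bytes\n' "$lvl" "$(wc -c < "$f.$lvl.gz")"
done

# round-trip check
gzip -dc "$f.1.gz" | cmp -s - "$f" && echo "round-trip ok"
```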
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)