[ https://issues.apache.org/jira/browse/HDFS-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234627#comment-16234627 ]

Logesh Rangan commented on HDFS-12753:
--------------------------------------

But our production environment doesn't offer a DynamoDB instance for S3Guard. 
Is there a way to tune the distcp options for copying these huge files? I'm 
looking for the following information:

1) How should I select the number of maps and their size? I have a directory 
with ~10,000+ files totalling ~250 GB. When I run with the options below, the 
copy takes ~1.30 hours.

hadoop distcp \
  -D HADOOP_OPTS=-Xmx12g \
  -D HADOOP_CLIENT_OPTS='-Xmx12g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled' \
  -D 'mapreduce.map.memory.mb=12288' -D 'mapreduce.map.java.opts=-Xmx10g' \
  -D 'mapreduce.reduce.memory.mb=12288' -D 'mapreduce.reduce.java.opts=-Xmx10g' \
  '-Dfs.s3a.proxy.host=edhmgrn-prod.cloud.capitalone.com' '-Dfs.s3a.proxy.port=8088' \
  '-Dfs.s3a.access.key=XXXXXXX' '-Dfs.s3a.secret.key=XXXXXXX' \
  '-Dfs.s3a.connection.timeout=180000' '-Dfs.s3a.attempts.maximum=5' \
  '-Dfs.s3a.fast.upload=true' '-Dfs.s3a.fast.upload.buffer=array' \
  '-Dfs.s3a.fast.upload.active.blocks=50' '-Dfs.s3a.multipart.size=262144000' \
  '-Dfs.s3a.threads.max=500' '-Dfs.s3a.threads.keepalivetime=600' \
  '-Dfs.s3a.server-side-encryption-algorithm=AES256' \
  -bandwidth 3072 -strategy dynamic -m 200 -numListstatusThreads 30 \
  /src/ s3a://bucket/dest
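
For reference, here is my rough back-of-the-envelope estimate of what that run 
achieved (assuming "~1.30 hours" means roughly 90 minutes; the figures are 
illustrative only):

# Assumed figures from above: ~250 GB copied in ~90 minutes with -m 200.
TOTAL_MB=$((250 * 1024))   # ~250 GB expressed in MB
ELAPSED_S=$((90 * 60))     # ~90 minutes in seconds
MAPS=200                   # from -m 200

# Aggregate rate, and the per-map share if all 200 maps ran concurrently.
echo "aggregate MB/s: $(echo "scale=1; $TOTAL_MB / $ELAPSED_S" | bc)"         # ~47 MB/s
echo "per-map MB/s:   $(echo "scale=2; $TOTAL_MB / $ELAPSED_S / $MAPS" | bc)" # ~0.24 MB/s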

2) I'm not seeing 3 Gbps of throughput even after setting -bandwidth to 3072.
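
For illustration, my current understanding (from the DistCp documentation; 
please correct me if I'm wrong) is that -bandwidth is a cap in MB/second 
applied to each map, not an aggregate target. A hypothetical invocation with 
placeholder values would look like:

# Hypothetical example only: each map is throttled to ~50 MB/s (a ceiling, not
# a target), so the aggregate ceiling is ~50 MB/s times the concurrent maps.
hadoop distcp -bandwidth 50 -m 100 -strategy dynamic /src/ s3a://bucket/dest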

3) How should I configure the Java heap and map size for these huge files so 
that distcp gives better performance?
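
For illustration, this is how I currently understand these settings are applied 
(heap sizes below are placeholders, not recommendations): the mapreduce.*.memory.mb 
and mapreduce.*.java.opts values passed with -D size the copy mappers, while 
HADOOP_CLIENT_OPTS is an environment variable read by the hadoop launcher script, 
so it is exported in the shell rather than passed with -D.

# Illustrative sketch only; heap sizes are placeholders, not recommendations.
# Client-side JVM (the process that builds the copy listing):
export HADOOP_CLIENT_OPTS="-Xmx4g"

# Container size and heap for each copy mapper:
hadoop distcp \
  -Dmapreduce.map.memory.mb=4096 \
  -Dmapreduce.map.java.opts=-Xmx3276m \
  /src/ s3a://bucket/dest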

4) With the fast upload option, I'm writing the files to S3 using threads. 
Could you please suggest some tuning options for this?
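
For reference, the S3A options I expect to matter here (again my understanding; 
the values below are placeholders, not recommendations) are fs.s3a.fast.upload.buffer, 
fs.s3a.multipart.size, fs.s3a.fast.upload.active.blocks and fs.s3a.threads.max, since 
with array buffering each open stream can hold roughly multipart.size * active.blocks 
of data in memory:

# Hypothetical, more conservative fast-upload settings (placeholders only).
# With buffer=disk the blocks are staged on local disk instead of the JVM heap;
# the uploads themselves are still parallelised by the S3A thread pool.
hadoop distcp \
  -Dfs.s3a.fast.upload=true \
  -Dfs.s3a.fast.upload.buffer=disk \
  -Dfs.s3a.multipart.size=134217728 \
  -Dfs.s3a.fast.upload.active.blocks=8 \
  -Dfs.s3a.threads.max=64 \
  /src/ s3a://bucket/dest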

Appreciate your help.

> Getting file not found exception while using distcp with s3a
> ------------------------------------------------------------
>
>                 Key: HDFS-12753
>                 URL: https://issues.apache.org/jira/browse/HDFS-12753
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Logesh Rangan
>
> I'm using the distcp option to copy the huge files from Hadoop to S3. 
> Sometimes i'm getting the below error,
> *Command:* (Copying 378 GB data)
> _hadoop distcp -D HADOOP_OPTS=-Xmx12g -D HADOOP_CLIENT_OPTS='-Xmx12g 
> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled' -D 
> 'mapreduce.map.memory.mb=12288' -D 'mapreduce.map.java.opts=-Xmx10g' -D 
> 'mapreduce.reduce.memory.mb=12288' -D 'mapreduce.reduce.java.opts=-Xmx10g' 
> '-Dfs.s3a.proxy.host=edhmgrn-prod.cloud.capitalone.com' 
> '-Dfs.s3a.proxy.port=8088' '-Dfs.s3a.access.key=XXXXXXX' 
> '-Dfs.s3a.secret.key=XXXXXXX' '-Dfs.s3a.connection.timeout=180000' 
> '-Dfs.s3a.attempts.maximum=5' '-Dfs.s3a.fast.upload=true' 
> '-Dfs.s3a.fast.upload.buffer=array' '-Dfs.s3a.fast.upload.active.blocks=50' 
> '-Dfs.s3a.multipart.size=262144000' '-Dfs.s3a.threads.max=500' 
> '-Dfs.s3a.threads.keepalivetime=600' 
> '-Dfs.s3a.server-side-encryption-algorithm=AES256' -bandwidth 3072 -strategy 
> dynamic -m 220 -numListstatusThreads 30 /src/ s3a://bucket/dest
> _
> 17/11/01 12:23:27 INFO mapreduce.Job: Task Id : 
> attempt_1497120915913_2792335_m_000165_0, Status : FAILED
> Error: java.io.FileNotFoundException: No such file or directory: 
> s3a://bucketname/filename
>         at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1132)
>         at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:78)
>         at 
> org.apache.hadoop.tools.util.DistCpUtils.preserve(DistCpUtils.java:197)
>         at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:256)
>         at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1912)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> 17/11/01 12:28:32 INFO mapreduce.Job: Task Id : 
> attempt_1497120915913_2792335_m_000010_0, Status : FAILED
> Error: java.io.IOException: File copy failed: hdfs://nameservice1/filena --> 
> s3a://cof-prod-lake-card/src/seam/acct_scores/acctmdlscore_card_cobna_anon_vldtd/instnc_id=20161023000000/000004_0_copy_6
>         at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:284)
>         at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:252)
>         at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1912)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.io.IOException: Couldn't run retriable-command: Copying 
> hdfs://nameservice1/filename to s3a://bucketname/filename
>         at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
>         at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:280)
>         ... 10 more
> Caused by: com.cloudera.com.amazonaws.AmazonClientException: Failed to parse 
> XML document with handler class 
> com.cloudera.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
>         at 
> com.cloudera.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:164)
>         at 
> com.cloudera.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:299)
>         at 
> com.cloudera.com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:77)
>         at 
> com.cloudera.com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:74)
>         at 
> com.cloudera.com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
>         at 
> com.cloudera.com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
>         at 
> com.cloudera.com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:1072)
>         at 
> com.cloudera.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:746)
>         at 
> com.cloudera.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
>         at 
> com.cloudera.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
>         at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
>         at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3738)
>         at 
> com.cloudera.com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:653)
>         at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1096)
>         at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1279)
>         at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1268)
>         at 
> org.apache.hadoop.fs.s3a.S3AFastOutputStream.close(S3AFastOutputStream.java:257)
>         at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>         at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
>         at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
>         at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyBytes(RetriableFileCopyCommand.java:261)
>         at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.copyToFile(RetriableFileCopyCommand.java:184)
>         at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:124)
>         at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
>         at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>         ... 11 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 2; XML 
> document structures must start and end within the same entity.
>         at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>         at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>         at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>         at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>         at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>         at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown 
> Source)
>         at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
>         at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>         at org.apache.xerces.impl.XMLEntityScanner.skipChar(Unknown Source)
>         at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown
>  Source)
>         at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>         at 
> com.cloudera.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:151)
>         ... 35 more
> And also please help me in choosing the number of mappers and what should I 
> do to copy the data faster to S3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
