[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default

2020-05-07 Thread Mithun Radhakrishnan (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102093#comment-17102093
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
--

Sorry for the late reply. I am supportive of rolling this back. +1, non-binding.

The workaround I suggested is unwieldy, and this change was not intended 
to mess up non-HDFS DistCp sources/targets.

bq. What made sense back then doesn't make sense now. 

Agreed, [~kihwal]. I suspect production DistCp jobs through Oozie DistCp 
Actions might already be preserving block-sizes.

Given that HDFS-13056 is in, DistCp should now be free to do CRC checks, 
without depending on matching HDFS block sizes.

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Fix For: 3.0.0-alpha4
>
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch, 
> HADOOP-8143.3.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default

2020-04-02 Thread Mithun Radhakrishnan (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073938#comment-17073938
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
--

Thank you for pointing to HDFS-13056. This should address the crux of the 
problem, i.e. decoupling checksums from block-size. I have only perused it 
briefly, but the following section from HDFS-13056's description is 
promising:
{quote}This option can be enabled or disabled at the granularity of individual 
client calls by setting the new configuration option 
`dfs.checksum.combine.mode` to `COMPOSITE_CRC`
{quote}
It appears that this doesn't require opt-in on the HDFS/NameNode side, and that 
querying for a file's checksum with 
{{dfs.checksum.combine.mode=COMPOSITE_CRC}} should return a CRC independent 
of block-size.

If this holds, perhaps DistCp should be changed to fetch CRCs thus, freeing us 
from having to preserve block-size for the sake of correctness. (It'll only 
hold on Hadoop 3.1.1+.)
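
If so, fetching comparable CRCs should be roughly this simple (a sketch only, 
assuming a Hadoop 3.1.1+ client; {{srcPath}} and {{tgtPath}} are placeholders):
{code:java}
// Enable composite CRCs on the client, then compare source and target
// checksums; the result should no longer depend on either file's block-size.
Configuration conf = new Configuration();
conf.set("dfs.checksum.combine.mode", "COMPOSITE_CRC");
FileChecksum srcSum = srcPath.getFileSystem(conf).getFileChecksum(srcPath);
FileChecksum tgtSum = tgtPath.getFileSystem(conf).getFileChecksum(tgtPath);
boolean crcsMatch = (srcSum != null) && srcSum.equals(tgtSum);
{code}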
{quote}At the very least, we need a way to turn this new default off. 
Especially when -skipCrcCheck is true.
{quote}
I'm a little rusty, but it surprises me if block-size preservation isn't 
turned off when {{-skipCrcCheck}} is set and {{-pb}} isn't. If it isn't turned 
off, that's an oversight and needs fixing. As a workaround, specifying 
{{-pu}}, for instance, should disable block-size preservation.

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Fix For: 3.0.0-alpha4
>
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch, 
> HADOOP-8143.3.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-20 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056578#comment-16056578
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
--

bq. What you proposed, sounds to be that -pb becomes a deprecated option 
because block size is always preserved.

Ah, yes. I see. I stand corrected. :] Your phrasing is more accurate. Thank you.

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Fix For: 3.0.0-alpha4
>
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch, 
> HADOOP-8143.3.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-20 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056478#comment-16056478
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
--

bq. If -p option of distcp command is unspecified, block size is preserved.

That looks good. What about:
{noformat}
Block-size is preserved, even if the "-p" option of the distcp command is 
unspecified.
{noformat}
?

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Fix For: 3.0.0-alpha4
>
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch, 
> HADOOP-8143.3.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Attachment: HADOOP-8143.3.patch

Sorry, just saw that. Here's the correction.

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch, 
> HADOOP-8143.3.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Status: Patch Available  (was: Open)

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch, 
> HADOOP-8143.3.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Status: Open  (was: Patch Available)

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Status: Open  (was: Patch Available)

Re-submitting for tests.

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Status: Patch Available  (was: Open)

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2017-06-19 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Attachment: HADOOP-8143.2.patch

Rebased to work with changes on trunk.

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
> Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] (HADOOP-11794) distcp can copy blocks in parallel

2017-01-31 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847403#comment-15847403
 ] 

Mithun Radhakrishnan commented on HADOOP-11794:
---

Wow, this is really good work. (I'm continually astonished at how much DistCp 
has been improved upon and added to.)
Please forgive me; my DistCp-ese is a little rusty. I have a few minor 
questions:
# In {{DistCpUtils::toCopyListingFileStatus()}}, the javadoc says it 
{{"Converts a list of FileStatus to a list CopyListingFileStatus"}}. The method 
does not take a {{List}}. Shall we remove {{"list of"}}?
# Could we rephrase the doc to {{"Converts a `FileStatus` to a list of 
`CopyListingFileStatus`. Returns either one CopyListingFileStatus per chunk of 
file-blocks (if file-size exceeds chunk-size), or one CopyListingFileStatus for 
the entire file (if file-size is too small to split)."}}?
# {{DistCpUtils::toCopyListingFileStatus()}} handles heterogeneous block-sizes 
via {{DFSClient.getBlockLocations()}}, but only if {{fileStatus.getLen() > 
fileStatus.getBlockSize()*chunkSize}}. Is it possible for an HDFS file with 
{{fileStatus.getBlockSize() == 256M}} to be composed entirely of tiny blocks 
(say 32MB)? Could we have a situation where a splittable file (with small 
blocks) ends up unsplit, because {{fileStatus.getBlockSize() >> 
effectiveBlockSize}}?
# I wonder if {{chunksize}} might be confused with the "chunk-length in bytes" 
(like {{CopyListingFileStatus.chunkLength}}). I could be wrong, but would 
{{blocksPerChunk}} be less ambiguous? (Please ignore if this is too pervasive.)
# Nitpick: {{CopyListingFileStatus.toString()}} uses String concatenation 
inside a call to {{StringBuilder.append()}}. (It was that way well before this 
patch. :/) Shall we replace this with a chain of {{.append()}} calls? (See the 
sketch after this list.)
# In {{CopyCommitter::concatFileChunks()}}, could we please add additional 
logging for what files/chunks are being merged?
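
For item 5, here's a small sketch of what I mean ({{sb}} is the builder; 
{{chunkOffset}} and {{chunkLength}} are illustrative field names):
{code:java}
// String concatenation inside append() builds an intermediate String:
sb.append("chunkOffset = " + chunkOffset + ", chunkLength = " + chunkLength);

// A chain of append() calls avoids the intermediate allocation:
sb.append("chunkOffset = ").append(chunkOffset)
  .append(", chunkLength = ").append(chunkLength);
{code}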

Thanks so much for working on this, [~yzhangal]. :]

> distcp can copy blocks in parallel
> --
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 0.21.0
>Reporter: dhruba borthakur
>Assignee: Yongjun Zhang
> Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, 
> HADOOP-11794.003.patch, MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14015) Partitions on Remote HDFS break encryption-zone checks

2017-01-20 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-14015:
--
Attachment: HADOOP-14015.1.patch

Here, I fetch the {{FileSystem}} instance appropriate for the data-path.
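
Roughly, the idea is as follows (a sketch, not the actual patch; 
{{partitionPath}} and {{conf}} stand in for the real arguments):
{code:java}
// Resolve the FileSystem from the partition path itself, instead of using
// the local cluster's default FileSystem, before the encryption-zone check.
FileSystem fs = partitionPath.getFileSystem(conf);
HdfsAdmin admin = new HdfsAdmin(fs.getUri(), conf);
EncryptionZone zone = admin.getEncryptionZoneForPath(partitionPath);
boolean isEncrypted = (zone != null);
{code}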

> Partitions on Remote HDFS break encryption-zone checks
> --
>
> Key: HADOOP-14015
> URL: https://issues.apache.org/jira/browse/HADOOP-14015
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 1.2.1, 2.1.1-beta
>Reporter: Mithun Radhakrishnan
> Attachments: HADOOP-14015.1.patch
>
>
> This is in relation to HIVE-13243, which fixes encryption-zone checks for 
> external tables.
> Unfortunately, this is still borked for partitions with remote HDFS paths. 
> The code fails as follows:
> {noformat}
> 2015-12-09 19:26:14,997 ERROR [pool-4-thread-1476] server.TThreadPoolServer 
> (TThreadPoolServer.java:run_aroundBody0(305)) - Error occurred during 
> processing of message.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://remote-cluster-nn1.myth.net:8020/dbs/mythdb/myth_table/dt=20170120, 
> expected: hdfs://local-cluster-n1.myth.net:8020
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1985)
> at 
> org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
> at 
> org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1290)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.checkTrashPurgeCombination(HiveMetaStore.java:1746)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partitions_req(HiveMetaStore.java:2974)
> at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
> at com.sun.proxy.$Proxy5.drop_partitions_req(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:10005)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:9989)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:767)
> at 
> org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:763)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at 
> org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:763)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody0(TThreadPoolServer.java:285)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody1$advice(TThreadPoolServer.java:101)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:1)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> I have a really simple fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14015) Partitions on Remote HDFS break encryption-zone checks

2017-01-20 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-14015:
--
Affects Version/s: 1.2.1
   2.1.1-beta
   Status: Patch Available  (was: Open)

> Partitions on Remote HDFS break encryption-zone checks
> --
>
> Key: HADOOP-14015
> URL: https://issues.apache.org/jira/browse/HADOOP-14015
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.1.1-beta, 1.2.1
>Reporter: Mithun Radhakrishnan
> Attachments: HADOOP-14015.1.patch
>
>
> This is in relation to HIVE-13243, which fixes encryption-zone checks for 
> external tables.
> Unfortunately, this is still borked for partitions with remote HDFS paths. 
> The code fails as follows:
> {noformat}
> 2015-12-09 19:26:14,997 ERROR [pool-4-thread-1476] server.TThreadPoolServer 
> (TThreadPoolServer.java:run_aroundBody0(305)) - Error occurred during 
> processing of message.
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://remote-cluster-nn1.myth.net:8020/dbs/mythdb/myth_table/dt=20170120, 
> expected: hdfs://local-cluster-n1.myth.net:8020
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1985)
> at 
> org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
> at 
> org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1290)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.checkTrashPurgeCombination(HiveMetaStore.java:1746)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partitions_req(HiveMetaStore.java:2974)
> at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
> at com.sun.proxy.$Proxy5.drop_partitions_req(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:10005)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:9989)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:767)
> at 
> org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:763)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at 
> org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:763)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody0(TThreadPoolServer.java:285)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody1$advice(TThreadPoolServer.java:101)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:1)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> I have a really simple fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-14015) Partitions on Remote HDFS break encryption-zone checks

2017-01-20 Thread Mithun Radhakrishnan (JIRA)
Mithun Radhakrishnan created HADOOP-14015:
-

 Summary: Partitions on Remote HDFS break encryption-zone checks
 Key: HADOOP-14015
 URL: https://issues.apache.org/jira/browse/HADOOP-14015
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Mithun Radhakrishnan


This is in relation to HIVE-13243, which fixes encryption-zone checks for 
external tables.
Unfortunately, this is still borked for partitions with remote HDFS paths. The 
code fails as follows:

{noformat}
2015-12-09 19:26:14,997 ERROR [pool-4-thread-1476] server.TThreadPoolServer 
(TThreadPoolServer.java:run_aroundBody0(305)) - Error occurred during 
processing of message.
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://remote-cluster-nn1.myth.net:8020/dbs/mythdb/myth_table/dt=20170120, 
expected: hdfs://local-cluster-n1.myth.net:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1985)
at 
org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
at 
org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1290)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.checkTrashPurgeCombination(HiveMetaStore.java:1746)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partitions_req(HiveMetaStore.java:2974)
at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
at com.sun.proxy.$Proxy5.drop_partitions_req(Unknown Source)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:10005)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:9989)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:767)
at 
org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:763)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
at 
org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:763)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody0(TThreadPoolServer.java:285)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody1$advice(TThreadPoolServer.java:101)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:1)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

I have a really simple fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default

2017-01-20 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832502#comment-15832502
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
--

It's that time of year again when one wonders whether this fix may be 
considered for submission. :] 
How about it, chaps?

> Change distcp to have -pb on by default
> ---
>
> Key: HADOOP-8143
> URL: https://issues.apache.org/jira/browse/HADOOP-8143
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Dave Thompson
>Assignee: Mithun Radhakrishnan
>Priority: Minor
>  Labels: BB2015-05-TBR
> Attachments: HADOOP-8143.1.patch
>
>
> We should have the preserve-blocksize option (-pb) on in distcp by default.
> The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-12473) distcp's ignoring failures option should be mutually exclusive with the atomic option

2015-12-22 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068624#comment-15068624
 ] 

Mithun Radhakrishnan commented on HADOOP-12473:
---

[~jira.shegalov], that is an interesting take. Hmm.

Between you and me, I think no one should be using {{-i}} at all, in atomic 
copies or otherwise. It was included to be backward compatible with DistCpV1, 
for those with an inexplicable tolerance for bad data. :]

{{-atomic}} was added so that users have the choice of staging their copies to 
a temp-location, before atomically moving them to the target location. I 
guessed there might be users who'd want to stage data before moving them, but 
could also tolerate bad copies. But I do see your point of view.

{{-i}} could be useful to work around annoying copy errors. For instance, there 
was a time when {{-skipCrc}} wouldn't work correctly, and copying files with 
different block-sizes (or empty files) would result in CRC failures. {{-i}} 
would let workflows complete while DistCp was under fix. Removing this makes 
the workaround unavailable when {{-atomic}} is used.

I'm on the fence here, but tending in your direction. I'd be happy to go along 
if you could get another "Aye!" from a committer. Paging [~jlowe] and [~daryn].



> distcp's ignoring failures option should be mutually exclusive with the 
> atomic option
> -
>
> Key: HADOOP-12473
> URL: https://issues.apache.org/jira/browse/HADOOP-12473
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 2.7.1
>Reporter: Mingliang Liu
>Assignee: Mingliang Liu
> Fix For: 2.8.0
>
>
> In {{CopyMapper::handleFailure}}, the mapper handles a failure and will ignore 
> it if its config key is on. The ignore-failures option ({{-i}}) should be 
> mutually exclusive with the {{-atomic}} option; otherwise an incomplete dir is 
> eligible for commit, defeating the purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel

2015-12-21 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067007#comment-15067007
 ] 

Mithun Radhakrishnan commented on HADOOP-11794:
---

[~yzhangal],

bq. Appreciate your excellent work!
You're too kind. :]

bq. But I'm making it more flexible here, such that we can support variable 
number blocks per split.
I agree with the principle of what you're suggesting. Combining multiple splits 
into a larger split (based on size) is a problem that 
{{CombineFileInputFormat}} provides a solution for. Do you think we can use 
{{CombineFileInputFormat}} to combine block-level splits into a larger split?

bq. We need some new client-namenode API protocol to get back the locatedBlocks 
for the specified block range...
Hmm... Do we? DistCp copies whole files (even if at a split level). Since we 
can retrieve located blocks for all blocks in the file, shouldn't that be 
enough? We could group locatedBlocks by block-id. Perhaps I'm missing something.


> distcp can copy blocks in parallel
> --
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 0.21.0
>Reporter: dhruba borthakur
>Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-12469) distcp should not ignore the ignoreFailures option

2015-12-21 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067171#comment-15067171
 ] 

Mithun Radhakrishnan commented on HADOOP-12469:
---

Ah, I see what you did there. +1.

> distcp should not ignore the ignoreFailures option
> --
>
> Key: HADOOP-12469
> URL: https://issues.apache.org/jira/browse/HADOOP-12469
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 2.7.1
>Reporter: Gera Shegalov
>Assignee: Mingliang Liu
>Priority: Critical
> Fix For: 2.8.0
>
> Attachments: HADOOP-12469.000.patch, HADOOP-12469.001.patch
>
>
> {{RetriableFileCopyCommand.CopyReadException}} is double-wrapped:
> # via {{RetriableCommand::execute}}
> # via {{CopyMapper#copyFileWithRetry}}
> before {{CopyMapper::handleFailure}} tests 
> {code}
> if (ignoreFailures && exception.getCause() instanceof
> RetriableFileCopyCommand.CopyReadException)
> {code}
> which is always false.
> Orthogonally, ignoring failures should be mutually exclusive with the atomic 
> option; otherwise an incomplete dir is eligible for commit, defeating the 
> purpose.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel

2015-12-21 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067227#comment-15067227
 ] 

Mithun Radhakrishnan commented on HADOOP-11794:
---

Ah, I finally see. That makes complete sense. Thank you for the pointer to the 
JIRA.

Also, {{CombineFileInputFormat}} might work with {{UniformSizeInputFormat}}, 
but it might not with {{DynamicInputFormat}}. Maybe combining a configurable 
number of blocks (ranges) into splits would be easier to work with.

I see what you're doing, and I agree.

> distcp can copy blocks in parallel
> --
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 0.21.0
>Reporter: dhruba borthakur
>Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-12473) distcp's ignoring failures should be mutually exclusive with the atomic option

2015-12-21 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067214#comment-15067214
 ] 

Mithun Radhakrishnan commented on HADOOP-12473:
---

[~jira.shegalov], [~liuml07], could you please explain the reasoning behind 
this fix?

If I understand correctly, Gera's orthogonal suggestion in HADOOP-12469 was to 
make {{-atomic}} and {{-i}} mutually exclusive. The latest patch in this JIRA 
doesn't seem to address this concern, AFAICT. It makes {{ignoreFailures}} an 
{{AtomicBoolean}}, which is not what Gera was getting at, I believe.

Also, [~jira.shegalov], why do you recommend that {{-atomic}} and {{-i}} be 
mutually exclusive? Aren't they orthogonal concerns? Why consider {{-atomic}} 
as incapable of ignoring copy-errors?



> distcp's ignoring failures should be mutually exclusive with the atomic option
> --
>
> Key: HADOOP-12473
> URL: https://issues.apache.org/jira/browse/HADOOP-12473
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 2.7.1
>Reporter: Mingliang Liu
>Assignee: Mingliang Liu
> Fix For: 2.8.0
>
> Attachments: HADOOP-12473.000.patch, HADOOP-12473.001.patch, 
> HADOOP-12473.002.patch
>
>
> In {{CopyMapper::handleFailure}}, the mapper handles a failure and will ignore 
> it if its config key is on. Ignoring failures should be mutually 
> exclusive with the atomic option; otherwise an incomplete dir is eligible for 
> commit, defeating the purpose.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel

2015-12-21 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067142#comment-15067142
 ] 

Mithun Radhakrishnan commented on HADOOP-11794:
---

bq. My argument is that fetching all block locations for a file is not as 
efficient as fetching only the block range the mapper is assigned to work on.

Thank you for explaining. Let me see if I can phrase my questions more clearly 
than before:

# Would it make sense to include the block-locations within the splits, at the 
time of split-calculation, instead of the block-ranges? If yes, then we can 
make do with the API we already have, by fetching locatedBlocks for all files, 
and grouping them among the DistCp splits. (It is indeed possible that keeping 
ranges, and using your proposed API on the map-side might be faster. But those 
map-side calls might possibly also exert more parallel load on the name-node, 
depending on the number of maps.)

# Naive question: Why do we need to identify locatedBlocks? Don't HDFS files 
have uniformly sized blocks (within a file)? As such, aren't the 
block-boundaries implicit (i.e. from {{blockId*blockSize}} to 
{{(blockId+1)*blockSize - 1}})? Can't we simply copy that range of bytes into 
a new file (and stitch the new files in reduce)? (See the sketch below.)
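
A sketch of the arithmetic in question 2 (assuming {{fileStatus}} and 
{{blockId}} are in scope; the last block may be shorter than {{blockSize}}):
{code:java}
// Implicit block boundaries for a file with uniformly sized blocks:
long blockSize = fileStatus.getBlockSize();
long start = blockId * blockSize;                // first byte of the block
long end = Math.min((blockId + 1) * blockSize,   // one past the last byte...
    fileStatus.getLen()) - 1;                    // ...clamped to file length
{code}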

> distcp can copy blocks in parallel
> --
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 0.21.0
>Reporter: dhruba borthakur
>Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel

2015-12-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044105#comment-15044105
 ] 

Mithun Radhakrishnan commented on HADOOP-11794:
---

[~yzhangal]: Thank you, sir. Please do. Hive has kept me too busy to devote 
time here. I'd be happy to review your work.

I had a patch a couple of years ago which split files on block-boundaries, 
copied them over, and then stitched them together using 
{{DistributedFileSystem.concat()}} in a reduce-step. If I can find the patch, 
I'll ping it to you, but it's not terribly hard to do this from scratch. The 
prototype had very promising performance.

I look forward to your solution.

> distcp can copy blocks in parallel
> --
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 0.21.0
>Reporter: dhruba borthakur
>Assignee: Mithun Radhakrishnan
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HADOOP-11794) distcp can copy blocks in parallel

2015-12-06 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-11794:
--
Assignee: Yongjun Zhang  (was: Mithun Radhakrishnan)

> distcp can copy blocks in parallel
> --
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 0.21.0
>Reporter: dhruba borthakur
>Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel

2015-12-06 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044373#comment-15044373
 ] 

Mithun Radhakrishnan commented on HADOOP-11794:
---

Sorry, no. That's likely [~dhruba]'s work, which might have been based on the 
DistCp-v1 code. We'll need new code for DistCp-v2 (i.e. my rewrite 
from MAPREDUCE-2765).

Apologies if you've already thought this through. One would need to change the 
{{DynamicInputFormat#createSplits()}} implementation, which currently looks 
thus:

{code:java:borderStyle=solid:title=DynamicInputFormat.java}
  private List<InputSplit> createSplits(JobContext jobContext,
                                        List<DynamicInputChunk> chunks)
      throws IOException {
    int numMaps = getNumMapTasks(jobContext.getConfiguration());

    final int nSplits = Math.min(numMaps, chunks.size());
    List<InputSplit> splits = new ArrayList<InputSplit>(nSplits);

    for (int i = 0; i < nSplits; ++i) {
      TaskID taskId = new TaskID(jobContext.getJobID(), TaskType.MAP, i);
      chunks.get(i).assignTo(taskId);
      splits.add(new FileSplit(chunks.get(i).getPath(), 0,
          // Setting non-zero length for FileSplit size, to avoid a possible
          // future when 0-sized file-splits are considered "empty" and skipped
          // over.
          getMinRecordsPerChunk(jobContext.getConfiguration()),
          null));
    }
    DistCpUtils.publish(jobContext.getConfiguration(),
        CONF_LABEL_NUM_SPLITS, splits.size());
    return splits;
  }
{code}

You'll need to create a {{FileSplit}} per file-block (by first examining the 
file's block-size). The mappers will now need to emit something like 
{{(relativePathForOriginalSourceFile, targetLocation_with_block_number)}}. By 
keying on the relative-source-paths (+ expected number of blocks), you can get 
all the target-block-locations to hit the same reducer, where you can stitch 
them together. 
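
For the reduce-side stitch, something like this should work (a sketch; 
{{targetFS}}, {{blockParts}} and {{finalTargetPath}} are illustrative names):
{code:java}
// Once all block-parts of one source file have landed, stitch them back
// into a single file via HDFS concat, then move it into place.
DistributedFileSystem dfs = (DistributedFileSystem) targetFS;
Path first = blockParts.get(0);
Path[] rest = blockParts.subList(1, blockParts.size()).toArray(new Path[0]);
dfs.concat(first, rest);
dfs.rename(first, finalTargetPath);
{code}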

Good luck. :]

> distcp can copy blocks in parallel
> --
>
> Key: HADOOP-11794
> URL: https://issues.apache.org/jira/browse/HADOOP-11794
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 0.21.0
>Reporter: dhruba borthakur
>Assignee: Yongjun Zhang
> Attachments: MAPREDUCE-2257.patch
>
>
> The minimum unit of work for a distcp task is a file. We have files that are 
> greater than 1 TB with a block size of 1 GB. If we use distcp to copy these 
> files, the tasks either take a very long time or eventually fail. A better 
> way for distcp would be to copy all the source blocks in parallel, and then 
> stitch the blocks back into files at the destination via the HDFS Concat API 
> (HDFS-222).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default

2014-10-30 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190320#comment-14190320
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
--

Heh, actually, thank you for correcting me, Allen. You're right: the 
data-corruption part is a clumsily worded overstatement, and not 
ORC-specific. 

1. Checksum-verification between source and target is guaranteed to fail for 
files with identical contents but different block-sizes (when the contents 
span multiple blocks). If HDFS has been working to fix this, do point me to 
the JIRA. The only way to have DistCp succeed in copying such files is to skip 
checksums, and that raises the potential for bad copies, regardless of format.

2. There's potential for performance degradation when ORC files with large 
stripes are copied to clusters with smaller block-sizes, if block-sizes aren't 
preserved.

While #2 is of some concern, #1 is of maximum import. 

 Change distcp to have -pb on by default
 ---

 Key: HADOOP-8143
 URL: https://issues.apache.org/jira/browse/HADOOP-8143
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Dave Thompson
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HADOOP-8143.1.patch


 We should have the preserve-blocksize option (-pb) on in distcp by default.
 The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HADOOP-8143) Change distcp to have -pb on by default

2014-10-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan reopened HADOOP-8143:
--

Chaps, would it be ok if we revisited this ask?

1. [~davet]'s original problem remains, i.e. copying files between 2 clusters 
with different default block-sizes will fail, without either -pb or -skipCrc. 
HADOOP-8233 only solves this for 0-byte files.

2. File-formats such as ORC perform several optimizations w.r.t. data-stripes 
and HDFS-block-sizes. If such files were to be copied between clusters without 
preserving block-sizes, there would ensue performance degradation (at best) or 
data-corruption (at worst).

Would it be acceptable to preserve block-sizes by default (i.e. if -p isn't 
used), only if the source and target file-systems are HDFS?

 Change distcp to have -pb on by default
 ---

 Key: HADOOP-8143
 URL: https://issues.apache.org/jira/browse/HADOOP-8143
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Dave Thompson
Assignee: Dave Thompson
Priority: Minor

 We should have the preserve-blocksize option (-pb) on in distcp by default.
 The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2014-10-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Attachment: HADOOP-8143.1.patch

Tentative fix. This preserves block-size by default (but only if -p isn't 
specified at all). This assumes that if the user said {{-pug}}, then block-size 
was deliberately left out.
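
A rough sketch of the intended behaviour (not the literal patch; names taken 
from the DistCp options-parser):
{code:java}
// If the user supplied no -p flag at all, default to preserving block-size.
// If the user said e.g. -pug, block-size was deliberately left out: do nothing.
if (!commandLine.hasOption(DistCpOptionSwitch.PRESERVE_STATUS.getSwitch())) {
  options.preserve(DistCpOptions.FileAttribute.BLOCKSIZE);
}
{code}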

 Change distcp to have -pb on by default
 ---

 Key: HADOOP-8143
 URL: https://issues.apache.org/jira/browse/HADOOP-8143
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Dave Thompson
Assignee: Dave Thompson
Priority: Minor
 Attachments: HADOOP-8143.1.patch


 We should have the preserve-blocksize option (-pb) on in distcp by default.
 The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default

2014-10-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan updated HADOOP-8143:
-
Status: Patch Available  (was: Reopened)

 Change distcp to have -pb on by default
 ---

 Key: HADOOP-8143
 URL: https://issues.apache.org/jira/browse/HADOOP-8143
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Dave Thompson
Assignee: Dave Thompson
Priority: Minor
 Attachments: HADOOP-8143.1.patch


 We should have the preserve-blocksize option (-pb) on in distcp by default.
 The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HADOOP-8143) Change distcp to have -pb on by default

2014-10-29 Thread Mithun Radhakrishnan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mithun Radhakrishnan reassigned HADOOP-8143:


Assignee: Mithun Radhakrishnan  (was: Dave Thompson)

 Change distcp to have -pb on by default
 ---

 Key: HADOOP-8143
 URL: https://issues.apache.org/jira/browse/HADOOP-8143
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Dave Thompson
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HADOOP-8143.1.patch


 We should have the preserve-blocksize option (-pb) on in distcp by default.
 The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default

2014-10-29 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189345#comment-14189345
 ] 

Mithun Radhakrishnan commented on HADOOP-8143:
--

[~aw]
bq. forcing block size will break non-HDFS methods in surprising ways.

Here's the code in DistCp that is affected by preserving block-size:
{code:java}
  private static long getBlockSize(
          EnumSet<FileAttribute> fileAttributes,
          FileStatus sourceFile, FileSystem targetFS, Path tmpTargetPath) {
    boolean preserve = fileAttributes.contains(FileAttribute.BLOCKSIZE)
        || fileAttributes.contains(FileAttribute.CHECKSUMTYPE);
    return preserve ? sourceFile.getBlockSize() : targetFS
        .getDefaultBlockSize(tmpTargetPath);
  }
{code}

Would the concern be that {{FileStatus.getBlockSize()}} might conk out if the 
source-file isn't on HDFS? If so, note that 
{{FileSystem.getDefaultBlockSize()}} is already being called for non-HDFS 
file-systems by default. 

 Change distcp to have -pb on by default
 ---

 Key: HADOOP-8143
 URL: https://issues.apache.org/jira/browse/HADOOP-8143
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Dave Thompson
Assignee: Mithun Radhakrishnan
Priority: Minor
 Attachments: HADOOP-8143.1.patch


 We should have the preserve-blocksize option (-pb) on in distcp by default.
 The checksum check, which is on by default, will always fail if the blocksize is not the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HADOOP-10129) Distcp may succeed when it fails

2013-11-26 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832934#comment-13832934
 ] 

Mithun Radhakrishnan commented on HADOOP-10129:
---

Thanks, Daryn. FWIW, the patch looks good. +1.

Sorry, I thought this was already resolved. I could've sworn I'd posted a fix 
on another JIRA to mimic the Y!Internal version's behaviour (i.e. call 
outstream.close() explicitly).

 Distcp may succeed when it fails
 

 Key: HADOOP-10129
 URL: https://issues.apache.org/jira/browse/HADOOP-10129
 Project: Hadoop Common
  Issue Type: Bug
  Components: tools/distcp
Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
 Attachments: HADOOP-10129.patch


 Distcp uses {{IOUtils.cleanup}} to close its output streams without first 
 attempting to close the streams.  {{IOUtils.cleanup}} will swallow close (or 
 implicit flush-on-close) exceptions.  As a result, distcp may silently skip 
 files when a partial file listing is generated, and/or appear to succeed when 
 individual copies fail.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HADOOP-8225) DistCp fails when invoked by Oozie

2012-07-27 Thread Mithun Radhakrishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424154#comment-13424154
 ] 

Mithun Radhakrishnan commented on HADOOP-8225:
--

Hello, Daryn. Thanks so much for reviewing the fix.

I'm pretty sure that DistCp handles tokens correctly. This code-path was 
introduced purely for the case where DistCp is invoked via Oozie. I wish there 
were another way to transfer delegation-tokens from Oozie's launcher over to 
DistCp. (This is also the way Pig and Hive actions work in Oozie.)
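
For reference, the gist of the fix is to propagate the launcher's token file 
into the job configuration (a sketch; {{conf}} is the DistCp job conf):
{code:java}
// Oozie's launcher exposes the delegation-token file via this env-var;
// copy it into the job conf so the DistCp MR job picks the tokens up.
String tokenFile = System.getenv("HADOOP_TOKEN_FILE_LOCATION");
if (tokenFile != null) {
  conf.set("mapreduce.job.credentials.binary", tokenFile);
}
{code}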

 DistCp fails when invoked by Oozie
 --

 Key: HADOOP-8225
 URL: https://issues.apache.org/jira/browse/HADOOP-8225
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 0.23.1
Reporter: Mithun Radhakrishnan
 Attachments: HADOOP-8225.patch, HADOOP-8225.patch


 When DistCp is invoked through a proxy-user (e.g. through Oozie), the 
 delegation-token-store isn't picked up by DistCp correctly. One sees failures 
 such as:
 ERROR [main] org.apache.hadoop.tools.DistCp: Couldn't complete DistCp
 operation: 
 java.lang.SecurityException: Intercepted System.exit(-999)
 at
 org.apache.oozie.action.hadoop.LauncherSecurityManager.checkExit(LauncherMapper.java:651)
 at java.lang.Runtime.exit(Runtime.java:88)
 at java.lang.System.exit(System.java:904)
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:357)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:394)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:147)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:142)
 Looking over the DistCp code, one sees that HADOOP_TOKEN_FILE_LOCATION isn't 
 being copied to mapreduce.job.credentials.binary, in the job-conf. I'll post 
 a patch for this shortly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira