[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102093#comment-17102093 ] Mithun Radhakrishnan commented on HADOOP-8143: -- Sorry for the late reply. I am supportive of rolling this back. +1, non-binding. The workaround I have suggested is unwieldy. And this change was not intended to mess up non-HDFS DistCp sources/targets. bq. What made sense back then doesn't make sense now. Agreed, [~kihwal]. I suspect production DistCp jobs through Oozie DistCp Actions might already be preserving block-sizes. Given that HDFS-13056 is in, DistCp should now be free to do CRC checks, without depending on matching HDFS block sizes. > Change distcp to have -pb on by default > --- > > Key: HADOOP-8143 > URL: https://issues.apache.org/jira/browse/HADOOP-8143 > Project: Hadoop Common > Issue Type: Improvement >Reporter: Dave Thompson >Assignee: Mithun Radhakrishnan >Priority: Minor > Fix For: 3.0.0-alpha4 > > Attachments: HADOOP-8143.1.patch, HADOOP-8143.2.patch, > HADOOP-8143.3.patch > > > We should have the preserve blocksize (-pb) on in distcp by default. > checksum which is on by default will always fail if blocksize is not the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073938#comment-17073938 ] Mithun Radhakrishnan commented on HADOOP-8143: -- Thank you for pointing to HDFS-13056. This should address the crux of the problem, i.e. decoupling checksums from block-size. I have only perused it briefly, but the following section from HDFS-13056's description is promising: {quote}This option can be enabled or disabled at the granularity of individual client calls by setting the new configuration option `dfs.checksum.combine.mode` to `COMPOSITE_CRC` {quote} It appears that this doesn't require opt-in on HDFS/NameNode, and that querying for a file's checksum with {{dfs.checksum.combine.mode=COMPOSITE_CRC}} should return a CRC independent of block-size. If this holds, perhaps DistCp should be changed to fetch CRCs this way, freeing us from having to preserve block-size for the sake of correctness. (It'll only hold on Hadoop 3.1.1+.) {quote}At the very least, we need a way to turn this new default off. Especially when -skipCrcCheck is true. {quote} I'm a little rusty, but it surprises me that block-size preservation isn't turned off when {{-skipCrcCheck && (!-pb)}}. If that isn't the case, it's an oversight and needs fixing. As a workaround, specifying `-pu`, for instance, should disable block-size preservation.
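If COMPOSITE_CRC behaves as described above, a cross-cluster copy without block-size preservation might look like the following sketch. (Hedged: the option name {{dfs.checksum.combine.mode}} and value {{COMPOSITE_CRC}} come from HDFS-13056; the NameNode hosts and paths are illustrative, and both clusters are assumed to run Hadoop 3.1.1+.)

```shell
# Sketch: request block-size-independent composite CRCs instead of the
# default MD5-of-MD5s checksum, so the CRC check can pass even when the
# target's block-size differs from the source's (i.e. without -pb).
hadoop distcp \
  -Ddfs.checksum.combine.mode=COMPOSITE_CRC \
  hdfs://source-nn.example.net:8020/data/logs \
  hdfs://target-nn.example.net:8020/data/logs
```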
[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056578#comment-16056578 ] Mithun Radhakrishnan commented on HADOOP-8143: -- bq. What you proposed, sounds to be that -pb becomes a deprecated option because block size is always preserved. Ah, yes. I see. I stand corrected. :] Your phrasing is more accurate. Thank you.
[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16056478#comment-16056478 ] Mithun Radhakrishnan commented on HADOOP-8143: -- bq. If -p option of distcp command is unspecified, block size is preserved. That looks good. What about: {noformat} Block-size is preserved, even if the "-p" option of distcp command is unspecified. {noformat} ?
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Attachment: HADOOP-8143.3.patch Sorry, just saw that. Here's the correction.
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Status: Open (was: Patch Available)
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Status: Open (was: Patch Available) Re-submitting for tests.
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Attachment: HADOOP-8143.2.patch Rebased to work with changes on trunk.
[jira] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847403#comment-15847403 ] Mithun Radhakrishnan commented on HADOOP-11794: --- Wow, this is really good work. (I'm continually astonished at how much DistCp has been improved upon and added to.) Please forgive me, my DistCp-ese is a little rusty. I have a couple of minor questions: # In {{DistCpUtils::toCopyListingFileStatus()}}, the javadoc says it {{"Converts a list of FileStatus to a list CopyListingFileStatus"}}. The method does not take a {{List}}. Shall we remove {{"list of"}}? # Could we rephrase the doc to {{"Converts a `FileStatus` to a list of `CopyListingFileStatus`. Returns either one CopyListingFileStatus per chunk of file-blocks (if file-size exceeds chunk-size), or one CopyListingFileStatus for the entire file (if file-size is too small to split)."}}? # {{DistCpUtils::toCopyListingFileStatus()}} handles heterogeneous block-sizes via {{DFSClient.getBlockLocations()}}, but only if {{fileStatus.getLen() > fileStatus.getBlockSize()*chunkSize}}. Is it possible for an HDFS file with {{fileStatus.getBlockSize() == 256M}} to be composed entirely of tiny blocks (say, 32MB)? Could we have a situation where a splittable file (with small blocks) ends up unsplit, because {{fileStatus.getBlockSize() >> effectiveBlockSize}}? # I wonder if {{chunksize}} might be confused with the "chunk-length in bytes" (like {{CopyListingFileStatus.chunkLength}}). I could be wrong, but would {{blocksPerChunk}} be less ambiguous? (Please ignore if this is too pervasive.) # Nitpick: {{CopyListingFileStatus.toString()}} uses String concatenation inside a call to {{StringBuilder.append()}}. (It was that way well before this patch. :/) Shall we replace this with a chain of {{.append()}} calls? # In {{CopyCommitter::concatFileChunks()}}, could we please add additional logging for which files/chunks are being merged? Thanks so much for working on this, [~yzhangal]. 
:] > distcp can copy blocks in parallel > -- > > Key: HADOOP-11794 > URL: https://issues.apache.org/jira/browse/HADOOP-11794 > Project: Hadoop Common > Issue Type: Improvement > Components: tools/distcp >Affects Versions: 0.21.0 >Reporter: dhruba borthakur >Assignee: Yongjun Zhang > Attachments: HADOOP-11794.001.patch, HADOOP-11794.002.patch, > HADOOP-11794.003.patch, MAPREDUCE-2257.patch > > > The minimum unit of work for a distcp task is a file. We have files that are > greater than 1 TB with a block size of 1 GB. If we use distcp to copy these > files, the tasks either take a long long long time or finally fail. A better > way for distcp would be to copy all the source blocks in parallel, and then > stitch the blocks back to files at the destination via the HDFS Concat API > (HDFS-222)
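On the {{StringBuilder}} nitpick above: concatenating inside {{append()}} builds an intermediate {{String}} on every call, which a chain of {{.append()}} calls avoids. A minimal sketch; the field names are illustrative stand-ins, not the actual {{CopyListingFileStatus}} fields:

```java
public class AppendSketch {
    // Instead of: sb.append("chunkOffset = " + chunkOffset + ...), which
    // allocates a throwaway String, chain the appends directly:
    static String describeChunk(long chunkOffset, long chunkLength) {
        return new StringBuilder()
                .append("chunkOffset = ").append(chunkOffset)
                .append(", chunkLength = ").append(chunkLength)
                .toString();
    }

    public static void main(String[] args) {
        // prints: chunkOffset = 0, chunkLength = 134217728
        System.out.println(describeChunk(0L, 134217728L));
    }
}
```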
[jira] [Updated] (HADOOP-14015) Partitions on Remote HDFS break encryption-zone checks
[ https://issues.apache.org/jira/browse/HADOOP-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-14015: -- Attachment: HADOOP-14015.1.patch Here, I fetch the {{FileSystem}} instance appropriate for the data-path. > Partitions on Remote HDFS break encryption-zone checks > -- > > Key: HADOOP-14015 > URL: https://issues.apache.org/jira/browse/HADOOP-14015 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 1.2.1, 2.1.1-beta >Reporter: Mithun Radhakrishnan > Attachments: HADOOP-14015.1.patch > > > This is in relation to HIVE-13243, which fixes encryption-zone checks for > external tables. > Unfortunately, this is still borked for partitions with remote HDFS paths. > The code fails as follows: > {noformat} > 2015-12-09 19:26:14,997 ERROR [pool-4-thread-1476] server.TThreadPoolServer > (TThreadPoolServer.java:run_aroundBody0(305)) - Error occurred during > processing of message. > java.lang.IllegalArgumentException: Wrong FS: > hdfs://remote-cluster-nn1.myth.net:8020/dbs/mythdb/myth_table/dt=20170120, > expected: hdfs://local-cluster-n1.myth.net:8020 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1985) > at > org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262) > at > org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1290) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.checkTrashPurgeCombination(HiveMetaStore.java:1746) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partitions_req(HiveMetaStore.java:2974) > at sun.reflect.GeneratedMethodAccessor49.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) > at com.sun.proxy.$Proxy5.drop_partitions_req(Unknown Source) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:10005) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$drop_partitions_req.getResult(ThriftHiveMetastore.java:9989) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:767) > at > org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$2.run(HadoopThriftAuthBridge.java:763) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) > at > org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:763) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody0(TThreadPoolServer.java:285) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run_aroundBody1$advice(TThreadPoolServer.java:101) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:1) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > I have a really simple fix.
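The "Wrong FS" failure above boils down to a {{FileSystem}} handle bound to one cluster URI being asked to serve a path on another; the fix described ("fetch the FileSystem instance appropriate for the data-path") resolves the handle from the path itself, in Hadoop via {{Path.getFileSystem(conf)}}, rather than reusing the default instance. A hypothetical, stdlib-only sketch of the check that trips (mirroring, not reproducing, {{FileSystem.checkPath}}):

```java
import java.net.URI;

public class FsCheckSketch {
    // Hypothetical mirror of the failing check: a handle bound to fsUri can
    // only serve paths whose scheme and authority match its own cluster.
    static boolean servableBy(URI fsUri, URI pathUri) {
        return fsUri.getScheme().equals(pathUri.getScheme())
            && fsUri.getAuthority().equals(pathUri.getAuthority());
    }

    public static void main(String[] args) {
        URI localFs = URI.create("hdfs://local-cluster-n1.myth.net:8020");
        URI remotePartition = URI.create(
            "hdfs://remote-cluster-nn1.myth.net:8020/dbs/mythdb/myth_table/dt=20170120");
        // The remote partition fails the check -> the "Wrong FS" error above.
        System.out.println(servableBy(localFs, remotePartition)); // prints: false
    }
}
```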
[jira] [Updated] (HADOOP-14015) Partitions on Remote HDFS break encryption-zone checks
[ https://issues.apache.org/jira/browse/HADOOP-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-14015: -- Affects Version/s: 1.2.1 2.1.1-beta Status: Patch Available (was: Open)
[jira] [Created] (HADOOP-14015) Partitions on Remote HDFS break encryption-zone checks
Mithun Radhakrishnan created HADOOP-14015: - Summary: Partitions on Remote HDFS break encryption-zone checks Key: HADOOP-14015 URL: https://issues.apache.org/jira/browse/HADOOP-14015 Project: Hadoop Common Issue Type: Bug Reporter: Mithun Radhakrishnan
[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832502#comment-15832502 ] Mithun Radhakrishnan commented on HADOOP-8143: -- It's that time of year again when one wonders whether this fix may be considered for submission. :] How about it, chaps?
[jira] [Commented] (HADOOP-12473) distcp's ignoring failures option should be mutually exclusive with the atomic option
[ https://issues.apache.org/jira/browse/HADOOP-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068624#comment-15068624 ] Mithun Radhakrishnan commented on HADOOP-12473: --- [~jira.shegalov], that is an interesting take. Hmm. Between you and me, I think no one should be using {{-i}} at all, in atomic copies or otherwise. It was included to be backward compatible with DistCpV1, for those with an inexplicable tolerance for bad data. :] {{-atomic}} was added so that users have the choice of staging their copies to a temp-location, before atomically moving them to the target location. I guessed there might be users who'd want to stage data before moving it, but could also tolerate bad copies. But I do see your point of view. {{-i}} could be useful to work around annoying copy errors. For instance, there was a time when {{-skipCrc}} wouldn't work correctly, and copying files with different block-sizes (or empty files) would result in CRC failures. {{-i}} would let workflows complete while DistCp was under fix. Removing this makes the workaround unavailable when {{-atomic}} is used. I'm on the fence here, but tending in your direction. I'd be happy to go along, if you could get another "Aye!" from a committer. Paging [~jlowe] and [~daryn]. > distcp's ignoring failures option should be mutually exclusive with the > atomic option > - > > Key: HADOOP-12473 > URL: https://issues.apache.org/jira/browse/HADOOP-12473 > Project: Hadoop Common > Issue Type: Bug > Components: tools/distcp >Affects Versions: 2.7.1 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > > In {{CopyMapper::handleFailure}}, the mapper handles failure and will ignore > it if its config key is on. Ignoring failures option {{-i}} should be > mutually exclusive with the {{-atomic}} option otherwise an incomplete dir is > eligible for commit defeating the purpose.
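The mutual exclusion being debated amounts to a one-line validation at option-parse time. A hedged sketch; the method and message are illustrative, not the actual DistCpOptions validation code:

```java
public class OptionExclusionSketch {
    // Hypothetical validation: reject -i (ignore failures) alongside -atomic,
    // since ignoring a failed copy could let an incomplete staging directory
    // be atomically committed to the target.
    static void validate(boolean ignoreFailures, boolean atomicCommit) {
        if (ignoreFailures && atomicCommit) {
            throw new IllegalArgumentException(
                "-i and -atomic are mutually exclusive: ignoring copy "
                + "failures could commit an incomplete directory");
        }
    }

    public static void main(String[] args) {
        validate(true, false);   // -i alone: accepted
        validate(false, true);   // -atomic alone: accepted
        try {
            validate(true, true); // throws IllegalArgumentException
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```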
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067007#comment-15067007 ] Mithun Radhakrishnan commented on HADOOP-11794: --- [~yzhangal], bq. Appreciate your excellent work! You're too kind. :] bq. But I'm making it more flexible here, such that we can support variable number blocks per split. I agree with the principle of what you're suggesting. Combining multiple splits into a larger split (based on size) is a problem that {{CombineFileInputFormat}} provides a solution for. Do you think we can use {{CombineFileInputFormat}} to combine block-level splits into a larger split? bq. We need some new client-namenode API protocol to get back the locatedBlocks for the specified block range... Hmm... Do we? DistCp copies whole files (even if at a split level). Since we can retrieve located blocks for all blocks in the file, shouldn't that be enough? We could group locatedBlocks by block-id. Perhaps I'm missing something.
[jira] [Commented] (HADOOP-12469) distcp should not ignore the ignoreFailures option
[ https://issues.apache.org/jira/browse/HADOOP-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067171#comment-15067171 ] Mithun Radhakrishnan commented on HADOOP-12469: --- Ah, I see what you did there. +1. > distcp should not ignore the ignoreFailures option > -- > > Key: HADOOP-12469 > URL: https://issues.apache.org/jira/browse/HADOOP-12469 > Project: Hadoop Common > Issue Type: Bug > Components: tools/distcp >Affects Versions: 2.7.1 >Reporter: Gera Shegalov >Assignee: Mingliang Liu >Priority: Critical > Fix For: 2.8.0 > > Attachments: HADOOP-12469.000.patch, HADOOP-12469.001.patch > > > {{RetriableFileCopyCommand.CopyReadException}} is double-wrapped via > # via {{RetriableCommand::execute}} > # via {{CopyMapper#copyFileWithRetry}} > before {{CopyMapper::handleFailure}} tests > {code} > if (ignoreFailures && exception.getCause() instanceof > RetriableFileCopyCommand.CopyReadException > {code} > which is always false. > Orthogonally, ignoring failures should be mutually exclusive with the atomic > option otherwise an incomplete dir is eligible for commit defeating the > purpose.
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067227#comment-15067227 ] Mithun Radhakrishnan commented on HADOOP-11794: --- Ah, I finally see. That makes complete sense. Thank you for the pointer to the JIRA. Also, {{CombineFileInputFormat}} might work with {{UniformSizeInputFormat}}, but it might not work with {{DynamicInputFormat}}. Maybe combining a configurable number of blocks (ranges) into splits would be easier to work with. I see what you're doing, and I agree.
[jira] [Commented] (HADOOP-12473) distcp's ignoring failures should be mutually exclusive with the atomic option
[ https://issues.apache.org/jira/browse/HADOOP-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067214#comment-15067214 ] Mithun Radhakrishnan commented on HADOOP-12473: --- [~jira.shegalov], [~liuml07], could you please explain the reasoning behind this fix? If I understand correctly, Gera's orthogonal suggestion in HADOOP-12469 was to make {{-atomic}} and {{-i}} mutually exclusive. The latest patch in this JIRA doesn't seem to address this concern, AFAICT. It makes {{ignoreFailures}} an {{AtomicBoolean}}, which is not what Gera was getting at, I believe. Also, [~jira.shegalov], why do you recommend that {{-atomic}} and {{-i}} be mutually exclusive? Aren't they orthogonal concerns? Why consider {{-atomic}} as incapable of ignoring copy-errors? > distcp's ignoring failures should be mutually exclusive with the atomic option > -- > > Key: HADOOP-12473 > URL: https://issues.apache.org/jira/browse/HADOOP-12473 > Project: Hadoop Common > Issue Type: Bug > Components: tools/distcp >Affects Versions: 2.7.1 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-12473.000.patch, HADOOP-12473.001.patch, > HADOOP-12473.002.patch > > > In {{CopyMapper::handleFailure}}, the mapper handles failures and will ignore > them if the relevant config key is on. Ignoring failures should be mutually > exclusive with the atomic option, otherwise an incomplete dir is eligible for > commit, defeating the purpose.
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067142#comment-15067142 ] Mithun Radhakrishnan commented on HADOOP-11794: --- bq. My argument is that fetching all block locations for a file is not as efficient as fetching only the block range the mapper is assigned to work on. Thank you for explaining. Let me see if I can phrase my questions more clearly than before: # Would it make sense to include the block-locations within the splits, at the time of split-calculation, instead of the block-ranges? If yes, then we can make do with the API we already have, by fetching locatedBlocks for all files, and grouping them among the DistCp splits. (It is indeed possible that keeping ranges, and using your proposed API on the map-side might be faster. But those map-side calls might possibly also exert more parallel load on the name-node, depending on the number of maps.) # Naive question: Why do we need to identify locatedBlocks? Don't HDFS files have uniformly sized blocks (within a file)? As such, aren't the block-boundaries implicit (i.e. from {{blockId*blockSize}} to {{(blockId+1)*(blockSize) - 1}})? Can't we simply copy that range of bytes into a new file (and stitch the new files in reduce)?
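To make question #2 above concrete: under the stated assumption of uniform block-size within a file (only the last block possibly shorter), the per-block byte ranges are pure arithmetic. A minimal self-contained sketch (illustrative names, not DistCp code):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockRanges {
    // Returns {offset, length} pairs for each block of a file, assuming
    // uniform block-size; only the final block may be shorter.
    static List<long[]> ranges(long fileLength, long blockSize) {
        List<long[]> out = new ArrayList<>();
        for (long off = 0; off < fileLength; off += blockSize) {
            out.add(new long[] { off, Math.min(blockSize, fileLength - off) });
        }
        return out;
    }

    public static void main(String[] args) {
        // A "2.5-block" file: two full blocks plus a half block.
        List<long[]> r = ranges(250, 100);
        System.out.println(r.size());                         // 3
        System.out.println(r.get(2)[0] + "," + r.get(2)[1]);  // 200,50
    }
}
```

If block boundaries really are implicit this way, a copy task needs only {{(path, blockId, blockSize, fileLength)}} to know exactly which bytes to read, without consulting locatedBlocks per range.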
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044105#comment-15044105 ] Mithun Radhakrishnan commented on HADOOP-11794: --- [~yzhangal]: Thank you, sir. Please do. Hive has kept me busy enough not to devote time here. I'd be happy to review your work. I had a patch a couple of years ago which split files on block-boundaries, copied them over, and then stitched them together using {{DistributedFileSystem.concat()}} in a reduce-step. If I can find the patch, I'll ping it to you, but it's not terribly hard to do this from scratch. The prototype had very promising performance. I look forward to your solution.
[jira] [Updated] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-11794: -- Assignee: Yongjun Zhang (was: Mithun Radhakrishnan)
[jira] [Commented] (HADOOP-11794) distcp can copy blocks in parallel
[ https://issues.apache.org/jira/browse/HADOOP-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15044373#comment-15044373 ] Mithun Radhakrishnan commented on HADOOP-11794: --- Sorry, no. That's likely [~dhruba]'s work, which might have been based on the DistCp-v1 code. We'll need new code for DistCp-v2 (i.e. my rewrite from MAPREDUCE-2765). Apologies if you've already thought this through. One would need to change the {{DynamicInputFormat#createSplits()}} implementation, which currently looks thus:
{code:java:borderStyle=solid:title=DynamicInputFormat.java}
private List<InputSplit> createSplits(JobContext jobContext,
                                      List<DynamicInputChunk> chunks)
    throws IOException {
  int numMaps = getNumMapTasks(jobContext.getConfiguration());
  final int nSplits = Math.min(numMaps, chunks.size());
  List<InputSplit> splits = new ArrayList<InputSplit>(nSplits);
  for (int i = 0; i < nSplits; ++i) {
    TaskID taskId = new TaskID(jobContext.getJobID(), TaskType.MAP, i);
    chunks.get(i).assignTo(taskId);
    splits.add(new FileSplit(chunks.get(i).getPath(), 0,
        // Setting non-zero length for FileSplit size, to avoid a possible
        // future when 0-sized file-splits are considered "empty" and skipped
        // over.
        getMinRecordsPerChunk(jobContext.getConfiguration()),
        null));
  }
  DistCpUtils.publish(jobContext.getConfiguration(),
                      CONF_LABEL_NUM_SPLITS, splits.size());
  return splits;
}
{code}
You'll need to create a {{FileSplit}} per file-block (by first examining the file's block-size). The mappers will now need to emit something like {{(relativePathForOriginalSourceFile, targetLocation_with_block_number)}}. By keying on the relative-source-paths (+ expected number of blocks), you can get all the target-block-locations to hit the same reducer, where you can stitch them together. Good luck. :]
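The keying scheme described above can be modelled with plain collections. This is a toy sketch (not the actual mapper/reducer classes; names are illustrative) of how keying block-parts by relative source path, and ordering them by block number, yields the order in which a reducer would concat the parts:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BlockStitchModel {
    // relativeSourcePath -> (blockNumber -> temporary target part-file).
    // The TreeMap keeps each file's parts sorted by block number, i.e.
    // the order a reducer would stitch them in.
    private final Map<String, TreeMap<Integer, String>> parts = new HashMap<>();

    // Stand-in for the mapper's emission.
    void emit(String relativePath, int blockNumber, String targetPartPath) {
        parts.computeIfAbsent(relativePath, k -> new TreeMap<>())
             .put(blockNumber, targetPartPath);
    }

    // Stand-in for the reduce step: the ordered part list that would be
    // handed to something like DistributedFileSystem.concat().
    List<String> stitchOrder(String relativePath) {
        return new ArrayList<>(
            parts.getOrDefault(relativePath, new TreeMap<>()).values());
    }

    public static void main(String[] args) {
        BlockStitchModel m = new BlockStitchModel();
        // Mappers finish in arbitrary order; the TreeMap restores order.
        m.emit("dir/file1", 2, "/tmp/file1.part2");
        m.emit("dir/file1", 0, "/tmp/file1.part0");
        m.emit("dir/file1", 1, "/tmp/file1.part1");
        System.out.println(m.stitchOrder("dir/file1"));
    }
}
```

Since all parts of one source file share a key, they land in one reducer regardless of which mapper copied each block.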
[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190320#comment-14190320 ] Mithun Radhakrishnan commented on HADOOP-8143: -- Ah, actually, thank you for correcting me, Allen. You're right, the data-corruption part is a clumsily worded overstatement, and not ORC-specific. 1. Checksum-verifications between source and target are guaranteed to fail between files with identical contents but different block-sizes (for files that span multiple blocks). If HDFS has been working to fix this, do let me know of the JIRA. The only way to have DistCp succeed in copying them is to skip checksums. And this raises the potential for bad copies of the file, regardless of format. 2. There's potential for performance degradation when ORC files with large stripes are copied to clusters with smaller block-sizes, if block-sizes aren't preserved. While #2 is of some concern, #1 is of maximum import. Change distcp to have -pb on by default --- Key: HADOOP-8143 URL: https://issues.apache.org/jira/browse/HADOOP-8143 Project: Hadoop Common Issue Type: Improvement Reporter: Dave Thompson Assignee: Mithun Radhakrishnan Priority: Minor Attachments: HADOOP-8143.1.patch We should have the preserve blocksize (-pb) on in distcp by default. checksum which is on by default will always fail if blocksize is not the same.
[jira] [Reopened] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan reopened HADOOP-8143: -- Chaps, would it be ok if we revisited this ask? 1. [~davet]'s original problem remains, i.e. copying files between 2 clusters with different default block-sizes will fail, without either -pb or -skipCrc. HADOOP-8233 only solves this for 0-byte files. 2. File-formats such as ORC perform several optimizations w.r.t. data-stripes and HDFS block-sizes. If such files were to be copied between clusters without preserving block-sizes, there would be performance degradation (at best) or data-corruption (at worst). Would it be acceptable to preserve block-sizes by default (i.e. if -p isn't used), only if the source and target file-systems are HDFS?
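A minimal sketch of the gating proposed in that reopen comment, using a hypothetical helper that works on filesystem schemes (the real check would inspect the actual {{FileSystem}} instances; nothing here is DistCp code):

```java
public class PreserveBlockSizeDefault {
    // Hypothetical decision helper: preserve block-size by default only
    // when the user gave no -p flags at all AND both endpoints are HDFS.
    // If any -p flags were given, honor them literally (so -pug means
    // block-size was deliberately left out).
    static boolean preserveBlockSize(boolean userGavePreserveFlags,
                                     boolean userAskedForBlockSize,
                                     String sourceScheme,
                                     String targetScheme) {
        if (userGavePreserveFlags) {
            return userAskedForBlockSize;
        }
        return "hdfs".equals(sourceScheme) && "hdfs".equals(targetScheme);
    }

    public static void main(String[] args) {
        System.out.println(preserveBlockSize(false, false, "hdfs", "hdfs")); // true: default on, HDFS-to-HDFS
        System.out.println(preserveBlockSize(true, false, "hdfs", "hdfs"));  // false: -pug given, b left out
        System.out.println(preserveBlockSize(false, false, "s3a", "hdfs"));  // false: non-HDFS source
    }
}
```

This keeps the new default from affecting non-HDFS sources/targets, which is precisely the breakage later raised against the committed patch.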
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Attachment: HADOOP-8143.1.patch Tentative fix. This preserves block-size by default (but only if -p isn't specified at all). This assumes that if the user said {{-pug}}, then block-size was deliberately left out.
[jira] [Updated] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan updated HADOOP-8143: - Status: Patch Available (was: Reopened)
[jira] [Assigned] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mithun Radhakrishnan reassigned HADOOP-8143: Assignee: Mithun Radhakrishnan (was: Dave Thompson)
[jira] [Commented] (HADOOP-8143) Change distcp to have -pb on by default
[ https://issues.apache.org/jira/browse/HADOOP-8143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14189345#comment-14189345 ] Mithun Radhakrishnan commented on HADOOP-8143: -- [~aw] bq. forcing block size will break non-HDFS methods in surprising ways. Here's the code in DistCp that is affected by preserving block-size:
{code:java}
private static long getBlockSize(
    EnumSet<FileAttribute> fileAttributes,
    FileStatus sourceFile, FileSystem targetFS,
    Path tmpTargetPath) {
  boolean preserve = fileAttributes.contains(FileAttribute.BLOCKSIZE)
      || fileAttributes.contains(FileAttribute.CHECKSUMTYPE);
  return preserve ? sourceFile.getBlockSize()
      : targetFS.getDefaultBlockSize(tmpTargetPath);
}
{code}
Would the concern be that {{FileStatus.getBlockSize()}} might conk if the source-file isn't on HDFS? It's more likely that {{FileSystem.getDefaultBlockSize()}} is being called for a non-HDFS file-system as well, by default.
[jira] [Commented] (HADOOP-10129) Distcp may succeed when it fails
[ https://issues.apache.org/jira/browse/HADOOP-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832934#comment-13832934 ] Mithun Radhakrishnan commented on HADOOP-10129: --- Thanks, Daryn. FWIW, the patch looks good. +1. Sorry, I thought this was already resolved. I could've sworn I'd posted a fix on another JIRA to mimic the Y!Internal version's behaviour (i.e. call outstream.close() explicitly). Distcp may succeed when it fails Key: HADOOP-10129 URL: https://issues.apache.org/jira/browse/HADOOP-10129 Project: Hadoop Common Issue Type: Bug Components: tools/distcp Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0 Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Critical Attachments: HADOOP-10129.patch Distcp uses {{IOUtils.cleanup}} to close its output streams w/o previously attempting to close the streams. {{IOUtils.cleanup}} will swallow close or implicit flush on close exceptions. As a result, distcp may silently skip files when a partial file listing is generated, and/or appear to succeed when individual copies fail. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HADOOP-8225) DistCp fails when invoked by Oozie
[ https://issues.apache.org/jira/browse/HADOOP-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424154#comment-13424154 ] Mithun Radhakrishnan commented on HADOOP-8225: -- Hello, Daryn. Thanks so much for reviewing the fix. I'm pretty sure that DistCp handles tokens correctly. This code-path was introduced purely for the case where DistCp is invoked via Oozie. I wish there were another way to transfer delegation-tokens from Oozie's launcher over to DistCp. (This is also the way Pig and Hive actions work in Oozie.) DistCp fails when invoked by Oozie -- Key: HADOOP-8225 URL: https://issues.apache.org/jira/browse/HADOOP-8225 Project: Hadoop Common Issue Type: Bug Affects Versions: 0.23.1 Reporter: Mithun Radhakrishnan Attachments: HADOOP-8225.patch, HADOOP-8225.patch When DistCp is invoked through a proxy-user (e.g. through Oozie), the delegation-token-store isn't picked up by DistCp correctly. One sees failures such as: ERROR [main] org.apache.hadoop.tools.DistCp: Couldn't complete DistCp operation: java.lang.SecurityException: Intercepted System.exit(-999) at org.apache.oozie.action.hadoop.LauncherSecurityManager.checkExit(LauncherMapper.java:651) at java.lang.Runtime.exit(Runtime.java:88) at java.lang.System.exit(System.java:904) at org.apache.hadoop.tools.DistCp.main(DistCp.java:357) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:394) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:399) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:334) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:147) at 
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:142) Looking over the DistCp code, one sees that HADOOP_TOKEN_FILE_LOCATION isn't being copied to mapreduce.job.credentials.binary, in the job-conf. I'll post a patch for this shortly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
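The HADOOP-8225 fix described above amounts to one configuration copy: propagate the delegation-token file that Oozie's launcher exposes through the {{HADOOP_TOKEN_FILE_LOCATION}} environment variable into the job-conf key {{mapreduce.job.credentials.binary}}. A sketch under that assumption, with a plain {{Map}} standing in for Hadoop's {{Configuration}} (the helper name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class TokenPropagation {
    // Copy the token-file path from the environment into the job-conf,
    // unless the user has already set mapreduce.job.credentials.binary.
    static void propagateTokenFile(Map<String, String> jobConf,
                                   Map<String, String> env) {
        String tokenFile = env.get("HADOOP_TOKEN_FILE_LOCATION");
        if (tokenFile != null
                && !jobConf.containsKey("mapreduce.job.credentials.binary")) {
            jobConf.put("mapreduce.job.credentials.binary", tokenFile);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        Map<String, String> env = new HashMap<>();
        // Path is illustrative; Oozie's launcher sets the real value.
        env.put("HADOOP_TOKEN_FILE_LOCATION", "/tmp/launcher/container_tokens");
        propagateTokenFile(conf, env);
        System.out.println(conf.get("mapreduce.job.credentials.binary"));
    }
}
```

With the token file registered in the job-conf, the MapReduce job submitted by DistCp picks up the delegation tokens instead of dying in the launcher with the intercepted {{System.exit}} shown in the stack trace.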