[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653438#comment-14653438 ] Hudson commented on HDFS-8850: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #277 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/277/]) HDFS-8850. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks. (Colin Patrick McCabe via yliu) (yliu: rev 9a0806824e8982ec3b47315f9b960474186a) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8762) Erasure Coding: the log of each streamer should show its index
[ https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653445#comment-14653445 ] Hadoop QA commented on HDFS-8762: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 4s | Pre-patch HDFS-7285 has 5 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 57s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 13s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 15s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 29s | The applied patch generated 3 new checkstyle issues (total was 97, now 99). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 3m 26s | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 16s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 56m 22s | Tests failed in hadoop-hdfs. | | | | 105m 15s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs | | Failed unit tests | hadoop.hdfs.server.blockmanagement.TestDatanodeManager | | | hadoop.hdfs.server.namenode.snapshot.TestUpdatePipelineWithSnapshots | | | hadoop.hdfs.TestModTime | | | hadoop.fs.TestUrlStreamHandler | | | hadoop.hdfs.security.TestDelegationToken | | | hadoop.hdfs.server.namenode.TestBlockPlacementPolicyRackFaultTolarent | | | hadoop.hdfs.server.namenode.TestFileLimit | | | hadoop.hdfs.TestParallelShortCircuitRead | | | hadoop.hdfs.server.namenode.snapshot.TestFileContextSnapshot | | | hadoop.hdfs.server.blockmanagement.TestBlockInfoStriped | | | hadoop.hdfs.server.namenode.TestEditLogAutoroll | | | hadoop.TestRefreshCallQueue | | | hadoop.hdfs.protocolPB.TestPBHelper | | | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints | | | hadoop.cli.TestCryptoAdminCLI | | | hadoop.hdfs.TestSetrepDecreasing | | | hadoop.hdfs.server.datanode.TestDiskError | | | hadoop.fs.viewfs.TestViewFsWithAcls | | | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes | | | hadoop.hdfs.server.namenode.TestAddStripedBlocks | | | hadoop.hdfs.server.namenode.TestFSEditLogLoader | | | hadoop.hdfs.server.namenode.TestHostsFiles | | | hadoop.hdfs.server.datanode.TestTransferRbw | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistPolicy | | | hadoop.fs.contract.hdfs.TestHDFSContractDelete | | | hadoop.hdfs.server.namenode.TestFileContextAcl | | | hadoop.fs.TestFcHdfsSetUMask | | | hadoop.fs.TestUnbuffer | | | hadoop.hdfs.server.namenode.TestClusterId | | | hadoop.hdfs.server.namenode.TestDeleteRace | | | hadoop.hdfs.server.namenode.TestFSDirectory | | | hadoop.hdfs.server.namenode.TestLeaseManager | | | hadoop.fs.contract.hdfs.TestHDFSContractOpen | | | hadoop.hdfs.server.namenode.snapshot.TestSnapshotListing | | | hadoop.hdfs.server.datanode.TestStorageReport | | | hadoop.hdfs.server.datanode.TestBlockRecovery | | | hadoop.hdfs.server.namenode.TestFileTruncate | | | 
hadoop.hdfs.TestReadWhileWriting | | | hadoop.fs.contract.hdfs.TestHDFSContractMkdir | | | hadoop.fs.contract.hdfs.TestHDFSContractAppend | | | hadoop.hdfs.server.datanode.TestFsDatasetCache | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestRbwSpaceReservation | | | hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock | | | hadoop.hdfs.server.namenode.ha.TestQuotasWithHA | | | hadoop.hdfs.server.namenode.ha.TestGetGroupsWithHA | | | hadoop.hdfs.TestReadStripedFileWithMissingBlocks | | | hadoop.hdfs.server.namenode.TestSecondaryWebUi | | | hadoop.hdfs.server.namenode.TestMalformedURLs | | | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestRecoverStripedBlocks | | | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistFiles | | | hadoop.hdfs.TestWriteBlockGetsBlockLengthHint | | | hadoop.hdfs.server.namenode.TestHDFSConcat | | | hadoop.hdfs.server.datanode.TestCachingStrategy | | | hadoop.hdfs.server.namenode.TestAddBlockRetry | | | hadoop.fs.TestSymlinkHdfsFileSystem | | | hadoop.fs.viewfs.TestViewFsDefaultValue | | | hadoop.fs.TestSymlinkHdfsFileContext | | | hadoop.hdfs.TestClientProtocolForPipelineRecovery | | |
[jira] [Updated] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-8838: Status: Open (was: Patch Available) Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-8838: Attachment: h8838_20150731.log Jenkins is broken. I ran the tests locally; please check the attached logs. Some comments about the failed tests: 1. Is it necessary to rename the DN dirs of MiniDFSCluster? 2. JUnit complains that TestBase is not public. 3. The numBytes returned from getBlockGroup() is wrong. We should use the old method, which is to use streamer #0 (not the other streamers) to check the block boundary, as it was before the patch. Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8413) Directories are not listed recursively when fs.defaultFs is viewFs
[ https://issues.apache.org/jira/browse/HDFS-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S updated HDFS-8413: -- Status: Open (was: Patch Available) Directories are not listed recursively when fs.defaultFs is viewFs -- Key: HDFS-8413 URL: https://issues.apache.org/jira/browse/HDFS-8413 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.0 Reporter: Ajith S Assignee: Ajith S Labels: viewfs Attachments: HDFS-8413.patch Mount a cluster on the client through a viewFs mount table. Example: {quote}
<property>
  <name>fs.defaultFS</name>
  <value>viewfs:</value>
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./nn1</name>
  <value>hdfs://ns1/</value> <!-- HA nameservice -->
</property>
<property>
  <name>fs.viewfs.mounttable.default.link./user</name>
  <value>hdfs://host-72:8020/</value>
</property>
{quote} Try to list the files recursively *(hdfs dfs -ls -R / or hadoop fs -ls -R /)*; only the parent folders are listed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653709#comment-14653709 ] Hudson commented on HDFS-8850: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2223 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2223/]) HDFS-8850. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks. (Colin Patrick McCabe via yliu) (yliu: rev 9a0806824e8982ec3b47315f9b960474186a) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8833) Erasure coding: store EC schema and cell size in INodeFile and eliminate notion of EC zones
[ https://issues.apache.org/jira/browse/HDFS-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653161#comment-14653161 ] Zhe Zhang commented on HDFS-8833: - Thanks Andrew for the suggestion. I think {{*-on-create}} flags are a good solution to the future compatibility concern. [~walter.k.su] also mentioned some ideas in an offline discussion. To summarize, below is the proposed design for this change:
# ErasureCodingPolicy table
#* Create an {{ErasureCodingPolicy}} class, with {{ECSchema}} and {{cellSize}}.
#* Create an {{ErasureCodingPolicySuite}} class to manage a table of supported policies (or extend {{ErasureCodingSchemaManager}}). Something like:
{code}
0: RS-6-3 (schema), 64KB (cellSize)
1: RS-6-3 (schema), 128KB (cellSize)
2: RS-10-4 (schema), 1MB (cellSize)
{code}
#* [follow-on] Allow customized policies stored in an XAttr
# File header change
#* Remove {{isStriped}} from the {{INodeFile}} header and reduce the replication factor to 6 bits.
{code}
/**
 * Bit format:
 * [4-bit storagePolicyID][6-bit erasureCodingPolicy]
 * [6-bit replication][48-bit preferredBlockSize]
 */
{code}
#* Store the ID of the ECPolicy with 6 bits in the header -- 64 policies allowed
#* The ECPolicy is *always set* when creating a file, taking its value from the file's ancestors; {{0}} can be used to represent the contiguous layout.
#* [follow-on] Add an {{inherit-on-create}} flag as Andrew suggested above
# Directory XAttr change
#* A directory's ECPolicy XAttr can be empty, indicating the ECPolicy is the same as the ancestor's. Otherwise its own XAttr determines the policy for newly created files under the directory.
# Renaming
#* A renamed file keeps the ECPolicy in its header.
#* Therefore, a directory can contain files with different ECPolicies.
#* Conversion is not explicitly supported. If needed, a file can be converted by cp+rm.
#* When renamed, a directory carries over its ECPolicy if it is set (XAttr non-empty). Otherwise its XAttr remains empty (and newly created files under the moved directory will use the policy from the new ancestors).
Questions / comments are very welcome. Erasure coding: store EC schema and cell size in INodeFile and eliminate notion of EC zones --- Key: HDFS-8833 URL: https://issues.apache.org/jira/browse/HDFS-8833 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: HDFS-7285 Reporter: Zhe Zhang Assignee: Zhe Zhang We have [discussed | https://issues.apache.org/jira/browse/HDFS-7285?focusedCommentId=14357754&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357754] storing EC schema with files instead of EC zones and recently revisited the discussion under HDFS-8059. As a recap, the _zone_ concept has severe limitations, including renaming and nested configuration. Those limitations are valid in encryption for security reasons, and it doesn't make sense to carry them over to EC. This JIRA aims to store the EC schema and cell size at the {{INodeFile}} level. For simplicity, we should first implement it as an xattr and consider memory optimizations (such as moving it to the file header) as a follow-on. We should also disable changing the EC policy on a non-empty file / dir in the first phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
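[Editor's note] For illustration, a minimal sketch of the 64-bit header layout proposed in the comment above, using the field widths from the bit-format comment. All class, method, and constant names here are hypothetical stand-ins, not the actual {{INodeFile}} code:
{code}
/** Toy encoding of the proposed header; names are illustrative only. */
public class INodeHeaderSketch {
  private static final int BLOCK_SIZE_BITS = 48;  // preferredBlockSize
  private static final int REPLICATION_BITS = 6;
  private static final int EC_POLICY_BITS = 6;    // 6 bits => up to 64 policies

  // Layout, high to low bits:
  // [4 storagePolicyID][6 erasureCodingPolicy][6 replication][48 preferredBlockSize]
  static long encode(int storagePolicyId, int ecPolicyId, int replication,
      long preferredBlockSize) {
    return ((long) storagePolicyId
            << (EC_POLICY_BITS + REPLICATION_BITS + BLOCK_SIZE_BITS))
        | ((long) ecPolicyId << (REPLICATION_BITS + BLOCK_SIZE_BITS))
        | ((long) replication << BLOCK_SIZE_BITS)
        | preferredBlockSize;
  }

  static int getEcPolicyId(long header) {
    // Id 0 would denote the contiguous (non-striped) layout, per the proposal.
    return (int) ((header >>> (REPLICATION_BITS + BLOCK_SIZE_BITS)) & 0x3F);
  }

  public static void main(String[] args) {
    long h = encode(0, 1, 3, 128L * 1024 * 1024);
    System.out.println(getEcPolicyId(h)); // 1, i.e. RS-6-3 with 128KB cells
  }
}
{code}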
[jira] [Updated] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8704: Attachment: HDFS-8704-HDFS-7285-005.patch Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch I tested the current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, the client succeeds in writing a file smaller than a block group but fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group; this JIRA will add more test scenarios. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-8838: Attachment: h8838_20150731-HDFS-7285.patch LGTM. +1. Upload the same patch for [~szetszwo] to trigger Jenkins. Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8829) DataNode sets SO_RCVBUF explicitly is disabling tcp auto-tuning
[ https://issues.apache.org/jira/browse/HDFS-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653164#comment-14653164 ] kanaka kumar avvaru commented on HDFS-8829: --- Hi [~He Tianyi], we have added similar configuration changes in our cluster too. If you have a patch available, feel free to assign the issue to yourself and submit the patch. DataNode sets SO_RCVBUF explicitly is disabling tcp auto-tuning --- Key: HDFS-8829 URL: https://issues.apache.org/jira/browse/HDFS-8829 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.3.0, 2.6.0 Reporter: He Tianyi Assignee: kanaka kumar avvaru {code:java}
private void initDataXceiver(Configuration conf) throws IOException {
  // find free port or use privileged port provided
  TcpPeerServer tcpPeerServer;
  if (secureResources != null) {
    tcpPeerServer = new TcpPeerServer(secureResources);
  } else {
    tcpPeerServer = new TcpPeerServer(dnConf.socketWriteTimeout,
        DataNode.getStreamingAddr(conf));
  }
  tcpPeerServer.setReceiveBufferSize(HdfsConstants.DEFAULT_DATA_SOCKET_SIZE);
{code}
The last line sets SO_RCVBUF explicitly, thus disabling tcp auto-tuning on some systems. Shall we make this behavior configurable? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
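[Editor's note] A minimal sketch of the configurable behavior being suggested, using a plain {{ServerSocket}} for illustration (the real fix would live in {{TcpPeerServer}}/{{DataNode}}; the "non-positive value means unset" convention is an assumption, not the committed design):
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Illustrative only: skip SO_RCVBUF when no size is configured. */
public class RecvBufferSketch {
  // A configured value <= 0 means "leave the OS default", which keeps the
  // kernel's TCP receive-window auto-tuning enabled.
  static void applyReceiveBuffer(ServerSocket server, int configuredSize)
      throws IOException {
    if (configuredSize > 0) {
      server.setReceiveBufferSize(configuredSize);
    }
  }

  public static void main(String[] args) throws IOException {
    try (ServerSocket server = new ServerSocket()) { // unbound yet
      applyReceiveBuffer(server, -1);                // auto-tuning preserved
      server.bind(new InetSocketAddress(0));         // buffer set before bind
      System.out.println("SO_RCVBUF = " + server.getReceiveBufferSize());
    }
  }
}
{code}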
[jira] [Resolved] (HDFS-8574) When block count for a volume exceeds dfs.blockreport.split.threshold, block report causes exception
[ https://issues.apache.org/jira/browse/HDFS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S resolved HDFS-8574. --- Resolution: Not A Problem When block count for a volume exceeds dfs.blockreport.split.threshold, block report causes exception Key: HDFS-8574 URL: https://issues.apache.org/jira/browse/HDFS-8574 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.0 Reporter: Ajith S Assignee: Ajith S This piece of code in {{org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport()}} {code}
// Send one block report per message.
for (int r = 0; r < reports.length; r++) {
  StorageBlockReport singleReport[] = { reports[r] };
  DatanodeCommand cmd = bpNamenode.blockReport(
      bpRegistration, bpos.getBlockPoolId(), singleReport,
      new BlockReportContext(reports.length, r, reportId));
  numReportsSent++;
  numRPCs++;
  if (cmd != null) {
    cmds.add(cmd);
  }
{code}
when a single volume contains many blocks, i.e. more than the threshold, it tries to send the entire block report in one RPC, causing the exception
{code}
java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:369)
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:347)
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder.getBlockListAsLongs(BlockListAsLongs.java:325)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:190)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:473)
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
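[Editor's note] For context, a paraphrased sketch of the DataNode's report-splitting decision that makes this "Not A Problem" (the real logic lives in {{BPServiceActor}}; the threshold default and the class/method names below are assumptions for illustration):
{code}
/** Paraphrase of the block-report split decision; not the real source. */
public class BlockReportSplitSketch {
  // Assumed default of dfs.blockreport.split.threshold: 1,000,000 blocks.
  static final long DEFAULT_SPLIT_THRESHOLD = 1_000_000L;

  // Below the threshold, all per-storage reports ride in a single RPC;
  // at or above it, the DataNode sends one RPC per storage, keeping each
  // protobuf message under the size limit seen in the stack trace above.
  static boolean sendAsSingleRpc(long totalBlocks, long splitThreshold) {
    return totalBlocks < splitThreshold;
  }

  public static void main(String[] args) {
    System.out.println(sendAsSingleRpc(500_000L, DEFAULT_SPLIT_THRESHOLD));   // true
    System.out.println(sendAsSingleRpc(2_000_000L, DEFAULT_SPLIT_THRESHOLD)); // false
  }
}
{code}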
[jira] [Created] (HDFS-8853) Erasure Coding: Provide ECSchema validation when creating ECZone
Rakesh R created HDFS-8853: -- Summary: Erasure Coding: Provide ECSchema validation when creating ECZone Key: HDFS-8853 URL: https://issues.apache.org/jira/browse/HDFS-8853 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Presently {{DFS#createErasureCodingZone(path, ecSchema, cellSize)}} doesn't validate that the given {{ecSchema}} is available in the {{ErasureCodingSchemaManager#activeSchemas}} list. Currently, if it doesn't exist, the ECZone will be created with a {{null}} schema. IMHO we could improve this by doing the necessary basic sanity checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
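[Editor's note] A minimal sketch of the kind of sanity check being proposed, with a stand-in map for the active-schema table (the real lookup would go through {{ErasureCodingSchemaManager#activeSchemas}}; all names below are illustrative, not the actual API):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative validator; not the actual ErasureCodingSchemaManager API. */
public class EcSchemaValidatorSketch {
  private final Map<String, Object> activeSchemas = new ConcurrentHashMap<>();

  // Reject unknown schemas up front instead of creating a zone with a
  // null schema, as described in the JIRA.
  void checkSchemaActive(String schemaName) {
    if (!activeSchemas.containsKey(schemaName)) {
      throw new IllegalArgumentException("Schema '" + schemaName
          + "' is not in the active schema list; refusing to create ECZone");
    }
  }
}
{code}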
[jira] [Commented] (HDFS-8693) refreshNamenodes does not support adding a new standby to a running DN
[ https://issues.apache.org/jira/browse/HDFS-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653147#comment-14653147 ] Ajith S commented on HDFS-8693: --- Hi [~kihwal] I tested with a federated HA cluster; when adding a new nameservice, the command works. Is there any special scenario you had in mind when you said it doesn't work for a federated HA cluster? refreshNamenodes does not support adding a new standby to a running DN -- Key: HDFS-8693 URL: https://issues.apache.org/jira/browse/HDFS-8693 Project: Hadoop HDFS Issue Type: Bug Components: datanode, ha Affects Versions: 2.6.0 Reporter: Jian Fang Priority: Critical I tried to run the following command on a Hadoop 2.6.0 cluster with HA support $ hdfs dfsadmin -refreshNamenodes datanode-host:port to refresh the name nodes on data nodes after I replaced one name node with a new one, so that I don't need to restart the data nodes. However, I got the following error: refreshNamenodes: HA does not currently support adding a new standby to a running DN. Please do a rolling restart of DNs to reconfigure the list of NNs. I checked the 2.6.0 code and the error was thrown by the following code snippet, which led me to this JIRA.
void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
  Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
  for (BPServiceActor actor : bpServices) {
    oldAddrs.add(actor.getNNSocketAddress());
  }
  Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
  if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
    // Keep things simple for now -- we can implement this at a later date.
    throw new IOException(
        "HA does not currently support adding a new standby to a running DN. "
        + "Please do a rolling restart of DNs to reconfigure the list of NNs.");
  }
}
Looks like the refreshNamenodes command is an incomplete feature. Unfortunately, the new name node on a replacement instance is critical for auto-provisioning a Hadoop cluster with HDFS HA support. Without this support, the HA feature cannot really be used. I also observed that the new standby name node on the replacement instance could get stuck in safe mode because no data nodes check in with it. Even with a rolling restart, it may take quite some time to restart all data nodes in a big cluster, for example with 4000 data nodes, not to mention that restarting DNs is far too intrusive and not a preferable operation in production. It also increases the chance of a double failure because the standby name node is not really ready for a failover if the current active name node fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653156#comment-14653156 ] Hadoop QA commented on HDFS-8838: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748613/h8838_20150731-HDFS-7285.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c3364ca | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11896/console | This message was automatically generated. Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653136#comment-14653136 ] Li Bo commented on HDFS-8838: - hi, [~walter.k.su] and [~szetszwo], could you help me review the patch of HDFS-8704 if you have time? Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-8838: Status: Patch Available (was: Open) Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8220) Erasure Coding: StripedDataStreamer fails to handle the blocklocations which doesn't satisfy BlockGroupSize
[ https://issues.apache.org/jira/browse/HDFS-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R updated HDFS-8220: --- Attachment: HDFS-8220-HDFS-7285-10.patch Erasure Coding: StripedDataStreamer fails to handle the blocklocations which doesn't satisfy BlockGroupSize --- Key: HDFS-8220 URL: https://issues.apache.org/jira/browse/HDFS-8220 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Assignee: Rakesh R Attachments: HDFS-8220-001.patch, HDFS-8220-002.patch, HDFS-8220-003.patch, HDFS-8220-004.patch, HDFS-8220-HDFS-7285-09.patch, HDFS-8220-HDFS-7285-10.patch, HDFS-8220-HDFS-7285.005.patch, HDFS-8220-HDFS-7285.006.patch, HDFS-8220-HDFS-7285.007.patch, HDFS-8220-HDFS-7285.007.patch, HDFS-8220-HDFS-7285.008.patch During write operations {{StripedDataStreamer#locateFollowingBlock}} fails to validate the available datanodes against the {{BlockGroupSize}}. Please see the exception to understand more:
{code}
2015-04-22 14:56:11,313 WARN hdfs.DFSClient (DataStreamer.java:run(538)) - DataStreamer Exception
java.lang.NullPointerException
at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:374)
at org.apache.hadoop.hdfs.StripedDataStreamer.locateFollowingBlock(StripedDataStreamer.java:157)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1332)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:424)
at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:1)
2015-04-22 14:56:11,313 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1718)) - Shutting down the Mini HDFS Cluster
2015-04-22 14:56:11,313 ERROR hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(608)) - Failed to close inode 16387
java.io.IOException: DataStreamer Exception:
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:544)
at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:1)
Caused by: java.lang.NullPointerException
at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:374)
at org.apache.hadoop.hdfs.StripedDataStreamer.locateFollowingBlock(StripedDataStreamer.java:157)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1332)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:424)
... 1 more
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HDFS-8693) refreshNamenodes does not support adding a new standby to a running DN
[ https://issues.apache.org/jira/browse/HDFS-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajith S reassigned HDFS-8693: - Assignee: Ajith S refreshNamenodes does not support adding a new standby to a running DN -- Key: HDFS-8693 URL: https://issues.apache.org/jira/browse/HDFS-8693 Project: Hadoop HDFS Issue Type: Bug Components: datanode, ha Affects Versions: 2.6.0 Reporter: Jian Fang Assignee: Ajith S Priority: Critical I tried to run the following command on a Hadoop 2.6.0 cluster with HA support $ hdfs dfsadmin -refreshNamenodes datanode-host:port to refresh the name nodes on data nodes after I replaced one name node with a new one, so that I don't need to restart the data nodes. However, I got the following error: refreshNamenodes: HA does not currently support adding a new standby to a running DN. Please do a rolling restart of DNs to reconfigure the list of NNs. I checked the 2.6.0 code and the error was thrown by the following code snippet, which led me to this JIRA.
void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
  Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
  for (BPServiceActor actor : bpServices) {
    oldAddrs.add(actor.getNNSocketAddress());
  }
  Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
  if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
    // Keep things simple for now -- we can implement this at a later date.
    throw new IOException(
        "HA does not currently support adding a new standby to a running DN. "
        + "Please do a rolling restart of DNs to reconfigure the list of NNs.");
  }
}
Looks like the refreshNamenodes command is an incomplete feature. Unfortunately, the new name node on a replacement instance is critical for auto-provisioning a Hadoop cluster with HDFS HA support. Without this support, the HA feature cannot really be used. I also observed that the new standby name node on the replacement instance could get stuck in safe mode because no data nodes check in with it. Even with a rolling restart, it may take quite some time to restart all data nodes in a big cluster, for example with 4000 data nodes, not to mention that restarting DNs is far too intrusive and not a preferable operation in production. It also increases the chance of a double failure because the standby name node is not really ready for a failover if the current active name node fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8693) refreshNamenodes does not support adding a new standby to a running DN
[ https://issues.apache.org/jira/browse/HDFS-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653150#comment-14653150 ] Ajith S commented on HDFS-8693: --- Hi [~john.jian.fang] and [~kihwal] Agreed, we need to fix refreshNamenodes. In refreshNNList, can we just add a new NN actor and replace the old NN actor in the block pool service? I would like to work on this issue :) refreshNamenodes does not support adding a new standby to a running DN -- Key: HDFS-8693 URL: https://issues.apache.org/jira/browse/HDFS-8693 Project: Hadoop HDFS Issue Type: Bug Components: datanode, ha Affects Versions: 2.6.0 Reporter: Jian Fang Priority: Critical I tried to run the following command on a Hadoop 2.6.0 cluster with HA support $ hdfs dfsadmin -refreshNamenodes datanode-host:port to refresh the name nodes on data nodes after I replaced one name node with a new one, so that I don't need to restart the data nodes. However, I got the following error: refreshNamenodes: HA does not currently support adding a new standby to a running DN. Please do a rolling restart of DNs to reconfigure the list of NNs. I checked the 2.6.0 code and the error was thrown by the following code snippet, which led me to this JIRA.
void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
  Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
  for (BPServiceActor actor : bpServices) {
    oldAddrs.add(actor.getNNSocketAddress());
  }
  Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
  if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
    // Keep things simple for now -- we can implement this at a later date.
    throw new IOException(
        "HA does not currently support adding a new standby to a running DN. "
        + "Please do a rolling restart of DNs to reconfigure the list of NNs.");
  }
}
Looks like the refreshNamenodes command is an incomplete feature. Unfortunately, the new name node on a replacement instance is critical for auto-provisioning a Hadoop cluster with HDFS HA support. Without this support, the HA feature cannot really be used. I also observed that the new standby name node on the replacement instance could get stuck in safe mode because no data nodes check in with it. Even with a rolling restart, it may take quite some time to restart all data nodes in a big cluster, for example with 4000 data nodes, not to mention that restarting DNs is far too intrusive and not a preferable operation in production. It also increases the chance of a double failure because the standby name node is not really ready for a failover if the current active name node fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
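[Editor's note] A toy model of the "swap the actor" idea floated in the comment above, assuming nothing about the real fix. {{BPServiceActor}} here is a stand-in class, not the HDFS one, and the diff-then-start/stop flow is purely hypothetical:
{code}
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy sketch only; the real BPServiceActor lifecycle is more involved. */
public class RefreshNNListSketch {
  static class BPServiceActor {
    final InetSocketAddress nnAddr;
    BPServiceActor(InetSocketAddress addr) { this.nnAddr = addr; }
    void start() { /* connect + register with the new NN */ }
    void stop()  { /* tear down the connection to the removed NN */ }
  }

  final List<BPServiceActor> bpServices = new ArrayList<>();

  void refreshNNList(List<InetSocketAddress> addrs) {
    Set<InetSocketAddress> oldAddrs = new HashSet<>();
    for (BPServiceActor actor : bpServices) {
      oldAddrs.add(actor.nnAddr);
    }
    Set<InetSocketAddress> newAddrs = new HashSet<>(addrs);

    // Stop actors whose NN disappeared from the new list.
    bpServices.removeIf(actor -> {
      if (!newAddrs.contains(actor.nnAddr)) { actor.stop(); return true; }
      return false;
    });
    // Start actors for NNs that are new, instead of throwing IOException.
    for (InetSocketAddress addr : newAddrs) {
      if (!oldAddrs.contains(addr)) {
        BPServiceActor actor = new BPServiceActor(addr);
        bpServices.add(actor);
        actor.start();
      }
    }
  }
}
{code}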
[jira] [Commented] (HDFS-8574) When block count for a volume exceeds dfs.blockreport.split.threshold, block report causes exception
[ https://issues.apache.org/jira/browse/HDFS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653196#comment-14653196 ] Ajith S commented on HDFS-8574: --- Closing this issue as per the comments above. When block count for a volume exceeds dfs.blockreport.split.threshold, block report causes exception Key: HDFS-8574 URL: https://issues.apache.org/jira/browse/HDFS-8574 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.0 Reporter: Ajith S Assignee: Ajith S This piece of code in {{org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport()}} {code}
// Send one block report per message.
for (int r = 0; r < reports.length; r++) {
  StorageBlockReport singleReport[] = { reports[r] };
  DatanodeCommand cmd = bpNamenode.blockReport(
      bpRegistration, bpos.getBlockPoolId(), singleReport,
      new BlockReportContext(reports.length, r, reportId));
  numReportsSent++;
  numRPCs++;
  if (cmd != null) {
    cmds.add(cmd);
  }
{code}
when a single volume contains many blocks, i.e. more than the threshold, it tries to send the entire block report in one RPC, causing the exception
{code}
java.lang.IllegalStateException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:369)
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder$1.next(BlockListAsLongs.java:347)
at org.apache.hadoop.hdfs.protocol.BlockListAsLongs$BufferDecoder.getBlockListAsLongs(BlockListAsLongs.java:325)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:190)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:473)
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8663) sys cpu usage high on namenode server
[ https://issues.apache.org/jira/browse/HDFS-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653114#comment-14653114 ] tangjunjie commented on HDFS-8663: -- For HDFS, the mapping of users to groups is performed on the NameNode. Thus, the host system configuration of the NameNode determines the group mappings for the users. So the users should be created on the NameNode host. More details can be found at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Group_Mapping sys cpu usage high on namenode server - Key: HDFS-8663 URL: https://issues.apache.org/jira/browse/HDFS-8663 Project: Hadoop HDFS Issue Type: Bug Components: fs, namenode Affects Versions: 2.3.0 Environment: hadoop 2.3.0 centos5.8 Reporter: tangjunjie High sys CPU usage on the NameNode server made jobs run very slowly. Using ps -elf I saw many zombie processes. Checking the HDFS logs, I found many exceptions like:
org.apache.hadoop.util.Shell$ExitCodeException: id: sem_410: No such user
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:83)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:52)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:139)
at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1409)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.<init>(FSPermissionChecker.java:81)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getPermissionChecker(FSNamesystem.java:3310)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3491)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:764)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:764)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980)
Then I created every user (such as sem_410) appearing in the exceptions, and the sys CPU usage on the NameNode went down. BTW, my Hadoop 2.3.0 has Hadoop ACLs enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
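[Editor's note] A small sketch, assuming the Hadoop 2.x common APIs, of how to check on the NameNode host which groups HDFS will resolve for a user. With the default {{ShellBasedUnixGroupsMapping}}, this forks the {{id}} command, which is exactly what fails (and burns sys CPU) when the user does not exist locally:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Groups;

/** Run on the NameNode host to inspect the group mapping HDFS will use. */
public class GroupLookupCheck {
  public static void main(String[] args) throws Exception {
    String user = args.length > 0 ? args[0] : "sem_410";
    Groups groups = Groups.getUserToGroupsMappingService(new Configuration());
    // Throws an exception like the one in the report above if the shell-based
    // mapping cannot resolve the user on this host.
    System.out.println(user + " -> " + groups.getGroups(user));
  }
}
{code}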
[jira] [Assigned] (HDFS-8853) Erasure Coding: Provide ECSchema validation when creating ECZone
[ https://issues.apache.org/jira/browse/HDFS-8853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R reassigned HDFS-8853: -- Assignee: Rakesh R Erasure Coding: Provide ECSchema validation when creating ECZone Key: HDFS-8853 URL: https://issues.apache.org/jira/browse/HDFS-8853 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Assignee: Rakesh R Presently {{DFS#createErasureCodingZone(path, ecSchema, cellSize)}} doesn't validate that the given {{ecSchema}} is available in the {{ErasureCodingSchemaManager#activeSchemas}} list. Currently, if it doesn't exist, the ECZone will be created with a {{null}} schema. IMHO we could improve this by doing the necessary basic sanity checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8853) Erasure Coding: Provide ECSchema validation when creating ECZone
[ https://issues.apache.org/jira/browse/HDFS-8853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R updated HDFS-8853: --- Assignee: J.Andreina (was: Rakesh R) Erasure Coding: Provide ECSchema validation when creating ECZone Key: HDFS-8853 URL: https://issues.apache.org/jira/browse/HDFS-8853 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Assignee: J.Andreina Presently {{DFS#createErasureCodingZone(path, ecSchema, cellSize)}} doesn't validate that the given {{ecSchema}} is available in the {{ErasureCodingSchemaManager#activeSchemas}} list. Currently, if it doesn't exist, the ECZone will be created with a {{null}} schema. IMHO we could improve this by doing the necessary basic sanity checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8220) Erasure Coding: StripedDataStreamer fails to handle the blocklocations which doesn't satisfy BlockGroupSize
[ https://issues.apache.org/jira/browse/HDFS-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653155#comment-14653155 ] Rakesh R commented on HDFS-8220: Thanks [~szetszwo], [~walter.k.su]. Attached another patch addressing the comments. Erasure Coding: StripedDataStreamer fails to handle the blocklocations which doesn't satisfy BlockGroupSize --- Key: HDFS-8220 URL: https://issues.apache.org/jira/browse/HDFS-8220 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Assignee: Rakesh R Attachments: HDFS-8220-001.patch, HDFS-8220-002.patch, HDFS-8220-003.patch, HDFS-8220-004.patch, HDFS-8220-HDFS-7285-09.patch, HDFS-8220-HDFS-7285-10.patch, HDFS-8220-HDFS-7285.005.patch, HDFS-8220-HDFS-7285.006.patch, HDFS-8220-HDFS-7285.007.patch, HDFS-8220-HDFS-7285.007.patch, HDFS-8220-HDFS-7285.008.patch During write operations {{StripedDataStreamer#locateFollowingBlock}} fails to validate the available datanodes against the {{BlockGroupSize}}. Please see the exception to understand more:
{code}
2015-04-22 14:56:11,313 WARN hdfs.DFSClient (DataStreamer.java:run(538)) - DataStreamer Exception
java.lang.NullPointerException
at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:374)
at org.apache.hadoop.hdfs.StripedDataStreamer.locateFollowingBlock(StripedDataStreamer.java:157)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1332)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:424)
at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:1)
2015-04-22 14:56:11,313 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1718)) - Shutting down the Mini HDFS Cluster
2015-04-22 14:56:11,313 ERROR hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(608)) - Failed to close inode 16387
java.io.IOException: DataStreamer Exception:
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:544)
at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:1)
Caused by: java.lang.NullPointerException
at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:374)
at org.apache.hadoop.hdfs.StripedDataStreamer.locateFollowingBlock(StripedDataStreamer.java:157)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1332)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:424)
... 1 more
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8851) datanode fails to start due to a bad disk
[ https://issues.apache.org/jira/browse/HDFS-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653205#comment-14653205 ] Wang Hao commented on HDFS-8851: data12 is the bad disk:
[hadoop@hadoop070 data12]$ ll
ls: reading directory .: Input/output error
total 0
datanode fails to start due to a bad disk - Key: HDFS-8851 URL: https://issues.apache.org/jira/browse/HDFS-8851 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.5.1 Reporter: Wang Hao The data node cannot start due to a bad disk. I found a similar issue reported as HDFS-6245, but our situation is different. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653256#comment-14653256 ] Hadoop QA commented on HDFS-8704: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 30s | Findbugs (version ) appears to be broken on HDFS-7285. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 14s | The applied patch generated 1 release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 37s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 3m 25s | The patch appears to introduce 6 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 13s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 174m 30s | Tests failed in hadoop-hdfs. | | | | 216m 54s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs | | Failed unit tests | hadoop.hdfs.TestCrcCorruption | | | hadoop.hdfs.TestLeaseRecovery2 | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure | | | hadoop.hdfs.server.namenode.TestFileTruncate | | | hadoop.hdfs.TestWriteStripedFileWithFailure | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748603/HDFS-8704-HDFS-7285-004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | HDFS-7285 / 9312b16 | | Release Audit | https://builds.apache.org/job/PreCommit-HDFS-Build/11894/artifact/patchprocess/patchReleaseAuditProblems.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/11894/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11894/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11894/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11894/console | This message was automatically generated. Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch I tested the current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, the client succeeds in writing a file smaller than a block group but fails to write a larger one. {{TestDFSStripeOutputStreamWithFailure}} only tests files smaller than a block group; this JIRA will add more test scenarios. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8854) Erasure coding: Move cellSize inside ECSchema
[ https://issues.apache.org/jira/browse/HDFS-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653252#comment-14653252 ] Walter Su commented on HDFS-8854: - In the world of HDFS, cellSize always goes around with ecSchema. Why not just move cellSize inside ecSchema? Advantages: 1. shorter args lists, cleaner code. 2. store one field in BlockInfoStriped instead of two fields, so less memory usage. Other influences (not disadvantages, just influences): 1. cellSize is not needed by the coder in hadoop-common. It doesn't matter; the coder can just ignore it. 2. cellSize is configured on the server side, instead of the client side. cellSize is not given by the user. Arbitrary values of cellSize, like 63k or 67k, are meaningless. cellSize can be 64k, 128k, 256k, ... ecSchema is loaded from a file on the server side, together with cellSize. Erasure coding: Move cellSize inside ECSchema - Key: HDFS-8854 URL: https://issues.apache.org/jira/browse/HDFS-8854 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Walter Su Assignee: Walter Su -- This message was sent by Atlassian JIRA (v6.3.4#6332)
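[Editor's note] A minimal sketch of the proposal above: a schema object that carries its own cellSize. The class and field names are hypothetical (the real {{ECSchema}} lives in hadoop-common); the power-of-two check mirrors the 64k/128k/256k point in the comment:
{code}
/** Sketch of an EC schema that embeds cellSize; names illustrative only. */
public final class ECSchemaWithCellSize {
  private final String codecName;   // e.g. "rs"
  private final int numDataUnits;   // e.g. 6
  private final int numParityUnits; // e.g. 3
  private final int cellSize;       // e.g. 64 * 1024

  public ECSchemaWithCellSize(String codecName, int numDataUnits,
      int numParityUnits, int cellSize) {
    // Reject arbitrary sizes like 63k or 67k, per the comment above.
    if (cellSize <= 0 || (cellSize & (cellSize - 1)) != 0) {
      throw new IllegalArgumentException("cellSize must be a power of two");
    }
    this.codecName = codecName;
    this.numDataUnits = numDataUnits;
    this.numParityUnits = numParityUnits;
    this.cellSize = cellSize;
  }

  public int getCellSize() { return cellSize; }

  @Override public String toString() {
    return codecName.toUpperCase() + "-" + numDataUnits + "-" + numParityUnits
        + ", cellSize=" + cellSize;
  }
}
{code}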
[jira] [Commented] (HDFS-8762) Erasure Coding: the log of each streamer should show its index
[ https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653299#comment-14653299 ] Li Bo commented on HDFS-8762: - Moved this JIRA from HDFS-8031 to HDFS-7285. Because there are 9 data streamers in striped writing, it's difficult to find the error if you don't know which streamer a log entry was generated from. Erasure Coding: the log of each streamer should show its index -- Key: HDFS-8762 URL: https://issues.apache.org/jira/browse/HDFS-8762 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8762-HDFS-7285-001.patch The log in {{DataStreamer}} doesn't show which streamer it's generated from. In order to make log information more convenient for debugging, each log should include the index of the streamer it's generated from. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
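[Editor's note] A minimal sketch of the idea, assuming nothing about the actual patch: wrap each log string with the streamer's position in the block group so logs from 9 concurrent streamers stay attributable. All names below are illustrative:
{code}
/** Illustrative only; the real change wraps DataStreamer's log calls. */
public class StreamerLogPrefixSketch {
  private final int index; // position of this streamer in the block group

  StreamerLogPrefixSketch(int index) { this.index = index; }

  // Every message carries the streamer index, so an error in one of the
  // 9 concurrent striped-write streamers can be traced to its source.
  private String withIndex(String msg) {
    return "[streamer #" + index + "] " + msg;
  }

  public static void main(String[] args) {
    StreamerLogPrefixSketch s = new StreamerLogPrefixSketch(3);
    System.out.println(s.withIndex("DataStreamer Exception: ..."));
  }
}
{code}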
[jira] [Commented] (HDFS-8762) Erasure Coding: the log of each streamer should show its index
[ https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1465#comment-1465 ] Li Bo commented on HDFS-8762: - hi, [~zhz] The Jenkins test of HDFS-8704 fails, but I can't reproduce it in my environment. It's difficult to find the reason from the log strings. So I moved this JIRA to HDFS-7285 and hope it can be committed soon. Could you review the patch if you have time? The patch changes very little logic; it just wraps each log string. Thanks. Erasure Coding: the log of each streamer should show its index -- Key: HDFS-8762 URL: https://issues.apache.org/jira/browse/HDFS-8762 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8762-HDFS-7285-001.patch, HDFS-8762-HDFS-7285-002.patch The log in {{DataStreamer}} doesn't show which streamer it's generated from. In order to make log information more convenient for debugging, each log should include the index of the streamer it's generated from. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8220) Erasure Coding: StripedDataStreamer fails to handle the blocklocations which doesn't satisfy BlockGroupSize
[ https://issues.apache.org/jira/browse/HDFS-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653356#comment-14653356 ] Hadoop QA commented on HDFS-8220: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 15m 45s | Findbugs (version ) appears to be broken on HDFS-7285. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 9m 45s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 12m 3s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 18s | The applied patch generated 1 release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 40s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 55s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 42s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 4m 18s | The patch appears to introduce 5 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 4m 15s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 118m 0s | Tests failed in hadoop-hdfs. | | | | 167m 48s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs | | Failed unit tests | hadoop.hdfs.TestWriteStripedFileWithFailure | | | hadoop.hdfs.server.namenode.TestFileTruncate | | | hadoop.hdfs.TestReplaceDatanodeOnFailure | | Timed out tests | org.apache.hadoop.hdfs.TestDFSClientRetries | | | org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748614/HDFS-8220-HDFS-7285-10.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | HDFS-7285 / 9312b16 | | Release Audit | https://builds.apache.org/job/PreCommit-HDFS-Build/11897/artifact/patchprocess/patchReleaseAuditProblems.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/11897/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11897/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11897/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11897/console | This message was automatically generated. 
Erasure Coding: StripedDataStreamer fails to handle the blocklocations which doesn't satisfy BlockGroupSize --- Key: HDFS-8220 URL: https://issues.apache.org/jira/browse/HDFS-8220 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Assignee: Rakesh R Attachments: HDFS-8220-001.patch, HDFS-8220-002.patch, HDFS-8220-003.patch, HDFS-8220-004.patch, HDFS-8220-HDFS-7285-09.patch, HDFS-8220-HDFS-7285-10.patch, HDFS-8220-HDFS-7285.005.patch, HDFS-8220-HDFS-7285.006.patch, HDFS-8220-HDFS-7285.007.patch, HDFS-8220-HDFS-7285.007.patch, HDFS-8220-HDFS-7285.008.patch During write operations {{StripedDataStreamer#locateFollowingBlock}} fails to validate the available datanodes against the {{BlockGroupSize}}. Please see the exception to understand more:
{code}
2015-04-22 14:56:11,313 WARN hdfs.DFSClient (DataStreamer.java:run(538)) - DataStreamer Exception
java.lang.NullPointerException
at java.util.concurrent.LinkedBlockingQueue.offer(LinkedBlockingQueue.java:374)
at org.apache.hadoop.hdfs.StripedDataStreamer.locateFollowingBlock(StripedDataStreamer.java:157)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1332)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:424)
at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:1)
2015-04-22 14:56:11,313 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1718)) - Shutting down the Mini HDFS Cluster
2015-04-22 14:56:11,313 ERROR hdfs.DFSClient (DFSClient.java:closeAllFilesBeingWritten(608)) - Failed to close inode 16387
java.io.IOException: DataStreamer Exception:
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:544)
at
[jira] [Commented] (HDFS-8704) Erasure Coding: client fails to write large file when one datanode fails
[ https://issues.apache.org/jira/browse/HDFS-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653392#comment-14653392 ] Hadoop QA commented on HDFS-8704: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 16m 12s | Findbugs (version ) appears to be broken on HDFS-7285. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 6s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 18s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 15s | The applied patch generated 1 release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 41s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 41s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 3m 35s | The patch appears to introduce 6 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | native | 3m 20s | Pre-build of native portion | | {color:red}-1{color} | hdfs tests | 179m 10s | Tests failed in hadoop-hdfs. | | | | 223m 58s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-hdfs | | Failed unit tests | hadoop.hdfs.server.namenode.TestFileTruncate | | | hadoop.hdfs.TestWriteStripedFileWithFailure | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12748610/HDFS-8704-HDFS-7285-005.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | HDFS-7285 / 9312b16 | | Release Audit | https://builds.apache.org/job/PreCommit-HDFS-Build/11895/artifact/patchprocess/patchReleaseAuditProblems.txt | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/11895/artifact/patchprocess/whitespace.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/11895/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/11895/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/11895/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/11895/console | This message was automatically generated. Erasure Coding: client fails to write large file when one datanode fails Key: HDFS-8704 URL: https://issues.apache.org/jira/browse/HDFS-8704 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8704-000.patch, HDFS-8704-HDFS-7285-002.patch, HDFS-8704-HDFS-7285-003.patch, HDFS-8704-HDFS-7285-004.patch, HDFS-8704-HDFS-7285-005.patch I tested the current code on a 5-node cluster using RS(3,2). When a datanode is corrupt, the client succeeds in writing a file smaller than a block group but fails to write a larger one. 
{{TestDFSStripedOutputStreamWithFailure}} currently tests only files smaller than a block group; this jira will add more test situations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653492#comment-14653492 ] Hudson commented on HDFS-8850: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1007 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1007/]) HDFS-8850. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks. (Colin Patrick McCabe via yliu) (yliu: rev 9a0806824e8982ec3b47315f9b960474186a) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8762) Erasure Coding: the log of each streamer should show its index
[ https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8762: Parent Issue: HDFS-7285 (was: HDFS-8031) Erasure Coding: the log of each streamer should show its index -- Key: HDFS-8762 URL: https://issues.apache.org/jira/browse/HDFS-8762 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8762-HDFS-7285-001.patch The log in {{DataStreamer}} doesn't show which streamer it's generated from. In order to make log information more convenient for debugging, each log should include the index of the streamer it's generated from. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
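A minimal sketch of the idea, assuming the commons-logging style that {{DataStreamer}} used at the time; the wrapper class and its methods are hypothetical, not the attached patch:
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/** Hypothetical helper: tag every streamer log line with its index. */
class StreamerLog {
  private static final Log LOG = LogFactory.getLog(StreamerLog.class);
  private final int index; // this streamer's position in the block group

  StreamerLog(int index) {
    this.index = index;
  }

  void warn(String msg, Throwable t) {
    // e.g. "[streamer #3] DataStreamer Exception" instead of an anonymous
    // message that could have come from any streamer in the group
    LOG.warn("[streamer #" + index + "] " + msg, t);
  }
}
{code}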
[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653635#comment-14653635 ] Hudson commented on HDFS-8850: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2204 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2204/]) HDFS-8850. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks. (Colin Patrick McCabe via yliu) (yliu: rev 9a0806824e8982ec3b47315f9b960474186a) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653756#comment-14653756 ] Hudson commented on HDFS-8850: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #266 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/266/]) HDFS-8850. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks. (Colin Patrick McCabe via yliu) (yliu: rev 9a0806824e8982ec3b47315f9b960474186a) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8854) Erasure coding: Move cellSize inside ECSchema
Walter Su created HDFS-8854: --- Summary: Erasure coding: Move cellSize inside ECSchema Key: HDFS-8854 URL: https://issues.apache.org/jira/browse/HDFS-8854 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Walter Su Assignee: Walter Su -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653242#comment-14653242 ] Yi Liu commented on HDFS-8850: -- Test failure is unrelated. Committed to trunk and branch-2. Thanks [~cmccabe] for the contribution. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653253#comment-14653253 ] Hudson commented on HDFS-8850: -- FAILURE: Integrated in Hadoop-trunk-Commit #8259 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8259/]) HDFS-8850. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks. (Colin Patrick McCabe via yliu) (yliu: rev 9a0806824e8982ec3b47315f9b960474186a) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated HDFS-8850: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.8.0 Status: Resolved (was: Patch Available) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8762) Erasure Coding: the log of each streamer should show its index
[ https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8762: Attachment: HDFS-8762-HDFS-7285-002.patch Erasure Coding: the log of each streamer should show its index -- Key: HDFS-8762 URL: https://issues.apache.org/jira/browse/HDFS-8762 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8762-HDFS-7285-001.patch, HDFS-8762-HDFS-7285-002.patch The log in {{DataStreamer}} doesn't show which streamer it's generated from. In order to make log information more convenient for debugging, each log should include the index of the streamer it's generated from. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-1148) Convert FSDataset to ReadWriteLock
[ https://issues.apache.org/jira/browse/HDFS-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Zhang updated HDFS-1148: - Attachment: HDFS-1148.001.patch First new patch Convert FSDataset to ReadWriteLock -- Key: HDFS-1148 URL: https://issues.apache.org/jira/browse/HDFS-1148 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, performance Reporter: Todd Lipcon Assignee: Yong Zhang Attachments: HDFS-1148.001.patch, hdfs-1148-old.txt, hdfs-1148-trunk.txt, patch-HDFS-1148-rel0.20.2.txt In benchmarking HDFS-941 I noticed that for the random read workload, the FSDataset lock is highly contended. After converting it to a ReentrantReadWriteLock, I saw a ~25% improvement on both latency and ops/second. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
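The conversion being benchmarked follows the standard {{ReentrantReadWriteLock}} pattern, sketched below with a toy map standing in for the real FSDataset state (not the attached patch): many concurrent readers share the read lock, while mutations still serialize behind the write lock.
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy sketch of the FSDataset locking change, not the real class. */
class BlockMap {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<Long, String> blockToFile = new HashMap<>();

  /** Random reads only need the shared read lock. */
  String getFile(long blockId) {
    lock.readLock().lock();
    try {
      return blockToFile.get(blockId);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Mutations still serialize behind the exclusive write lock. */
  void addBlock(long blockId, String file) {
    lock.writeLock().lock();
    try {
      blockToFile.put(blockId, file);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}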
[jira] [Commented] (HDFS-8775) SASL support for data transfer protocol in libhdfspp
[ https://issues.apache.org/jira/browse/HDFS-8775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653801#comment-14653801 ] James Clampffer commented on HDFS-8775: --- I'm still learning more about the actual security protocol so I'll have to wait until I can make a good judgement there. In the meantime, is there a reason to alternate between taking std::strings as parameters in the same function by pointer and by reference? e.g. class DigestMD5Authenticator { ... private: static size_t NextToken(const std::string payload, size_t off, std::string *tok); I think it would make things more maintainable to pick one and stick with it; references are typically implemented as pointers so the performance should be the same in either case. SASL support for data transfer protocol in libhdfspp Key: HDFS-8775 URL: https://issues.apache.org/jira/browse/HDFS-8775 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Haohui Mai Assignee: Haohui Mai Attachments: HDFS-8775.000.patch This jira proposes to implement basic SASL support for the data transfer protocol which allows libhdfspp to talk to secure clusters. Support for encryption is deferred to subsequent jiras. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-8838: -- Attachment: h8838_20150804.patch h8838_20150804.patch: reverts MiniDFSCluster change and adds public to TestBase. Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch, h8838_20150804.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8856) Make LeaseManager#countPath O(1)
Arpit Agarwal created HDFS-8856: --- Summary: Make LeaseManager#countPath O(1) Key: HDFS-8856 URL: https://issues.apache.org/jira/browse/HDFS-8856 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.2 Reporter: Arpit Agarwal Assignee: Arpit Agarwal {{LeaseManager#countPath}} loops over all existing lease holders to compute the pending lease count. We can just track the pending leased files so it runs in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
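A minimal sketch of the proposed bookkeeping (hypothetical names, not the actual patch): update the count on every lease mutation so the query becomes a field read.
{code}
/** Hypothetical sketch: count leased paths incrementally, not by scanning. */
class LeaseCounter {
  private long numLeasedPaths; // kept current by the mutators below

  synchronized void addLeasedPath() {
    numLeasedPaths++;
  }

  synchronized void removeLeasedPath() {
    numLeasedPaths--;
  }

  /** Previously O(lease holders x paths); now a constant-time read. */
  synchronized long countPath() {
    return numLeasedPaths;
  }
}
{code}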
[jira] [Updated] (HDFS-8856) Make LeaseManager#countPath O(1)
[ https://issues.apache.org/jira/browse/HDFS-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-8856: Status: Patch Available (was: Open) Make LeaseManager#countPath O(1) Key: HDFS-8856 URL: https://issues.apache.org/jira/browse/HDFS-8856 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.2 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Attachments: HDFS-8856.01.patch {{LeaseManager#countPath}} loops over all existing lease holders to compute the pending lease count. We can just track the pending leased files so it runs in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8856) Make LeaseManager#countPath O(1)
[ https://issues.apache.org/jira/browse/HDFS-8856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-8856: Attachment: HDFS-8856.01.patch Also remove redundant call to {{getNumUnderConstructionBlocks}} on startup. Make LeaseManager#countPath O(1) Key: HDFS-8856 URL: https://issues.apache.org/jira/browse/HDFS-8856 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.2 Reporter: Arpit Agarwal Assignee: Arpit Agarwal Attachments: HDFS-8856.01.patch {{LeaseManager#countPath}} loops over all existing lease holders to compute the pending lease count. We can just track the pending leased files so it runs in constant time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz Wo Nicholas Sze updated HDFS-8838: -- Status: Patch Available (was: Open) Tolerate datanode failures in DFSStripedOutputStream when the data length is small -- Key: HDFS-8838 URL: https://issues.apache.org/jira/browse/HDFS-8838 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs-client Reporter: Tsz Wo Nicholas Sze Assignee: Tsz Wo Nicholas Sze Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch, h8838_20150804-HDFS-7285.patch, h8838_20150804.patch Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8772) fix TestStandbyIsHot#testDatanodeRestarts which occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Su updated HDFS-8772: Attachment: HDFS-8772.04.patch And I'm pretty sure a 3s waitFor is enough; TestProcessCorruptBlocks even sleeps 3s to wait for a full block report. I changed it to 10s to be safe, per [~brahmareddy]'s comment. fix TestStandbyIsHot#testDatanodeRestarts which occasionally fails Key: HDFS-8772 URL: https://issues.apache.org/jira/browse/HDFS-8772 Project: Hadoop HDFS Issue Type: Bug Reporter: Walter Su Assignee: Walter Su Attachments: HDFS-8772.01.patch, HDFS-8772.02.patch, HDFS-8772.03.patch, HDFS-8772.04.patch https://builds.apache.org/job/PreCommit-HDFS-Build/11596/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11598/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11599/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11600/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11606/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11608/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11612/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11618/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11650/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11655/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11659/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11663/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11664/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11667/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11669/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11676/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11677/testReport/ {noformat} java.lang.AssertionError: expected:<0> but was:<4> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyIsHot.testDatanodeRestarts(TestStandbyIsHot.java:188) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
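For reference, the polling pattern under discussion, sketched with hadoop-common's {{GenericTestUtils.waitFor}}; the condition below is a stand-in, not the actual check in TestStandbyIsHot:
{code}
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;

class WaitForSketch {
  /** Poll every 500 ms, for up to 10 s, instead of a fixed Thread.sleep(3000). */
  static void waitUntilZero(final AtomicInteger pendingReplicas)
      throws TimeoutException, InterruptedException {
    GenericTestUtils.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return pendingReplicas.get() == 0; // re-evaluated on every poll
      }
    }, 500, 10000);
  }
}
{code}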
[jira] [Commented] (HDFS-7865) NullPointerException in SimulatedFSDataset
[ https://issues.apache.org/jira/browse/HDFS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654775#comment-14654775 ] Brahma Reddy Battula commented on HDFS-7865: HDFS-7990 already fixed this issue. NullPointerException in SimulatedFSDataset -- Key: HDFS-7865 URL: https://issues.apache.org/jira/browse/HDFS-7865 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Tsz Wo Nicholas Sze Assignee: Brahma Reddy Battula Priority: Minor https://builds.apache.org/job/PreCommit-HDFS-Build/9690//testReport/org.apache.hadoop.hdfs.server.balancer/TestBalancer/testUnknownDatanode/ {noformat} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset$BInfo.access$400(SimulatedFSDataset.java:126) at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getPinning(SimulatedFSDataset.java:1319) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:969) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:244) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:249) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8772) fix TestStandbyIsHot#testDatanodeRestarts which occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654669#comment-14654669 ] Brahma Reddy Battula commented on HDFS-8772: The actual testcase may cost 4.4 sec if the jenkins machine is overloaded; I think bumping the wait time to 10 sec is sufficient. fix TestStandbyIsHot#testDatanodeRestarts which occasionally fails Key: HDFS-8772 URL: https://issues.apache.org/jira/browse/HDFS-8772 Project: Hadoop HDFS Issue Type: Bug Reporter: Walter Su Assignee: Walter Su Attachments: HDFS-8772.01.patch, HDFS-8772.02.patch, HDFS-8772.03.patch https://builds.apache.org/job/PreCommit-HDFS-Build/11596/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11598/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11599/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11600/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11606/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11608/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11612/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11618/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11650/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11655/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11659/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11663/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11664/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11667/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11669/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11676/testReport/ https://builds.apache.org/job/PreCommit-HDFS-Build/11677/testReport/ {noformat} java.lang.AssertionError: expected:<0> but was:<4> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyIsHot.testDatanodeRestarts(TestStandbyIsHot.java:188) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8827) Erasure Coding: When namenode processes over replicated striped block, NPE will be occur in ReplicationMonitor
[ https://issues.apache.org/jira/browse/HDFS-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654707#comment-14654707 ] Takuya Fukudome commented on HDFS-8827: --- Could you check the surefire-reports log files that mvn test generated? I ran {{mvn test -Dtest=org.apache.hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks}} and found the NPE in {{hadoop-hdfs/target/surefire-reports/org.apache.hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks-output.txt}}. If there is a better way to write this test, please let me know. Thank you. Erasure Coding: When namenode processes over replicated striped block, NPE will be occur in ReplicationMonitor -- Key: HDFS-8827 URL: https://issues.apache.org/jira/browse/HDFS-8827 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Takuya Fukudome Assignee: Takuya Fukudome Attachments: HDFS-8827.1.patch, processing-over-replica-npe.log In our test cluster, when the namenode processed over-replicated striped blocks, a null pointer exception (NPE) occurred. This happened in the following situation: 1) some datanodes shut down. 2) the namenode recovers block groups which lost internal blocks. 3) the stopped datanodes restart. 4) the namenode processes over-replicated striped blocks. 5) the NPE occurs. I think BlockPlacementPolicyDefault#chooseReplicaToDelete will return null in this situation, which causes this NPE problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8849) fsck should report number of missing blocks with replication factor 1
[ https://issues.apache.org/jira/browse/HDFS-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654706#comment-14654706 ] Allen Wittenauer commented on HDFS-8849: bq. If a DN then goes offline that was containing some TeraSort output, then blocks appear missing and users get concerned because they see missing blocks on the NN web UI and via dfsadmin -report/fsck, but it's not obvious that those blocks were in fact set to replication factor 1. You don't have to explain this issue to me because, see, I've actually supported the same Hadoop systems long term and not just visiting every-so-often. In fact, let me explain something to you: try this experiment with min repl=2 and see what happens. But I'll save you some time: if you think min repl=1 is confusing, min repl=2 is worse. fsck pre-the min repl message gives you exactly *zero* direct information. You end up having to do a lot of funky math and counting messages in the fsck missing block output to figure out what is going on. Because the summary says all blocks are accounted for and healthy but the NN won't come out of safemode. Unless you know the NN is waiting for these blocks to appear, it's pure panic. Now because we actually hit this issue, on machines we run and support, I actually filed the JIRA to get the min repl block count missing message added. So I'm *intimately* familiar with the issue. It's not 3rd hand from a 2nd tier support person or from a random JIRA issue. It's not theoretical. The message that fsck pumps out (at least in trunk; I don't follow branch-2) gives *exactly* the information an ops person needs to know that X blocks are below that minimal replication number, whether it be 1, 2, or 10. They can take that information and know how many blocks they are on the hunt for and if the fsck reports healthy, they know they can force it out of safemode and let the NN do the replication itself. ... and let's be clear: the vast majority of people who are running fsck are operations people and they are almost certainly doing it as either part of their maintenance or when stuff breaks. Ignoring the 2 people in a garage scenario, the vast majority of users are completely ignorant about fsck. They are almost certainly completely unaware that the tool exists and go running to the ops team if Hadoop is down. bq. Separately, using phrases like Meanwhile, back in real life and calling a proposed improvement a useless feature is not an appropriate way to communicate in this forum. I'm sticking with those comments unless you can give an example that isn't teragen. Because my real world, not in a lab, talking with users and operations folks on a regular basis experience says a purposefully set repl=1 that isn't teragen is almost always about avoiding quota. teragen has *always* been a bad actor on the system and we're MUCH better off setting the default min repl 2. Yes, this will likely break QA and single node test systems. We *seriously* need to get past this idea that we expect production people to change our idiotic defaults because it's inconvenient for builds that will only be up for a few hours. fsck should report number of missing blocks with replication factor 1 - Key: HDFS-8849 URL: https://issues.apache.org/jira/browse/HDFS-8849 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 2.7.1 Reporter: Zhe Zhang Assignee: Zhe Zhang Priority: Minor HDFS-7165 supports reporting number of blocks with replication factor 1 in {{dfsadmin}} and NN metrics. 
But it didn't extend {{fsck}} with the same support; adding that is the aim of this JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-1148) Convert FSDataset to ReadWriteLock
[ https://issues.apache.org/jira/browse/HDFS-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yong Zhang updated HDFS-1148: - Status: Patch Available (was: Open) Convert FSDataset to ReadWriteLock -- Key: HDFS-1148 URL: https://issues.apache.org/jira/browse/HDFS-1148 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, performance Reporter: Todd Lipcon Assignee: Yong Zhang Attachments: HDFS-1148.001.patch, hdfs-1148-old.txt, hdfs-1148-trunk.txt, patch-HDFS-1148-rel0.20.2.txt In benchmarking HDFS-941 I noticed that for the random read workload, the FSDataset lock is highly contended. After converting it to a ReentrantReadWriteLock, I saw a ~25% improvement on both latency and ops/second. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8127) NameNode Failover during HA upgrade can cause DataNode to finalize upgrade
[ https://issues.apache.org/jira/browse/HDFS-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654360#comment-14654360 ] Chris Trezzo commented on HDFS-8127: Note that this patch uses the checkNNStartup() method added in HDFS-3443. NameNode Failover during HA upgrade can cause DataNode to finalize upgrade -- Key: HDFS-8127 URL: https://issues.apache.org/jira/browse/HDFS-8127 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Blocker Labels: 2.6.1-candidate Fix For: 2.7.1 Attachments: HDFS-8127.000.patch, HDFS-8127.001.patch Currently for HA upgrade (enabled by HDFS-5138), we use {{-bootstrapStandby}} to initialize the standby NameNode. The standby NameNode does not have the {{previous}} directory thus it does not know that the cluster is in the upgrade state. If NN failover happens, as response of block reports, the new ANN will tell DNs to finalize the upgrade thus make it impossible to rollback again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8550) Erasure Coding: Fix FindBugs Multithreaded correctness Warning
[ https://issues.apache.org/jira/browse/HDFS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654475#comment-14654475 ] Zhe Zhang commented on HDFS-8550: - Please use this [link | https://builds.apache.org/job/PreCommit-HDFS-Build/11886/artifact/patchprocess/patchFindbugsWarningshadoop-hdfs.html]. Does it give any more hint on how to fix the warnings? Erasure Coding: Fix FindBugs Multithreaded correctness Warning -- Key: HDFS-8550 URL: https://issues.apache.org/jira/browse/HDFS-8550 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Assignee: Rakesh R Findbug warning:- Inconsistent synchronization of org.apache.hadoop.hdfs.DFSOutputStream.streamer; locked 89% of time {code} Bug type IS2_INCONSISTENT_SYNC (click for details) In class org.apache.hadoop.hdfs.DFSOutputStream Field org.apache.hadoop.hdfs.DFSOutputStream.streamer Synchronized 89% of the time Unsynchronized access at DFSOutputStream.java:[line 146] Unsynchronized access at DFSOutputStream.java:[line 859] Unsynchronized access at DFSOutputStream.java:[line 627] Unsynchronized access at DFSOutputStream.java:[line 630] Unsynchronized access at DFSOutputStream.java:[line 640] Unsynchronized access at DFSOutputStream.java:[line 342] Unsynchronized access at DFSOutputStream.java:[line 744] Unsynchronized access at DFSOutputStream.java:[line 903] Synchronized access at DFSOutputStream.java:[line 737] Synchronized access at DFSOutputStream.java:[line 913] Synchronized access at DFSOutputStream.java:[line 726] Synchronized access at DFSOutputStream.java:[line 756] Synchronized access at DFSOutputStream.java:[line 762] Synchronized access at DFSOutputStream.java:[line 757] Synchronized access at DFSOutputStream.java:[line 758] Synchronized access at DFSOutputStream.java:[line 762] Synchronized access at DFSOutputStream.java:[line 483] Synchronized access at DFSOutputStream.java:[line 486] Synchronized access at DFSOutputStream.java:[line 717] Synchronized access at DFSOutputStream.java:[line 719] Synchronized access at DFSOutputStream.java:[line 722] Synchronized access at DFSOutputStream.java:[line 408] Synchronized access at DFSOutputStream.java:[line 408] Synchronized access at DFSOutputStream.java:[line 423] Synchronized access at DFSOutputStream.java:[line 426] Synchronized access at DFSOutputStream.java:[line 411] Synchronized access at DFSOutputStream.java:[line 452] Synchronized access at DFSOutputStream.java:[line 452] Synchronized access at DFSOutputStream.java:[line 439] Synchronized access at DFSOutputStream.java:[line 439] Synchronized access at DFSOutputStream.java:[line 439] Synchronized access at DFSOutputStream.java:[line 670] Synchronized access at DFSOutputStream.java:[line 580] Synchronized access at DFSOutputStream.java:[line 574] Synchronized access at DFSOutputStream.java:[line 592] Synchronized access at DFSOutputStream.java:[line 583] Synchronized access at DFSOutputStream.java:[line 581] Synchronized access at DFSOutputStream.java:[line 621] Synchronized access at DFSOutputStream.java:[line 609] Synchronized access at DFSOutputStream.java:[line 621] Synchronized access at DFSOutputStream.java:[line 597] Synchronized access at DFSOutputStream.java:[line 612] Synchronized access at DFSOutputStream.java:[line 597] Synchronized access at DFSOutputStream.java:[line 588] Synchronized access at DFSOutputStream.java:[line 624] Synchronized access at DFSOutputStream.java:[line 612] Synchronized access at DFSOutputStream.java:[line 588] Synchronized access at 
DFSOutputStream.java:[line 632] Synchronized access at DFSOutputStream.java:[line 632] Synchronized access at DFSOutputStream.java:[line 616] Synchronized access at DFSOutputStream.java:[line 633] Synchronized access at DFSOutputStream.java:[line 657] Synchronized access at DFSOutputStream.java:[line 658] Synchronized access at DFSOutputStream.java:[line 695] Synchronized access at DFSOutputStream.java:[line 698] Synchronized access at DFSOutputStream.java:[line 784] Synchronized access at DFSOutputStream.java:[line 795] Synchronized access at DFSOutputStream.java:[line 801] Synchronized access at DFSOutputStream.java:[line 155] Synchronized access at DFSOutputStream.java:[line 158] Synchronized access at DFSOutputStream.java:[line 433] Synchronized access at DFSOutputStream.java:[line 886] Synchronized access at DFSOutputStream.java:[line 463] Synchronized access at DFSOutputStream.java:[line 469] Synchronized access at DFSOutputStream.java:[line 463] Synchronized access at DFSOutputStream.java:[line 470] Synchronized access at DFSOutputStream.java:[line 465] Synchronized access at DFSOutputStream.java:[line 749] Synchronized access at
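The generic fix for an IS2_INCONSISTENT_SYNC warning is to route every access to the flagged field through synchronized accessors, sketched below in outline form only (the usual pattern, not the HDFS-8550 patch):
{code}
/** Generic IS2_INCONSISTENT_SYNC fix: no bare reads of the guarded field. */
class StreamerHolder {
  private Object streamer; // guarded by "this" on every access

  synchronized Object getStreamer() {
    return streamer;
  }

  synchronized void setStreamer(Object s) {
    streamer = s;
  }
  // Every other method calls getStreamer()/setStreamer() instead of touching
  // the field directly, so FindBugs sees 100% synchronized access.
}
{code}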
[jira] [Commented] (HDFS-3443) Fix NPE when namenode transition to active during startup by adding checkNNStartup() in NameNodeRpcServer
[ https://issues.apache.org/jira/browse/HDFS-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654529#comment-14654529 ] Chris Trezzo commented on HDFS-3443: Note: Even though this is listed as a 2.6.1-candidate it is actually already checked into branch-2.6. Fix NPE when namenode transition to active during startup by adding checkNNStartup() in NameNodeRpcServer - Key: HDFS-3443 URL: https://issues.apache.org/jira/browse/HDFS-3443 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Reporter: suja s Assignee: Vinayakumar B Labels: 2.6.1-candidate Fix For: 2.6.1 Attachments: HDFS-3443-003.patch, HDFS-3443-004.patch, HDFS-3443-005.patch, HDFS-3443-006.patch, HDFS-3443-007.patch, HDFS-3443_1.patch, HDFS-3443_1.patch Start NN Let NN standby services be started. Before the editLogTailer is initialised start ZKFC and allow the activeservices start to proceed further. Here editLogTailer.catchupDuringFailover() will throw NPE. void startActiveServices() throws IOException { LOG.info(Starting services required for active state); writeLock(); try { FSEditLog editLog = dir.fsImage.getEditLog(); if (!editLog.isOpenForWrite()) { // During startup, we're already open for write during initialization. editLog.initJournalsForWrite(); // May need to recover editLog.recoverUnclosedStreams(); LOG.info(Catching up to latest edits from old active before + taking over writer role in edits logs.); editLogTailer.catchupDuringFailover(); {noformat} 2012-05-18 16:51:27,585 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from XX.XX.XX.55:58003: output error 2012-05-18 16:51:27,586 WARN org.apache.hadoop.ipc.Server: IPC Server handler 8 on 8020, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive from XX.XX.XX.55:58004: error: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:602) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287) at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61) at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49) at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978) at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686) 2012-05-18 16:51:27,586 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception 
java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2092) at org.apache.hadoop.ipc.Server.access$2000(Server.java:107) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-3443) Fix NPE when namenode transition to active during startup by adding checkNNStartup() in NameNodeRpcServer
[ https://issues.apache.org/jira/browse/HDFS-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-3443: Labels: (was: 2.6.1-candidate) Fix NPE when namenode transition to active during startup by adding checkNNStartup() in NameNodeRpcServer - Key: HDFS-3443 URL: https://issues.apache.org/jira/browse/HDFS-3443 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Reporter: suja s Assignee: Vinayakumar B Fix For: 2.6.1 Attachments: HDFS-3443-003.patch, HDFS-3443-004.patch, HDFS-3443-005.patch, HDFS-3443-006.patch, HDFS-3443-007.patch, HDFS-3443_1.patch, HDFS-3443_1.patch Start NN Let NN standby services be started. Before the editLogTailer is initialised start ZKFC and allow the activeservices start to proceed further. Here editLogTailer.catchupDuringFailover() will throw NPE. void startActiveServices() throws IOException { LOG.info(Starting services required for active state); writeLock(); try { FSEditLog editLog = dir.fsImage.getEditLog(); if (!editLog.isOpenForWrite()) { // During startup, we're already open for write during initialization. editLog.initJournalsForWrite(); // May need to recover editLog.recoverUnclosedStreams(); LOG.info(Catching up to latest edits from old active before + taking over writer role in edits logs.); editLogTailer.catchupDuringFailover(); {noformat} 2012-05-18 16:51:27,585 WARN org.apache.hadoop.ipc.Server: IPC Server Responder, call org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from XX.XX.XX.55:58003: output error 2012-05-18 16:51:27,586 WARN org.apache.hadoop.ipc.Server: IPC Server handler 8 on 8020, call org.apache.hadoop.ha.HAServiceProtocol.transitionToActive from XX.XX.XX.55:58004: error: java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:602) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1287) at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61) at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:63) at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49) at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1219) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:978) at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107) at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:3633) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686) 2012-05-18 16:51:27,586 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133) at 
sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2092) at org.apache.hadoop.ipc.Server.access$2000(Server.java:107) at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:930) at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:994) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1738) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8127) NameNode Failover during HA upgrade can cause DataNode to finalize upgrade
[ https://issues.apache.org/jira/browse/HDFS-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654363#comment-14654363 ] Chris Trezzo commented on HDFS-8127: (for backport to 2.6 branch) NameNode Failover during HA upgrade can cause DataNode to finalize upgrade -- Key: HDFS-8127 URL: https://issues.apache.org/jira/browse/HDFS-8127 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.4.0 Reporter: Jing Zhao Assignee: Jing Zhao Priority: Blocker Labels: 2.6.1-candidate Fix For: 2.7.1 Attachments: HDFS-8127.000.patch, HDFS-8127.001.patch Currently for HA upgrade (enabled by HDFS-5138), we use {{-bootstrapStandby}} to initialize the standby NameNode. The standby NameNode does not have the {{previous}} directory thus it does not know that the cluster is in the upgrade state. If NN failover happens, as response of block reports, the new ANN will tell DNs to finalize the upgrade thus make it impossible to rollback again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8399) Erasure Coding: unit test the behaviour of BlockManager recovery work for the deleted blocks
[ https://issues.apache.org/jira/browse/HDFS-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhe Zhang updated HDFS-8399: Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: HDFS-7285 Status: Resolved (was: Patch Available) Thanks Rakesh! +1 on the latest patch. I just committed to the branch. Erasure Coding: unit test the behaviour of BlockManager recovery work for the deleted blocks Key: HDFS-8399 URL: https://issues.apache.org/jira/browse/HDFS-8399 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Rakesh R Assignee: Rakesh R Labels: Test Fix For: HDFS-7285 Attachments: HDFS-8399-HDFS-7285-00.patch, HDFS-8399-HDFS-7285-01.patch, HDFS-8399-HDFS-7285-02.patch Following exception occurred in the {{ReplicationMonitor}}. As per the initial analysis, I could see the exception is coming for the blocks of the deleted file. {code} 2015-05-14 14:14:40,485 FATAL util.ExitUtil (ExitUtil.java:terminate(127)) - Terminate called org.apache.hadoop.util.ExitUtil$ExitException: java.lang.AssertionError: Absolute path required at org.apache.hadoop.hdfs.server.namenode.INode.getPathNames(INode.java:744) at org.apache.hadoop.hdfs.server.namenode.INode.getPathComponents(INode.java:723) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getINodesInPath(FSDirectory.java:1655) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getECSchemaForPath(FSNamesystem.java:8435) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeRecoveryWorkForBlocks(BlockManager.java:1572) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockRecoveryWork(BlockManager.java:1402) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3894) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3846) at java.lang.Thread.run(Thread.java:722) at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126) at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3865) at java.lang.Thread.run(Thread.java:722) Exception in thread org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor@1255079 org.apache.hadoop.util.ExitUtil$ExitException: java.lang.AssertionError: Absolute path required at org.apache.hadoop.hdfs.server.namenode.INode.getPathNames(INode.java:744) at org.apache.hadoop.hdfs.server.namenode.INode.getPathComponents(INode.java:723) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.getINodesInPath(FSDirectory.java:1655) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getECSchemaForPath(FSNamesystem.java:8435) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeRecoveryWorkForBlocks(BlockManager.java:1572) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockRecoveryWork(BlockManager.java:1402) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3894) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3846) at java.lang.Thread.run(Thread.java:722) at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:126) at org.apache.hadoop.util.ExitUtil.terminate(ExitUtil.java:170) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3865) at java.lang.Thread.run(Thread.java:722) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654426#comment-14654426 ] Nathan Roberts commented on HDFS-8791: -- My preference would be to take a smaller incremental step. How about: - New layout where n x m levels are configurable (today 256x256) - n x m is recorded in version file - Upgrade path is taken if configured n x m is different from n x m in VERSION file Seems like most of the code will work without too much modification (and the risk that comes with it). I fear if we try to take too much of a step at this point, it will take significant time to settle on the new layout, and then it will end up being either extremely close to what we have now OR it will be radically different and require a lot of investment of time and resources to even get there. In other words, I think we need a short term layout change that is low-risk and quick to integrate. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout basically causes the disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the HDFS blocks themselves. The main problem is the 256*256 directory blocks. inodes and dentries will be cached by linux and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks and I'm not aware of any way to tell linux to favor buffer cache pages (even if it did I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume; this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A DU when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next DU to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks which would mean single digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified are quite high, which keeps those blocks hot and much less likely to be evicted. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
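A sketch of the incremental step proposed in the comment above, with a configurable fan-out in place of the hard-coded 256x256 (names are hypothetical, not the shipped patch): record n x m in the VERSION file, trigger the upgrade path whenever the configured pair differs from the stored one, and derive each block's directory from its ID just as the 256x256 scheme does.
{code}
import java.io.File;

/** Hypothetical configurable two-level layout (today: 256 x 256). */
class ConfigurableBlockLayout {
  private final int level1Dirs; // n, recorded in the VERSION file
  private final int level2Dirs; // m, recorded in the VERSION file

  ConfigurableBlockLayout(int level1Dirs, int level2Dirs) {
    this.level1Dirs = level1Dirs;
    this.level2Dirs = level2Dirs;
  }

  /** Hash the block ID into n x m leaf directories; e.g. 32 x 32 would
   *  cut the directory-block count from 64K to 1K. */
  File idToBlockDir(File bpRoot, long blockId) {
    int d1 = (int) ((blockId >>> 16) % level1Dirs);
    int d2 = (int) ((blockId >>> 8) % level2Dirs);
    return new File(bpRoot, "subdir" + d1 + File.separator + "subdir" + d2);
  }
}
{code}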
[jira] [Commented] (HDFS-8833) Erasure coding: store EC schema and cell size in INodeFile and eliminate notion of EC zones
[ https://issues.apache.org/jira/browse/HDFS-8833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654521#comment-14654521 ] Zhe Zhang commented on HDFS-8833: - Please note that HDFS-8854 has been created to tackle the {{ErasureCodingPolicy}} change (point 1 from the above list). Erasure coding: store EC schema and cell size in INodeFile and eliminate notion of EC zones --- Key: HDFS-8833 URL: https://issues.apache.org/jira/browse/HDFS-8833 Project: Hadoop HDFS Issue Type: Sub-task Components: namenode Affects Versions: HDFS-7285 Reporter: Zhe Zhang Assignee: Zhe Zhang We have [discussed | https://issues.apache.org/jira/browse/HDFS-7285?focusedCommentId=14357754page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14357754] storing EC schema with files instead of EC zones and recently revisited the discussion under HDFS-8059. As a recap, the _zone_ concept has severe limitations including renaming and nested configuration. Those limitations are valid in encryption for security reasons and it doesn't make sense to carry them over in EC. This JIRA aims to store EC schema and cell size on {{INodeFile}} level. For simplicity, we should first implement it as an xattr and consider memory optimizations (such as moving it to file header) as a follow-on. We should also disable changing EC policy on a non-empty file / dir in the first phase. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8804) Erasure Coding: use DirectBufferPool in DFSStripedInputStream for buffer allocation
[ https://issues.apache.org/jira/browse/HDFS-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654496#comment-14654496 ] Zhe Zhang commented on HDFS-8804: - Thanks Jing, sounds good to me. I'll try to do a follow-on to add some more checking in how {{StripedBlockUtil}} allocates {{buf}} to chunks. Erasure Coding: use DirectBufferPool in DFSStripedInputStream for buffer allocation --- Key: HDFS-8804 URL: https://issues.apache.org/jira/browse/HDFS-8804 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jing Zhao Assignee: Jing Zhao Fix For: HDFS-7285 Attachments: HDFS-8804.000.patch, HDFS-8804.001.patch Currently we directly allocate direct ByteBuffers in DFSStripedInputStream for the stripe buffer and the buffers holding parity data. It's better to get ByteBuffers from DirectBufferPool. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
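For context, hadoop-common's {{org.apache.hadoop.util.DirectBufferPool}} recycles direct buffers rather than allocating fresh native memory on every stream open. A sketch of how the stripe buffer could be borrowed and returned (the helper class and its methods are hypothetical):
{code}
import java.nio.ByteBuffer;
import org.apache.hadoop.util.DirectBufferPool;

/** Hypothetical helper around the shared pool. */
class StripeBuffers {
  // One static pool so repeated stream opens reuse direct buffers instead
  // of allocating (and slowly reclaiming) fresh native memory each time.
  private static final DirectBufferPool BUFFER_POOL = new DirectBufferPool();

  static ByteBuffer borrowStripeBuffer(int cellSize, int dataBlkNum) {
    return BUFFER_POOL.getBuffer(cellSize * dataBlkNum);
  }

  static void release(ByteBuffer buf) {
    BUFFER_POOL.returnBuffer(buf); // hand the buffer back on stream close
  }
}
{code}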
[jira] [Commented] (HDFS-8827) Erasure Coding: When namenode processes over replicated striped block, NPE will be occur in ReplicationMonitor
[ https://issues.apache.org/jira/browse/HDFS-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654793#comment-14654793 ] Walter Su commented on HDFS-8827: - I saw the NPE and AssertionError. It's not a bug on the NN side; it's a bug in DFSTestUtil.createStripedFile(..): you hope to create a file with len=cellSize * 5, but the function actually creates a full stripe with len=cellSize * 6. Erasure Coding: When namenode processes over replicated striped block, NPE will be occur in ReplicationMonitor -- Key: HDFS-8827 URL: https://issues.apache.org/jira/browse/HDFS-8827 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Takuya Fukudome Assignee: Takuya Fukudome Attachments: HDFS-8827.1.patch, processing-over-replica-npe.log In our test cluster, when the namenode processed over-replicated striped blocks, a null pointer exception (NPE) occurred. This happened in the following situation: 1) some datanodes shut down. 2) the namenode recovers block groups which lost internal blocks. 3) the stopped datanodes restart. 4) the namenode processes over-replicated striped blocks. 5) the NPE occurs. I think BlockPlacementPolicyDefault#chooseReplicaToDelete will return null in this situation, which causes this NPE problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8815) DFS getStoragePolicy implementation using single RPC call
[ https://issues.apache.org/jira/browse/HDFS-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654807#comment-14654807 ] Surendra Singh Lilhore commented on HDFS-8815: -- [~vinayrpet] and [~arpitagarwal] could you please review? The failed test case is unrelated... DFS getStoragePolicy implementation using single RPC call - Key: HDFS-8815 URL: https://issues.apache.org/jira/browse/HDFS-8815 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.8.0 Reporter: Arpit Agarwal Assignee: Surendra Singh Lilhore Attachments: HDFS-8815-001.patch, HDFS-8815-002.patch, HDFS-8815-003.patch HADOOP-12161 introduced a new {{FileSystem#getStoragePolicy}} call. The DFS implementation of the call requires two RPC calls, the first to fetch the storage policy ID and the second to fetch the policy suite to map the policy ID to a {{BlockStoragePolicySpi}}. Fix the implementation to require a single RPC call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8827) Erasure Coding: When namenode processes over replicated striped block, NPE will be occur in ReplicationMonitor
[ https://issues.apache.org/jira/browse/HDFS-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654834#comment-14654834 ] Walter Su commented on HDFS-8827: - DFSTestUtil.createStripedFile(..) was used earlier because DFSStripedOutputStream was not finished yet. Now you can just use DFSTestUtil.createFile(..), but after that you have to figure out the empty DNs from the LocatedBlock so you can inject the redundant blocks into those empty DNs. Erasure Coding: When namenode processes over replicated striped block, NPE will be occur in ReplicationMonitor -- Key: HDFS-8827 URL: https://issues.apache.org/jira/browse/HDFS-8827 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Takuya Fukudome Assignee: Takuya Fukudome Attachments: HDFS-8827.1.patch, processing-over-replica-npe.log In our test cluster, when the namenode processed over-replicated striped blocks, a null pointer exception (NPE) occurred. This happened in the following situation: 1) some datanodes shut down. 2) the namenode recovers block groups which lost internal blocks. 3) the stopped datanodes restart. 4) the namenode processes over-replicated striped blocks. 5) the NPE occurs. I think BlockPlacementPolicyDefault#chooseReplicaToDelete will return null in this situation, which causes this NPE problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
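A sketch of the "figure out the empty DNs" step from the comment above (an illustrative helper assuming the standard {{LocatedBlock}} and {{DataNode}} accessors; not part of any attached patch):
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

class EmptyDnFinder {
  /** Datanodes in the cluster holding no internal block of this group. */
  static List<DataNode> findEmptyDns(List<DataNode> allDns,
      LocatedBlock blockGroup) {
    Set<String> used = new HashSet<>();
    for (DatanodeInfo info : blockGroup.getLocations()) {
      used.add(info.getXferAddr()); // host:port of each occupied DN
    }
    List<DataNode> empty = new ArrayList<>();
    for (DataNode dn : allDns) {
      if (!used.contains(dn.getDatanodeId().getXferAddr())) {
        empty.add(dn); // candidate target for a redundant internal block
      }
    }
    return empty;
  }
}
{code}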
[jira] [Commented] (HDFS-8815) DFS getStoragePolicy implementation using single RPC call
[ https://issues.apache.org/jira/browse/HDFS-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654838#comment-14654838 ] Vinayakumar B commented on HDFS-8815: - Just one nit: {{private BlockStoragePolicy[] policies;}} seems unused. +1 once addressed. [~arpitagarwal], do you want to take a look? DFS getStoragePolicy implementation using single RPC call - Key: HDFS-8815 URL: https://issues.apache.org/jira/browse/HDFS-8815 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.8.0 Reporter: Arpit Agarwal Assignee: Surendra Singh Lilhore Attachments: HDFS-8815-001.patch, HDFS-8815-002.patch, HDFS-8815-003.patch HADOOP-12161 introduced a new {{FileSystem#getStoragePolicy}} call. The DFS implementation of the call requires two RPC calls, the first to fetch the storage policy ID and the second to fetch the policy suite to map the policy ID to a {{BlockStoragePolicySpi}}. Fix the implementation to require a single RPC call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8857) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure
Li Bo created HDFS-8857: --- Summary: Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure Key: HDFS-8857 URL: https://issues.apache.org/jira/browse/HDFS-8857 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8857) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure
[ https://issues.apache.org/jira/browse/HDFS-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8857: Attachment: HDFS-8857-HDFS-7285-001.patch Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure - Key: HDFS-8857 URL: https://issues.apache.org/jira/browse/HDFS-8857 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8857-HDFS-7285-001.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8857) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure
[ https://issues.apache.org/jira/browse/HDFS-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Bo updated HDFS-8857: Status: Patch Available (was: Open) Erasure Coding: Fix ArrayIndexOutOfBoundsException in TestWriteStripedFileWithFailure - Key: HDFS-8857 URL: https://issues.apache.org/jira/browse/HDFS-8857 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8857-HDFS-7285-001.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8852) HDFS architecture documentation of version 2.x is outdated about append write support
[ https://issues.apache.org/jira/browse/HDFS-8852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-8852: Labels: newbie (was: ) HDFS architecture documentation of version 2.x is outdated about append write support - Key: HDFS-8852 URL: https://issues.apache.org/jira/browse/HDFS-8852 Project: Hadoop HDFS Issue Type: Bug Components: documentation Reporter: Hong Dai Thanh Labels: newbie In the [latest version of the documentation|http://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Simple_Coherency_Model], and also in the documentation for all version 2 releases, it’s mentioned that “A file once created, written, and closed need not be changed.” and “There is a plan to support appending-writes to files in the future.” However, as far as I know, HDFS has supported append write since 0.21, based on [HDFS-265|https://issues.apache.org/jira/browse/HDFS-265] and [the old version of the documentation in 2012|https://web.archive.org/web/20121221171824/http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html#Appending-Writes+and+File+Syncs]. Various posts on the Internet also suggest that append write has been available in HDFS, and will always be available in the Hadoop version 2 branch. Can we update the documentation to reflect the current status? (Please also review whether the documentation should be updated for version 0.21 and above, and for the version 1.x branch.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653974#comment-14653974 ] Nathan Roberts commented on HDFS-8791: -- Curious what folks would think about going back to the previous layout? I understand there was some benefit to the new layout, but maybe there are nearly equivalent and less-intrusive ways to achieve the same benefits. I'm confident the current layout is going to cause significant performance issues for HDFS, and latency-sensitive applications (e.g. Hbase) are going to feel this in a big way. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout basically causes the datanode's disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool, and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the HDFS blocks themselves. The main problem is the 256*256 directory blocks. Inodes and dentries will be cached by Linux, and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks, and I'm not aware of any way to tell Linux to favor buffer cache pages (even if it did, I'm not sure I would want it to in general). Also, ext4 tries hard to spread directories evenly across the entire volume; this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du-type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A du when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next du to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have, but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks, which would mean single-digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified are quite high; this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
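To make the 64K-leaf-directory arithmetic above concrete, here is a self-contained sketch (illustrative names, not the actual DatanodeUtil code) of how a block ID can map to the two subdir levels in the 256x256 layout described:
{code}
// Two bytes of the block ID pick the two subdir levels, yielding
// 256*256 = 64K leaf directories per block pool.
import java.io.File;

public class BlockDirSketch {
  static File idToBlockDir(File finalizedRoot, long blockId) {
    int d1 = (int) ((blockId >> 16) & 0xff);  // first-level subdir
    int d2 = (int) ((blockId >> 8) & 0xff);   // second-level subdir
    return new File(finalizedRoot,
        "subdir" + d1 + File.separator + "subdir" + d2);
  }

  public static void main(String[] args) {
    // Every block lands in one of 64K leaf dirs; a cold scan must touch
    // each leaf's directory block, hence the seek storm described above.
    System.out.println(idToBlockDir(new File("finalized"), 1073741825L));
  }
}
{code}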
[jira] [Updated] (HDFS-7533) Datanode sometimes does not shutdown on receiving upgrade shutdown command
[ https://issues.apache.org/jira/browse/HDFS-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated HDFS-7533: --- Attachment: HDFS-7533-branch-2.6-v1.patch Attached is a patch for branch 2.6. This was a trivial backport. I ran TestDataNodeExit and it passed. Datanode sometimes does not shutdown on receiving upgrade shutdown command -- Key: HDFS-7533 URL: https://issues.apache.org/jira/browse/HDFS-7533 Project: Hadoop HDFS Issue Type: Bug Reporter: Kihwal Lee Assignee: Eric Payne Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: HDFS-7533-branch-2.6-v1.patch, HDFS-7533.v1.txt When the datanode is told to shut down via the dfsadmin command during a rolling upgrade, it may not shut down. This is because not all writers have a responder running, but sendOOB() tries anyway. This causes an NPE, and the shutdown thread dies, halting the shutdown after only shutting down the DataXceiverServer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
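For illustration, a minimal sketch of the guard the description implies (the field and method names are illustrative, not the real BlockReceiver/DataXceiver members):
{code}
public class OobShutdownSketch {
  private volatile Thread responder;     // null when no responder started

  void sendOOBForUpgrade() {
    Thread r = responder;
    if (r == null) {
      // Previously: an unconditional sendOOB() hit an NPE here, killing
      // the shutdown thread after only DataXceiverServer had stopped.
      return;
    }
    // ... deliver the OOB ack through the responder ...
  }
}
{code}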
[jira] [Commented] (HDFS-8747) Provide Better Scratch Space and Soft Delete Support for HDFS Encryption Zones
[ https://issues.apache.org/jira/browse/HDFS-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654141#comment-14654141 ] Andrew Wang commented on HDFS-8747: --- Hi Xiaoyu, Regarding snapshots, the idea would be to take snapshots on the EZ root. Snapshots are typically scheduled on a periodic basis, so not a manual operation. Basically a cron job that runs every hour to take a snapshot. I'll note also that filesystems with snapshot support like ZFS and WAFL aren't really used with trash, since snapshots are a superior solution for data recovery. Regarding nested encryption zones, one request we've had is setting / as an encryption zone, then subdirs as different EZs. This guarantees that all data in HDFS is encrypted, and gives the flexibility of using different EZ keys for subdirs if desired. Provide Better Scratch Space and Soft Delete Support for HDFS Encryption Zones -- Key: HDFS-8747 URL: https://issues.apache.org/jira/browse/HDFS-8747 Project: Hadoop HDFS Issue Type: Bug Components: encryption Affects Versions: 2.6.0 Reporter: Xiaoyu Yao Assignee: Xiaoyu Yao Attachments: HDFS-8747-07092015.pdf, HDFS-8747-07152015.pdf, HDFS-8747-07292015.pdf HDFS Transparent Data Encryption At-Rest was introduced in Hadoop 2.6 to allow creating an encryption zone on top of a single HDFS directory. Files under the root directory of the encryption zone will be encrypted/decrypted transparently upon HDFS client write or read operations. Generally, it does not support rename (without data copying) across encryption zones, or between an encryption zone and a non-encryption zone, because of the different security settings of encryption zones. However, there are certain use cases where efficient rename support is desired. This JIRA is to propose better support of two such use cases, “Scratch Space” (a.k.a. staging area) and “Soft Delete” (a.k.a. trash), with HDFS encryption zones. “Scratch Space” is widely used in Hadoop jobs, which requires efficient rename support. Temporary files from MR jobs are usually stored in a staging area outside the encryption zone, such as the “/tmp” directory, and then renamed to the target directories as specified once the data is ready to be further processed. Below is a summary of supported/unsupported cases from the latest Hadoop: * Rename within the encryption zone is supported. * Renaming the entire encryption zone by moving the root directory of the zone is allowed. * Renaming a sub-directory/file from an encryption zone to a non-encryption zone is not allowed. * Renaming a sub-directory/file from encryption zone A to encryption zone B is not allowed. * Renaming from a non-encryption zone to an encryption zone is not allowed. “Soft delete” (a.k.a. trash) is a client-side feature that helps prevent accidental deletion of files and directories. If trash is enabled and a file or directory is deleted using the Hadoop shell, the file is moved to the .Trash directory of the user's home directory instead of being deleted. Deleted files are initially moved (renamed) to the Current sub-directory of the .Trash directory with the original path preserved. Files and directories in the trash can be restored simply by moving them to a location outside the .Trash directory. Due to the limited rename support, deleting a sub-directory/file within an encryption zone with the trash feature enabled is not allowed. The client has to use the -skipTrash option to work around this. HADOOP-10902 and HDFS-6767 improved the error message, but without a complete solution to the problem.
We propose to solve the problem by generalizing the mapping between an encryption zone and its underlying HDFS directories from 1:1 today to 1:N. The encryption zone should allow non-overlapping directories, such as scratch space or soft-delete trash locations, to be added/removed dynamically after creation. This way, rename for scratch space and soft delete can be better supported without breaking the assumption that rename is only supported within the zone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks
[ https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653930#comment-14653930 ] Hudson commented on HDFS-8850: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #274 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/274/]) HDFS-8850. VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks. (Colin Patrick McCabe via yliu) (yliu: rev 9a0806824e8982ec3b47315f9b960474186a) * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks - Key: HDFS-8850 URL: https://issues.apache.org/jira/browse/HDFS-8850 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.7.0 Reporter: Colin Patrick McCabe Assignee: Colin Patrick McCabe Fix For: 2.8.0 Attachments: HDFS-8850.001.patch The VolumeScanner threads inside the BlockScanner exit with an exception if there is no block pool to be scanned but there are suspicious blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8855) Webhdfs client leaks active NameNode connections
[ https://issues.apache.org/jira/browse/HDFS-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654107#comment-14654107 ] Bob Hansen commented on HDFS-8855: -- Reproducer script:
{code}
#!/bin/bash

# Check that the hadoop command is available
hadoop fs -help > /dev/null 2> /dev/null
if [ $? != 0 ]; then
  echo "The hadoop command must be in your path"
  exit 1
fi

# segment, op=OPEN and offset are added to url_base
file_size=${file_size:-$[ 1024 * 1024 * 1024 ]}
count=${count:-100}
reads_per_pass=${reads_per_pass:-1000}
webhdfs_namenode=${webhdfs_namenode:-localhost:50070}
read_size=${read_size:-64000}
concurrent_reads=${concurrent_reads:-50}

url_base=http://$webhdfs_namenode/webhdfs/v1/tmp/bigfile_$$
passes=$[ $count / $reads_per_pass ]
url_list_file=/tmp/file_list_$$.txt
namenode=${namenode:-`echo $url_base | grep -Po '(?<=http://)[^:/]*'`}

echo "Environment settings:"
echo "  file_size=$file_size"
echo "  count=$count"
echo "  reads_per_pass=$reads_per_pass"
echo "  webhdfs_namenode=$webhdfs_namenode"
echo "  read_size=$read_size"
echo "  concurrent_reads=$concurrent_reads"
echo "Outputs in /tmp/curl_[out|err]_$$"
echo "Computed values:"
echo "  url_base=$url_base"
echo "  passes=$passes"
echo "  url_list_file=$url_list_file"
echo "  namenode=$namenode"
echo

echo "Copying temp data..."
blocks_to_copy=$[ ( $file_size + 1023 ) / 1024 ]
dd count=$blocks_to_copy bs=1024 if=/dev/zero | tr '\0' '+' | hadoop fs -copyFromLocal - /tmp/bigfile_$$

echo "Generating URL list..."
# Generate the load profile
rm -f $url_list_file
for j in `seq 1 $reads_per_pass`; do
  rand=$(od -N 4 -t uL -An /dev/urandom | tr -d ' ')
  offset=$[ ( $rand % (file_size / read_size) ) * read_size ]
  url="$url_base?op=OPEN&user.name=$USER&offset=$offset&length=$read_size"
  echo "url = \"$url\"" >> $url_list_file
done

# Open $concurrent_reads files and do $reads_per_pass random reads of $read_size
for i in `seq 1 $passes` ; do
  # Kick off concurrent random reads
  for k in `seq 1 $concurrent_reads`; do
    curl -v -L -K $url_list_file > /tmp/curl_out_$$-$k.txt 2> /tmp/curl_err_$$-$k.txt &
  done

  # Wait for all curl jobs to finish
  while [ `jobs | grep "Running.*curl" | wc -l` != 0 ]; do
    sleep 1s
    # Every second, count the connections on the webhdfs_namenode
    ssh $namenode "file=/tmp/netstat.out_\$\$ ; netstat -an > \$file ; echo -n 'ESTABLISHED: '; echo -n \`grep -c ESTABLISHED \$file\` ; echo -n ' TIME_WAIT: '; echo -n \`grep -c TIME_WAIT \$file\` ; echo -n ' CLOSE_WAIT: '; grep -c CLOSE_WAIT \$file; rm \$file"
    echo `grep "HTTP/1.1 [^23]" /tmp/curl_err_$$-* | wc -l` errors, `grep "HTTP/1.1 200" /tmp/curl_err_$$-* | wc -l` successes
  done

  # Display the completion time
  echo -n "Pass $i " ; date +%H:%M:%S.%N
  echo Total: `grep "HTTP/1.1 [^23]" /tmp/curl_err_$$-* | wc -l` errors, `grep "HTTP/1.1 200" /tmp/curl_err_$$-* | wc -l` successes
  # sleep $delay
done
{code}
Webhdfs client leaks active NameNode connections Key: HDFS-8855 URL: https://issues.apache.org/jira/browse/HDFS-8855 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Environment: HDP 2.2 Reporter: Bob Hansen The attached script simulates a process opening ~50 files via webhdfs and performing random reads. Note that there are at most 50 concurrent reads, and all webhdfs sessions are kept open. Each read is ~64k at a random position. The script periodically (once per second) shells into the NameNode and produces a summary of the socket states. For my test cluster with 5 nodes, it took ~30 seconds for the NameNode to have ~25000 active connections and fail.
It appears that each request to the webhdfs client is opening a new connection to the NameNode and keeping it open after the request is complete. If the process continues to run, eventually (~30-60 seconds), all of the open connections are closed and the NameNode recovers. This smells like SoftReference reaping. Are we using SoftReferences in the webhdfs client to cache NameNode connections but never re-using them? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HDFS-8855) Webhdfs client leaks active NameNode connections
Bob Hansen created HDFS-8855: Summary: Webhdfs client leaks active NameNode connections Key: HDFS-8855 URL: https://issues.apache.org/jira/browse/HDFS-8855 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Environment: HDP 2.2 Reporter: Bob Hansen The attached script simulates a process opening ~50 files via webhdfs and performing random reads. Note that there are at most 50 concurrent reads, and all webhdfs sessions are kept open. Each read is ~64k at a random position. The script periodically (once per second) shells into the NameNode and produces a summary of the socket states. For my test cluster with 5 nodes, it took ~30 seconds for the NameNode to have ~25000 active connections and fail. It appears that each request to the webhdfs client is opening a new connection to the NameNode and keeping it open after the request is complete. If the process continues to run, eventually (~30-60 seconds), all of the open connections are closed and the NameNode recovers. This smells like SoftReference reaping. Are we using SoftReferences in the webhdfs client to cache NameNode connections but never re-using them? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-8486: Attachment: HDFS-8486-branch-2.6.patch Patch for branch-2.6. Can someone familiar with the original change code-review it? DN startup may cause severe data loss - Key: HDFS-8486 URL: https://issues.apache.org/jira/browse/HDFS-8486 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 0.23.1, 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Labels: 2.6.1-candidate Fix For: 2.7.1 Attachments: HDFS-8486-branch-2.6.patch, HDFS-8486.patch, HDFS-8486.patch A race condition between block pool initialization and the directory scanner may cause a mass deletion of blocks in multiple storages. If block pool initialization finds a block on disk that is already in the replica map, it deletes one of the blocks based on size, GS, etc. Unfortunately it _always_ deletes one of the blocks even if they are identical, thus the replica map _must_ be empty when the pool is initialized. The directory scanner starts at a random time within its periodic interval (default 6h). If the scanner starts very early it races to populate the replica map, causing the block pool init to erroneously delete blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8486) DN startup may cause severe data loss
[ https://issues.apache.org/jira/browse/HDFS-8486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arpit Agarwal updated HDFS-8486: Attachment: HDFS-8486-branch-2.6.02.patch DN startup may cause severe data loss - Key: HDFS-8486 URL: https://issues.apache.org/jira/browse/HDFS-8486 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 0.23.1, 2.0.0-alpha Reporter: Daryn Sharp Assignee: Daryn Sharp Priority: Blocker Labels: 2.6.1-candidate Fix For: 2.7.1 Attachments: HDFS-8486-branch-2.6.02.patch, HDFS-8486-branch-2.6.patch, HDFS-8486.patch, HDFS-8486.patch A race condition between block pool initialization and the directory scanner may cause a mass deletion of blocks in multiple storages. If block pool initialization finds a block on disk that is already in the replica map, it deletes one of the blocks based on size, GS, etc. Unfortunately it _always_ deletes one of the blocks even if identical, thus the replica map _must_ be empty when the pool is initialized. The directory scanner starts at a random time within its periodic interval (default 6h). If the scanner starts very early it races to populate the replica map, causing the block pool init to erroneously delete blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8747) Provide Better Scratch Space and Soft Delete Support for HDFS Encryption Zones
[ https://issues.apache.org/jira/browse/HDFS-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654086#comment-14654086 ] Xiaoyu Yao commented on HDFS-8747: -- Thanks [~andrew.wang] and [~hitliuyi] for the feedback! Please help review the patch for HDFS-8830. We too have customers who want to use encryption with trash. Adding the user's trash folder to the encryption zone is the cheapest solution after HDFS-8830. It is a one-time setting (per user/per zone) requiring no code changes. We get snapshots of users' Trash automatically with the default trash policy, for free. This is better than manually snapshotting different users' Trash, or not supporting it at all with -skipTrash. We could still explore other solutions for trash in HDFS-8831. bq. I do think though that nested encryption zones would make this better yet (for reasons even besides trash), and would not be too difficult to implement. I understand a nested zone helps the case where the user's home folder may already be in an encryption zone for trash. Can you elaborate more on how nested zones would make this better overall? We could add them if that really helps. Provide Better Scratch Space and Soft Delete Support for HDFS Encryption Zones -- Key: HDFS-8747 URL: https://issues.apache.org/jira/browse/HDFS-8747 Project: Hadoop HDFS Issue Type: Bug Components: encryption Affects Versions: 2.6.0 Reporter: Xiaoyu Yao Assignee: Xiaoyu Yao Attachments: HDFS-8747-07092015.pdf, HDFS-8747-07152015.pdf, HDFS-8747-07292015.pdf HDFS Transparent Data Encryption At-Rest was introduced in Hadoop 2.6 to allow creating an encryption zone on top of a single HDFS directory. Files under the root directory of the encryption zone will be encrypted/decrypted transparently upon HDFS client write or read operations. Generally, it does not support rename (without data copying) across encryption zones, or between an encryption zone and a non-encryption zone, because of the different security settings of encryption zones. However, there are certain use cases where efficient rename support is desired. This JIRA is to propose better support of two such use cases, “Scratch Space” (a.k.a. staging area) and “Soft Delete” (a.k.a. trash), with HDFS encryption zones. “Scratch Space” is widely used in Hadoop jobs, which requires efficient rename support. Temporary files from MR jobs are usually stored in a staging area outside the encryption zone, such as the “/tmp” directory, and then renamed to the target directories as specified once the data is ready to be further processed. Below is a summary of supported/unsupported cases from the latest Hadoop: * Rename within the encryption zone is supported. * Renaming the entire encryption zone by moving the root directory of the zone is allowed. * Renaming a sub-directory/file from an encryption zone to a non-encryption zone is not allowed. * Renaming a sub-directory/file from encryption zone A to encryption zone B is not allowed. * Renaming from a non-encryption zone to an encryption zone is not allowed. “Soft delete” (a.k.a. trash) is a client-side feature that helps prevent accidental deletion of files and directories. If trash is enabled and a file or directory is deleted using the Hadoop shell, the file is moved to the .Trash directory of the user's home directory instead of being deleted. Deleted files are initially moved (renamed) to the Current sub-directory of the .Trash directory with the original path preserved. Files and directories in the trash can be restored simply by moving them to a location outside the .Trash directory.
Due to the limited rename support, deleting a sub-directory/file within an encryption zone with the trash feature enabled is not allowed. The client has to use the -skipTrash option to work around this. HADOOP-10902 and HDFS-6767 improved the error message, but without a complete solution to the problem. We propose to solve the problem by generalizing the mapping between an encryption zone and its underlying HDFS directories from 1:1 today to 1:N. The encryption zone should allow non-overlapping directories, such as scratch space or soft-delete trash locations, to be added/removed dynamically after creation. This way, rename for scratch space and soft delete can be better supported without breaking the assumption that rename is only supported within the zone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8643) Add snapshot names list to SnapshottableDirectoryStatus
[ https://issues.apache.org/jira/browse/HDFS-8643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh R updated HDFS-8643: --- Target Version/s: 2.8.0 Add snapshot names list to SnapshottableDirectoryStatus --- Key: HDFS-8643 URL: https://issues.apache.org/jira/browse/HDFS-8643 Project: Hadoop HDFS Issue Type: Improvement Reporter: Rakesh R Assignee: Rakesh R Attachments: HDFS-8643-00.patch The idea of this jira is to enhance {{SnapshottableDirectoryStatus}} by adding a {{snapshotNames}} attribute; presently it has only the {{snapshotNumber}}. IMHO this would help the users to get the list of snapshot names created. Also, the snapshot names can be used while renaming or deleting the snapshots.
{code}
// org.apache.hadoop.hdfs.protocol.SnapshottableDirectoryStatus.java
/**
 * @return Snapshot names for the directory.
 */
public List<String> getSnapshotNames() {
  return snapshotNames;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8270) create() always retried with hardcoded timeout when file already exists with open lease
[ https://issues.apache.org/jira/browse/HDFS-8270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated HDFS-8270: --- Attachment: HDFS-8270-branch-2.6-v3.patch Branch 2.6 patch attached. This was a trivial backport from the 2.7 patch. I applied it and ran TestFileCreation (which passed). create() always retried with hardcoded timeout when file already exists with open lease --- Key: HDFS-8270 URL: https://issues.apache.org/jira/browse/HDFS-8270 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 2.6.0 Reporter: Andrey Stepachev Assignee: J.Andreina Labels: 2.6.1-candidate Fix For: 2.7.1 Attachments: HDFS-8270-branch-2.6-v3.patch, HDFS-8270-branch-2.7-03.patch, HDFS-8270.1.patch, HDFS-8270.2.patch, HDFS-8270.3.patch In Hbase we stumbled on unexpected behaviour, which could break things. HDFS-6478 fixed wrong exception translation, but that apparently led to unexpected behaviour: clients trying to create a file without overwrite=true will be forced to retry for a hardcoded amount of time (60 seconds). That could break or slow down systems that use the filesystem for locks (like hbase fsck did, and we got it broken: HBASE-13574). We should make this behaviour configurable: does the client really need to wait for the lease timeout to be sure that the file doesn't exist, or should it be enough to fail fast? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8844) TestHDFSCLI does not cleanup the test directory
[ https://issues.apache.org/jira/browse/HDFS-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654197#comment-14654197 ] Colin Patrick McCabe commented on HDFS-8844: +1. Thanks, [~iwasakims]. TestHDFSCLI does not cleanup the test directory --- Key: HDFS-8844 URL: https://issues.apache.org/jira/browse/HDFS-8844 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Akira AJISAKA Assignee: Masatake Iwasaki Priority: Minor Attachments: HDFS-8844.001.patch If TestHDFSCLI is executed twice without {{mvn clean}}, the second try fails. Here are the failing test cases: {noformat} 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(231)) - Failing tests: 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(232)) - -- 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 226: get: getting non existent(absolute path) 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 227: get: getting non existent file(relative path) 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 228: get: Test for hdfs:// path - getting non existent 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 229: get: Test for Namenode's path - getting non existent 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 250: copyToLocal: non existent relative path 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 251: copyToLocal: non existent absolute path 2015-07-31 21:35:17,655 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 252: copyToLocal: Test for hdfs:// path - non existent file/directory 2015-07-31 21:35:17,655 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 253: copyToLocal: Test for Namenode's path - non existent file/directory {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8696) Reduce the variances of latency of WebHDFS
[ https://issues.apache.org/jira/browse/HDFS-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654210#comment-14654210 ] Bob Hansen commented on HDFS-8696: -- Still seeing some periodic slowdowns. Could be related to HDFS-8855; the periodic drop in established connections in that bug may correlate with the periodic jumps in latency we're seeing in this bug. Reduce the variances of latency of WebHDFS -- Key: HDFS-8696 URL: https://issues.apache.org/jira/browse/HDFS-8696 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Affects Versions: 2.7.0 Reporter: Xiaobing Zhou Assignee: Xiaobing Zhou Attachments: HDFS-8696.1.patch, HDFS-8696.2.patch, HDFS-8696.3.patch There is an issue that appears related to the webhdfs server. When making two concurrent requests, the DN will sometimes pause for extended periods (I've seen 1-300 seconds), killing performance and dropping connections. To reproduce: 1. Set up an HDFS cluster. 2. Upload a large file (I was using 10GB). Perform 1-byte reads, writing the time out to /tmp/times.txt:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  /usr/bin/time -f %e -o /tmp/times.txt -a curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root&length=1"
done
{noformat}
3. Watch for 1-byte requests that take more than one second: tail -F /tmp/times.txt | grep -E '^[^0]' 4. After it has had a chance to warm up, start doing large transfers from another shell:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  (/usr/bin/time -f %e curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root")
done
{noformat}
It's easy to find after a minute or two that small reads will sometimes pause for 1-300 seconds. In some extreme cases, it appears that the transfers time out and the DN drops the connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8829) DataNode sets SO_RCVBUF explicitly is disabling tcp auto-tuning
[ https://issues.apache.org/jira/browse/HDFS-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654211#comment-14654211 ] Colin Patrick McCabe commented on HDFS-8829: Why not add a configuration option for the receive buffer size, and if it is set to -1, use auto-tuning? DataNode sets SO_RCVBUF explicitly is disabling tcp auto-tuning --- Key: HDFS-8829 URL: https://issues.apache.org/jira/browse/HDFS-8829 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.3.0, 2.6.0 Reporter: He Tianyi Assignee: kanaka kumar avvaru
{code:java}
private void initDataXceiver(Configuration conf) throws IOException {
  // find free port or use privileged port provided
  TcpPeerServer tcpPeerServer;
  if (secureResources != null) {
    tcpPeerServer = new TcpPeerServer(secureResources);
  } else {
    tcpPeerServer = new TcpPeerServer(dnConf.socketWriteTimeout,
        DataNode.getStreamingAddr(conf));
  }
  tcpPeerServer.setReceiveBufferSize(HdfsConstants.DEFAULT_DATA_SOCKET_SIZE);
{code}
The last line sets SO_RCVBUF explicitly, thus disabling tcp auto-tuning on some systems. Shall we make this behavior configurable? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
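A sketch of that suggestion, continuing the {{initDataXceiver}} snippet above (the key name is hypothetical; any value <= 0, e.g. -1, would skip the explicit setsockopt and leave the kernel's receive-window auto-tuning in effect):
{code}
// Hypothetical key; <= 0 means "do not set SO_RCVBUF explicitly".
int recvBuf = conf.getInt("dfs.datanode.transfer.socket.recv.buffer.size",
    HdfsConstants.DEFAULT_DATA_SOCKET_SIZE);
if (recvBuf > 0) {
  tcpPeerServer.setReceiveBufferSize(recvBuf);
}
// recvBuf <= 0: no setReceiveBufferSize call, so TCP auto-tuning applies
{code}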
[jira] [Commented] (HDFS-8855) Webhdfs client leaks active NameNode connections
[ https://issues.apache.org/jira/browse/HDFS-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654206#comment-14654206 ] Bob Hansen commented on HDFS-8855: -- Note: to exacerbate the test case, set the environment to file_size=1 and read_size=1. Webhdfs client leaks active NameNode connections Key: HDFS-8855 URL: https://issues.apache.org/jira/browse/HDFS-8855 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Environment: HDP 2.2 Reporter: Bob Hansen The attached script simulates a process opening ~50 files via webhdfs and performing random reads. Note that there are at most 50 concurrent reads, and all webhdfs sessions are kept open. Each read is ~64k at a random position. The script periodically (once per second) shells into the NameNode and produces a summary of the socket states. For my test cluster with 5 nodes, it took ~30 seconds for the NameNode to have ~25000 active connections and fail. It appears that each request to the webhdfs client is opening a new connection to the NameNode and keeping it open after the request is complete. If the process continues to run, eventually (~30-60 seconds), all of the open connections are closed and the NameNode recovers. This smells like SoftReference reaping. Are we using SoftReferences in the webhdfs client to cache NameNode connections but never re-using them? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-8842) Catch throwable
[ https://issues.apache.org/jira/browse/HDFS-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-8842: --- Target Version/s: 2.7.1, 2.5.2, 2.5.1, 2.7.0, 2.6.0 (was: 2.6.0, 2.7.0, 2.5.1, 2.5.2, 2.7.1) Priority: Trivial (was: Critical) Catch throwable Key: HDFS-8842 URL: https://issues.apache.org/jira/browse/HDFS-8842 Project: Hadoop HDFS Issue Type: Bug Reporter: songwanging Priority: Trivial We came across a few instances where the code catches Throwable but fails to rethrow anything. Throwable is the parent type of Exception and Error, so catching Throwable means catching both Exceptions and Errors. An Exception is something you can recover from (like IOException); an Error is something more serious that usually you couldn't recover from easily (like NoClassDefFoundError), so it doesn't make much sense to catch an Error. We should convert catching Throwable to catching Exception. For example, in method tryGetPid(Process p) of hadoop-2.7.1-src\hadoop-common-project\hadoop-common\src\main\java\org\apache\hadoop\ha\ShellCommandFencer.java:
{code}
private static String tryGetPid(Process p) {
  try {
    ...
  } catch (Throwable t) {
    LOG.trace("Unable to determine pid for " + p, t);
    return null;
  }
}
{code}
And in method uncaughtException(Thread t, Throwable e) of hadoop-2.7.1-src\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-common\src\main\java\org\apache\hadoop\yarn\YarnUncaughtExceptionHandler.java:
{code}
public void uncaughtException(Thread t, Throwable e) {
  ...
  try {
    LOG.fatal("Thread " + t + " threw an Error. Shutting down now...", e);
  } catch (Throwable err) {
    // We don't want to not exit because of an issue with logging
  }
  ...
  try {
    System.err.println("Halting due to Out Of Memory Error...");
  } catch (Throwable err) {
    // Again we don't want to exit because of logging issues.
  }
  ...
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
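For illustration, a self-contained sketch of the conversion the reporter proposes for tryGetPid: catch Exception (recoverable) and let Errors propagate, instead of swallowing every Throwable. The logger and the reflective lookup are stand-ins for the class's real internals:
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class NarrowCatchSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(NarrowCatchSketch.class);

  static String tryGetPid(Process p) {
    try {
      return reflectivePidLookup(p);   // placeholder for the real lookup
    } catch (Exception e) {            // was: catch (Throwable t)
      LOG.trace("Unable to determine pid for " + p, e);
      return null;
    }
  }

  // Hypothetical stand-in; the real code reflects on a UNIX Process impl.
  private static String reflectivePidLookup(Process p) throws Exception {
    java.lang.reflect.Field f = p.getClass().getDeclaredField("pid");
    f.setAccessible(true);
    return String.valueOf(f.getInt(p));
  }
}
{code}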
[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4
[ https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654234#comment-14654234 ] Colin Patrick McCabe commented on HDFS-8791: The motivation behind the new layout was to eventually free the DataNode of the need to keep all block metadata in memory at all times. Basically, we are entering a world where hard drive storage capacities double every year, but CPU and network increase at a relatively slower pace. So keeping around information about every replica permanently paged into memory looks antiquated. The new layout lets us avoid this by being able to find any block just based on its ID. It is basically the equivalent of paged metadata, but for the DN. We didn't think about the du problem when discussing the new layout. It looks like HDFS ends up running a du on all of the replica files quite a lot. It's something we do after every I/O error, and also something we do on startup. I think it's pretty silly that we run du after every I/O error-- we could certainly change that-- and the fact that it's not rate-limited is even worse. We don't even confine the du to the drive where the I/O error occurred, but do it on every drive... I don't think anyone can give a good reason for that and it should certainly be changed as well. The startup issue is more difficult to avoid. If we have to do a du on all files during startup, then it could cause very long startup times if that involves a lot of seeks. It seems like both the old and the new layout would have major problems with this scenario-- if you look out a year or two and multiply the current number of replicas by 8 or 16. If we are going to bump layout version again we might want to consider something like keeping the replica metadata in leveldb. This would avoid the need to do a du on startup and allow us to control our own caching. It could also cut the number of ext4 files in half since we wouldn't need {{meta}} any more. block ID-based DN storage layout can be very slow for datanode on ext4 -- Key: HDFS-8791 URL: https://issues.apache.org/jira/browse/HDFS-8791 Project: Hadoop HDFS Issue Type: Bug Components: datanode Affects Versions: 2.6.0 Reporter: Nathan Roberts Priority: Critical We are seeing cases where the new directory layout basically causes the datanode's disks to seek for 10s of minutes. This can be when the datanode is running du, and it can also be when it is performing a checkDirs(). Both of these operations currently scan all directories in the block pool, and that's very expensive in the new layout. The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K leaf directories where block files are placed. So, what we have on disk is: - 256 inodes for the first level directories - 256 directory blocks for the first level directories - 256*256 inodes for the second level directories - 256*256 directory blocks for the second level directories - Then the inodes and blocks to store the HDFS blocks themselves. The main problem is the 256*256 directory blocks. Inodes and dentries will be cached by Linux, and one can configure how likely the system is to prune those entries (vfs_cache_pressure). However, ext4 relies on the buffer cache to cache the directory blocks, and I'm not aware of any way to tell Linux to favor buffer cache pages (even if it did, I'm not sure I would want it to in general).
Also, ext4 tries hard to spread directories evenly across the entire volume; this basically means the 64K directory blocks are probably randomly spread across the entire disk. A du-type scan will look at directories one at a time, so the ioscheduler can't optimize the corresponding seeks, meaning the seeks will be random and far. In a system I was using to diagnose this, I had 60K blocks. A du when things are hot is less than 1 second. When things are cold, about 20 minutes. How do things get cold? - A large set of tasks run on the node. This pushes almost all of the buffer cache out, causing the next du to hit this situation. We are seeing cases where a large job can cause a seek storm across the entire cluster. Why didn't the previous layout see this? - It might have, but it wasn't nearly as pronounced. The previous layout would be a few hundred directory blocks. Even when completely cold, these would only take a few hundred seeks, which would mean single-digit seconds. - With only a few hundred directories, the odds of the directory blocks getting modified are quite high; this keeps those blocks hot and much less likely to be evicted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
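As a rough illustration of the leveldb direction Colin mentions, here is a sketch using the org.iq80.leveldb API that Hadoop already ships elsewhere (e.g. for the YARN NM state store). This is a direction sketch under assumed key/value conventions, not a patch:
{code}
import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import java.nio.charset.StandardCharsets;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

public class ReplicaDbSketch {
  public static void main(String[] args) throws Exception {
    Options opts = new Options().createIfMissing(true);
    try (DB db = factory.open(new File("/tmp/replica-meta-sketch"), opts)) {
      // Key the entry by block ID; store genstamp and length so startup
      // can enumerate replicas without a du()-style directory walk.
      db.put("blk_1073741825".getBytes(StandardCharsets.UTF_8),
          "gs=1001,len=134217728".getBytes(StandardCharsets.UTF_8));
    }
  }
}
{code}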
[jira] [Commented] (HDFS-8696) Reduce the variances of latency of WebHDFS
[ https://issues.apache.org/jira/browse/HDFS-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654263#comment-14654263 ] Jun Yin commented on HDFS-8696: --- Good catch. Actually we use V3, which Xiaobing sent by email. The dfs.webhdfs.server.worker.threads was something that an old patch left in the config file; it has no effect with the latest patch, I think. Reduce the variances of latency of WebHDFS -- Key: HDFS-8696 URL: https://issues.apache.org/jira/browse/HDFS-8696 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Affects Versions: 2.7.0 Reporter: Xiaobing Zhou Assignee: Xiaobing Zhou Attachments: HDFS-8696.1.patch, HDFS-8696.2.patch, HDFS-8696.3.patch There is an issue that appears related to the webhdfs server. When making two concurrent requests, the DN will sometimes pause for extended periods (I've seen 1-300 seconds), killing performance and dropping connections. To reproduce: 1. Set up an HDFS cluster. 2. Upload a large file (I was using 10GB). Perform 1-byte reads, writing the time out to /tmp/times.txt:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  /usr/bin/time -f %e -o /tmp/times.txt -a curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root&length=1"
done
{noformat}
3. Watch for 1-byte requests that take more than one second: tail -F /tmp/times.txt | grep -E '^[^0]' 4. After it has had a chance to warm up, start doing large transfers from another shell:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  (/usr/bin/time -f %e curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root")
done
{noformat}
It's easy to find after a minute or two that small reads will sometimes pause for 1-300 seconds. In some extreme cases, it appears that the transfers time out and the DN drops the connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8643) Add snapshot names list to SnapshottableDirectoryStatus
[ https://issues.apache.org/jira/browse/HDFS-8643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654334#comment-14654334 ] Xiaoyu Yao commented on HDFS-8643: -- Thanks [~rakeshr] for adding this. Patch LGTM. +1 One question: have you thought of enhancing the *lsSnapshottableDir* cmd of the HDFS CLI with an additional parameter to display the snapshot names? This could be done in a separate JIRA. Add snapshot names list to SnapshottableDirectoryStatus --- Key: HDFS-8643 URL: https://issues.apache.org/jira/browse/HDFS-8643 Project: Hadoop HDFS Issue Type: Improvement Reporter: Rakesh R Assignee: Rakesh R Attachments: HDFS-8643-00.patch The idea of this jira is to enhance {{SnapshottableDirectoryStatus}} by adding a {{snapshotNames}} attribute; presently it has only the {{snapshotNumber}}. IMHO this would help the users to get the list of snapshot names created. Also, the snapshot names can be used while renaming or deleting the snapshots.
{code}
// org.apache.hadoop.hdfs.protocol.SnapshottableDirectoryStatus.java
/**
 * @return Snapshot names for the directory.
 */
public List<String> getSnapshotNames() {
  return snapshotNames;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
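For context, a small usage sketch of the proposed accessor, assuming the patch's {{getSnapshotNames()}} lands as described ({{dfs}} is a {{DistributedFileSystem}}):
{code}
// List each snapshottable directory with its snapshot names.
// getSnapshotNames() is the accessor proposed in this JIRA, not yet
// a released API.
SnapshottableDirectoryStatus[] dirs = dfs.getSnapshottableDirListing();
for (SnapshottableDirectoryStatus s : dirs) {
  System.out.println(s.getFullPath() + " snapshots: " + s.getSnapshotNames());
}
{code}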
[jira] [Updated] (HDFS-8844) TestHDFSCLI does not cleanup the test directory
[ https://issues.apache.org/jira/browse/HDFS-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HDFS-8844: --- Resolution: Fixed Fix Version/s: 2.8.0 Target Version/s: 2.8.0 Status: Resolved (was: Patch Available) TestHDFSCLI does not cleanup the test directory --- Key: HDFS-8844 URL: https://issues.apache.org/jira/browse/HDFS-8844 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Akira AJISAKA Assignee: Masatake Iwasaki Priority: Minor Fix For: 2.8.0 Attachments: HDFS-8844.001.patch If TestHDFSCLI is executed twice without {{mvn clean}}, the second try fails. Here are the failing test cases: {noformat} 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(231)) - Failing tests: 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(232)) - -- 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 226: get: getting non existent(absolute path) 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 227: get: getting non existent file(relative path) 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 228: get: Test for hdfs:// path - getting non existent 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 229: get: Test for Namenode's path - getting non existent 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 250: copyToLocal: non existent relative path 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 251: copyToLocal: non existent absolute path 2015-07-31 21:35:17,655 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 252: copyToLocal: Test for hdfs:// path - non existent file/directory 2015-07-31 21:35:17,655 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 253: copyToLocal: Test for Namenode's path - non existent file/directory {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8696) Reduce the variances of latency of WebHDFS
[ https://issues.apache.org/jira/browse/HDFS-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654152#comment-14654152 ] Jun commented on HDFS-8696: --- Hi Xiaobing, from my test we got some unexpected results when reading a larger file (18G). Case #1 - unpatched; hdfs-site.xml has the following parameters: dfs.webhdfs.server.worker.threads = 100; dfs.webhdfs.server.max.connection.queue.length = 1024; dfs.webhdfs.net.send.buf.size = 65535; dfs.webhdfs.net.receive.buf.size = 65535; dfs.webhdfs.channel.write.buf.low.watermark = 65535; dfs.webhdfs.channel.write.buf.high.watermark = 131070. Large read test:
{noformat}
$ while (true); do /usr/bin/time -f %e -o /tmp/times.txt -a curl -s -L -o /dev/null "http://NN:50070/webhdfs/v1/tmp/catalog_sales_38_50.dat?op=OPEN&user.name=release"; done
$ while (true); do /usr/bin/time -f %e -o /tmp/times.txt -a curl -s -L -o /dev/null "http://NN:50070/webhdfs/v1/tmp/catalog_sales_38_50.dat?op=OPEN&user.name=release&length=1"; done
$ tail -F /tmp/times.txt | grep -E '^[^0]'
{noformat}
Result: according to /tmp/times.txt, delays are in the range 30-60s. Case #2 - patched, with the required parameters also set in hdfs-site.xml; the same large read test as case #1. Result: delays are in the range of 40-90s, with 2 extremely slow reads at 155s and 174s. I will update with some percentiles later. Thanks Reduce the variances of latency of WebHDFS -- Key: HDFS-8696 URL: https://issues.apache.org/jira/browse/HDFS-8696 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Affects Versions: 2.7.0 Reporter: Xiaobing Zhou Assignee: Xiaobing Zhou Attachments: HDFS-8696.1.patch, HDFS-8696.2.patch, HDFS-8696.3.patch There is an issue that appears related to the webhdfs server. When making two concurrent requests, the DN will sometimes pause for extended periods (I've seen 1-300 seconds), killing performance and dropping connections. To reproduce: 1. Set up an HDFS cluster. 2. Upload a large file (I was using 10GB). Perform 1-byte reads, writing the time out to /tmp/times.txt:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  /usr/bin/time -f %e -o /tmp/times.txt -a curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root&length=1"
done
{noformat}
3. Watch for 1-byte requests that take more than one second: tail -F /tmp/times.txt | grep -E '^[^0]' 4. After it has had a chance to warm up, start doing large transfers from another shell:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  (/usr/bin/time -f %e curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root")
done
{noformat}
It's easy to find after a minute or two that small reads will sometimes pause for 1-300 seconds. In some extreme cases, it appears that the transfers time out and the DN drops the connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8696) Reduce the variances of latency of WebHDFS
[ https://issues.apache.org/jira/browse/HDFS-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654168#comment-14654168 ] Bob Hansen commented on HDFS-8696: -- Jun - thanks for posting those. Can you break down what percentage of requests were <50ms, 50-1000ms, and >1000ms for the patched and unpatched sets? The patched test was with patch v1 (which used the dfs.webhdfs.server.worker.threads setting), correct? Reduce the variances of latency of WebHDFS -- Key: HDFS-8696 URL: https://issues.apache.org/jira/browse/HDFS-8696 Project: Hadoop HDFS Issue Type: Improvement Components: webhdfs Affects Versions: 2.7.0 Reporter: Xiaobing Zhou Assignee: Xiaobing Zhou Attachments: HDFS-8696.1.patch, HDFS-8696.2.patch, HDFS-8696.3.patch There is an issue that appears related to the webhdfs server. When making two concurrent requests, the DN will sometimes pause for extended periods (I've seen 1-300 seconds), killing performance and dropping connections. To reproduce: 1. Set up an HDFS cluster. 2. Upload a large file (I was using 10GB). Perform 1-byte reads, writing the time out to /tmp/times.txt:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  /usr/bin/time -f %e -o /tmp/times.txt -a curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root&length=1"
done
{noformat}
3. Watch for 1-byte requests that take more than one second: tail -F /tmp/times.txt | grep -E '^[^0]' 4. After it has had a chance to warm up, start doing large transfers from another shell:
{noformat}
i=1
while (true); do
  echo $i
  let i++
  (/usr/bin/time -f %e curl -s -L -o /dev/null "http://namenode:50070/webhdfs/v1/tmp/bigfile?op=OPEN&user.name=root")
done
{noformat}
It's easy to find after a minute or two that small reads will sometimes pause for 1-300 seconds. In some extreme cases, it appears that the transfers time out and the DN drops the connection. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8762) Erasure Coding: the log of each streamer should show its index
[ https://issues.apache.org/jira/browse/HDFS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654327#comment-14654327 ] Zhe Zhang commented on HDFS-8762: - Thanks Bo for the work and Walter for the comment. # I don't think we should change {{createSocketForPipeline}} and other static elements to non-static. Being {{static}} means that we don't need an instance of {{DataStreamer}} to call the methods. Then why do we need to know the streamer ID in debugging? For example, {{createSocketForPipeline}} only creates a socket to a given DN. Some other logic (such as {{createBlockOutputStream}}) will associate it with a streamer, and that's where we should print streamer ID. # Adding a trivial {{logString}} method in {{DataStreamer}} looks a little awkward. Appending {{this}} to every log message (as Walter suggested) is fine for {{debug}} but too much for {{info}}. How about similar to Walter's suggestion, but append {{this.appendToLog()}} instead of {{this}}? Then {{StripedDataStreamer}} can override {{appendToLog}}. This is similar to {{Block#appendStringTo}}. Erasure Coding: the log of each streamer should show its index -- Key: HDFS-8762 URL: https://issues.apache.org/jira/browse/HDFS-8762 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Li Bo Assignee: Li Bo Attachments: HDFS-8762-HDFS-7285-001.patch, HDFS-8762-HDFS-7285-002.patch The log in {{DataStreamer}} doesn't show which streamer it's generated from. In order to make log information more convenient for debugging, each log should include the index of the streamer it's generated from. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
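A self-contained sketch of that suggestion ({{appendToLog}} is the hypothetical hook name here; the striped subclass overrides it to add its index):
{code}
class DataStreamer {
  protected String appendToLog() {
    return "";                                   // base: nothing extra
  }
  void logExample(String block) {
    System.out.println("Closing old block " + block + appendToLog());
  }
}

class StripedDataStreamer extends DataStreamer {
  private final int index;
  StripedDataStreamer(int index) { this.index = index; }
  @Override
  protected String appendToLog() {
    return " [streamer #" + index + "]";         // identifies the streamer
  }
}
{code}
With this shape, new StripedDataStreamer(3).logExample("blk_1001") would print "Closing old block blk_1001 [streamer #3]", while the base streamer's messages stay unchanged.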
[jira] [Commented] (HDFS-8844) TestHDFSCLI does not cleanup the test directory
[ https://issues.apache.org/jira/browse/HDFS-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654226#comment-14654226 ] Hudson commented on HDFS-8844: -- FAILURE: Integrated in Hadoop-trunk-Commit #8260 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8260/]) HDFS-8844. TestHDFSCLI does not cleanup the test directory (Masatake Iwasaki via Colin P. McCabe) (cmccabe: rev c95993cbaf51e2925ea9b1b95cf4f0d879e66489) * hadoop-hdfs-project/hadoop-hdfs/src/test/resources/testHDFSConf.xml * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt TestHDFSCLI does not cleanup the test directory --- Key: HDFS-8844 URL: https://issues.apache.org/jira/browse/HDFS-8844 Project: Hadoop HDFS Issue Type: Bug Components: test Reporter: Akira AJISAKA Assignee: Masatake Iwasaki Priority: Minor Attachments: HDFS-8844.001.patch If TestHDFSCLI is executed twice without {{mvn clean}}, the second try fails. Here are the failing test cases: {noformat} 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(231)) - Failing tests: 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(232)) - -- 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 226: get: getting non existent(absolute path) 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 227: get: getting non existent file(relative path) 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 228: get: Test for hdfs:// path - getting non existent 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 229: get: Test for Namenode's path - getting non existent 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 250: copyToLocal: non existent relative path 2015-07-31 21:35:17,654 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 251: copyToLocal: non existent absolute path 2015-07-31 21:35:17,655 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 252: copyToLocal: Test for hdfs:// path - non existent file/directory 2015-07-31 21:35:17,655 [main] INFO cli.CLITestHelper (CLITestHelper.java:displayResults(238)) - 253: copyToLocal: Test for Namenode's path - non existent file/directory {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-8772) fix TestStandbyIsHot#testDatanodeRestarts which occasionally fails
[ https://issues.apache.org/jira/browse/HDFS-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654641#comment-14654641 ] Andrew Wang commented on HDFS-8772:
------------------------------------

LGTM, just one question: should we bump up the waitFor to something less aggressive like 30s, and the overall test timeout from 60s to 120s? Test boxes can be overloaded at times and run slowly, so I'd like some more safety margin to avoid flakiness.

fix TestStandbyIsHot#testDatanodeRestarts which occasionally fails
------------------------------------------------------------------

Key: HDFS-8772
URL: https://issues.apache.org/jira/browse/HDFS-8772
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Walter Su
Assignee: Walter Su
Attachments: HDFS-8772.01.patch, HDFS-8772.02.patch, HDFS-8772.03.patch

https://builds.apache.org/job/PreCommit-HDFS-Build/11596/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11598/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11599/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11600/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11606/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11608/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11612/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11618/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11650/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11655/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11659/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11663/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11664/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11667/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11669/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11676/testReport/
https://builds.apache.org/job/PreCommit-HDFS-Build/11677/testReport/
{noformat}
java.lang.AssertionError: expected:<0> but was:<4>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyIsHot.testDatanodeRestarts(TestStandbyIsHot.java:188)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
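Concretely, the suggestion amounts to widening both timeouts; a hedged sketch of the shape of that change (the polled condition is a hypothetical placeholder):
{code}
import com.google.common.base.Supplier;
import org.apache.hadoop.test.GenericTestUtils;
import org.junit.Test;

public class TimeoutExample {
  @Test(timeout = 120000)  // overall test timeout raised from 60s to 120s
  public void testDatanodeRestarts() throws Exception {
    // ... restart datanodes ...
    // Poll every second, but allow up to 30s on an overloaded test box.
    GenericTestUtils.waitFor(new Supplier<Boolean>() {
      @Override
      public Boolean get() {
        return isClusterHealthy();  // hypothetical condition
      }
    }, 1000, 30000);
  }

  private boolean isClusterHealthy() {
    return true;  // placeholder for the real check
  }
}
{code}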
[jira] [Commented] (HDFS-8849) fsck should report number of missing blocks with replication factor 1
[ https://issues.apache.org/jira/browse/HDFS-8849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654647#comment-14654647 ] Aaron T. Myers commented on HDFS-8849:
---------------------------------------

Allen, I've seen plenty of users who at some point in the past have run TeraSort on their cluster, and for that job the default output replication is 1. If a DN that was holding some TeraSort output then goes offline, those blocks appear missing, and users get concerned because they see missing blocks on the NN web UI and via dfsadmin -report/fsck, but it's not obvious that those blocks were in fact set to replication factor 1. In my experience this is really quite common, so it definitely seems like something worth addressing to me. How we go about addressing it should certainly be discussed, and it could be that including this information in fsck doesn't make sense, but let's try to come up with something that does address the issue.

Separately, using phrases like "Meanwhile, back in real life" and calling a proposed improvement a "useless feature" is not an appropriate way to communicate in this forum. Let's please try to keep the communication constructive, not unnecessarily hostile. Comments like those contribute to the perception that our community is difficult to contribute to.

fsck should report number of missing blocks with replication factor 1
----------------------------------------------------------------------

Key: HDFS-8849
URL: https://issues.apache.org/jira/browse/HDFS-8849
Project: Hadoop HDFS
Issue Type: Improvement
Components: tools
Affects Versions: 2.7.1
Reporter: Zhe Zhang
Assignee: Zhe Zhang
Priority: Minor

HDFS-7165 supports reporting the number of blocks with replication factor 1 in {{dfsadmin}} and NN metrics, but it didn't extend {{fsck}} with the same support, which is the aim of this JIRA.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
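For context, the scenario described above is straightforward to reproduce in a test; a minimal sketch, assuming a {{MiniDFSCluster}}-style setup (the path is hypothetical):
{code}
// Assumes: MiniDFSCluster cluster, started with one DataNode.
// Write a small file at replication factor 1, as TeraSort output is.
FileSystem fs = cluster.getFileSystem();
Path part = new Path("/terasort-out/part-0");  // hypothetical path
DFSTestUtil.createFile(fs, part, 1024L, (short) 1, 0L);

// Stopping the one DataNode holding the replica makes the block
// "missing" on the NN web UI and in dfsadmin -report; this JIRA
// proposes surfacing the repl-1 count in fsck as well.
cluster.stopDataNode(0);
{code}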
[jira] [Commented] (HDFS-8838) Tolerate datanode failures in DFSStripedOutputStream when the data length is small
[ https://issues.apache.org/jira/browse/HDFS-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654654#comment-14654654 ] Tsz Wo Nicholas Sze commented on HDFS-8838:
--------------------------------------------

[~walter.k.su], thanks for running the tests. I tried to use the same MiniDFSCluster to run all the tests; that is why I had changed the DN dir names. Otherwise, the dir used by an old datanode cannot be re-used by a new datanode. It needs more work since some other tests failed. Making TestBase not public is fine on my machine; not sure why it does not work for yours. Reverted both for the moment.

{quote}
3. The numBytes returned from getBlockGroup() is wrong. We should use the old method, which is to use streamer #0 to check the block boundary, but not the other streamers, as it used to be before the patch.
{quote}
I changed it to check the cell boundary instead of the block group boundary. If we use the old method, some of the new tests fail. However, TestDFSStripedOutputStream fails after the change. I will see what's wrong.

Tolerate datanode failures in DFSStripedOutputStream when the data length is small
-----------------------------------------------------------------------------------

Key: HDFS-8838
URL: https://issues.apache.org/jira/browse/HDFS-8838
Project: Hadoop HDFS
Issue Type: Sub-task
Components: hdfs-client
Reporter: Tsz Wo Nicholas Sze
Assignee: Tsz Wo Nicholas Sze
Attachments: h8838_20150729.patch, h8838_20150731-HDFS-7285.patch, h8838_20150731.log, h8838_20150731.patch, h8838_20150804.patch

Currently, DFSStripedOutputStream cannot tolerate datanode failures when the data length is small. We fix the bugs here and add more tests.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
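To make the distinction being debated concrete (names and method shapes are illustrative only, not the patch's code): a cell boundary repeats every {{cellSize}} bytes of written data, while a block group boundary repeats only every {{numDataBlocks * blockSize}} bytes.
{code}
// Illustrative only: the two boundary checks being compared above,
// for a striped layout with numDataBlocks data blocks per group.
static boolean atCellBoundary(long bytesWritten, int cellSize) {
  return bytesWritten % cellSize == 0;
}

static boolean atBlockGroupBoundary(long bytesWritten, long blockSize,
    int numDataBlocks) {
  // A full block group holds numDataBlocks internal blocks of blockSize.
  return bytesWritten % (blockSize * numDataBlocks) == 0;
}
{code}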
[jira] [Resolved] (HDFS-7865) NullPointerException in SimulatedFSDataset
[ https://issues.apache.org/jira/browse/HDFS-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula resolved HDFS-7865.
-----------------------------------------
Resolution: Implemented

NullPointerException in SimulatedFSDataset
------------------------------------------

Key: HDFS-7865
URL: https://issues.apache.org/jira/browse/HDFS-7865
Project: Hadoop HDFS
Issue Type: Bug
Components: test
Reporter: Tsz Wo Nicholas Sze
Assignee: Brahma Reddy Battula
Priority: Minor

https://builds.apache.org/job/PreCommit-HDFS-Build/9690//testReport/org.apache.hadoop.hdfs.server.balancer/TestBalancer/testUnknownDatanode/
{noformat}
java.lang.NullPointerException
	at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset$BInfo.access$400(SimulatedFSDataset.java:126)
	at org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset.getPinning(SimulatedFSDataset.java:1319)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:969)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:244)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:249)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
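The stack trace points at a lookup that dereferences a block the simulated dataset doesn't know about; the usual defensive shape is below (a sketch only, with the map and field names as assumptions, and not the change that actually closed this JIRA):
{code}
// Sketch of the defensive pattern for getPinning-style lookups in
// SimulatedFSDataset; blockMap and the pinned field are assumed names.
public boolean getPinning(ExtendedBlock b) throws IOException {
  BInfo binfo = blockMap.get(b.getLocalBlock());
  if (binfo == null) {
    // Fail with a descriptive IOException instead of a bare NPE.
    throw new IOException("Block " + b + " is not valid");
  }
  return binfo.pinned;
}
{code}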
[jira] [Commented] (HDFS-8827) Erasure Coding: When namenode processes over replicated striped block, NPE will occur in ReplicationMonitor
[ https://issues.apache.org/jira/browse/HDFS-8827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654509#comment-14654509 ] Zhe Zhang commented on HDFS-8827:
----------------------------------

Thanks for adding the test [~tfukudom]. I ran the test and it didn't give an NPE. Instead, it reported an assertion error at the end:
{code}
java.lang.AssertionError:
Expected :8
Actual   :9
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.hdfs.DFSTestUtil.verifyLocatedStripedBlocks(DFSTestUtil.java:1983)
	at org.apache.hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks.testProcessOverReplicatedSBSmallerThanFullBlocks(TestAddOverReplicatedStripedBlocks.java:156)
{code}

Erasure Coding: When namenode processes over replicated striped block, NPE will occur in ReplicationMonitor
-----------------------------------------------------------------------------------------------------------

Key: HDFS-8827
URL: https://issues.apache.org/jira/browse/HDFS-8827
Project: Hadoop HDFS
Issue Type: Sub-task
Reporter: Takuya Fukudome
Assignee: Takuya Fukudome
Attachments: HDFS-8827.1.patch, processing-over-replica-npe.log

In our test cluster, when the namenode processed over-replicated striped blocks, a null pointer exception (NPE) occurred. This happened in the following situation:
1) some datanodes shut down.
2) the namenode recovers block groups which lost internal blocks.
3) the stopped datanodes are restarted.
4) the namenode processes over-replicated striped blocks.
5) an NPE occurs.
I think BlockPlacementPolicyDefault#chooseReplicaToDelete will return null in this situation, which causes this NPE problem.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
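If {{chooseReplicaToDelete}} can indeed return null for striped block groups, the caller in the over-replication path needs a guard; a minimal sketch of that pattern (the parameter and local names are illustrative, not the actual NameNode code):
{code}
// Caller-side guard: chooseReplicaToDelete may return null for striped
// blocks, so stop processing instead of dereferencing the result.
DatanodeStorageInfo chosen = chooseReplicaToDelete(
    bc, block, replication, moreThanOne, exactlyOne, excessTypes);
if (chosen == null) {
  break;  // nothing safe to delete; avoids the NPE in ReplicationMonitor
}
{code}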
[jira] [Commented] (HDFS-8854) Erasure coding: Move cellSize inside ECSchema
[ https://issues.apache.org/jira/browse/HDFS-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654539#comment-14654539 ] Zhe Zhang commented on HDFS-8854:
----------------------------------

Thanks Walter for taking this task. I think we need to separate the discussion into 2 questions:
# Whether we need an abstraction for the current ECSchema + cellSize. I think we clearly do. You have a good summary of the advantages:
{quote}
Advantages:
1. shorter args list, cleaner code.
2. store one field in BlockInfoStriped, instead of two fields. Less memory usage.
{quote}
# Naming of the new abstraction. We probably shouldn't call it a schema, because _schema_ is already a basic concept in the codec. I think {{ErasureCodingPolicy}} is a better name. Maybe we can keep the {{ECSchema}} class in COMMON, and refactor the HDFS code to use {{ErasureCodingPolicy}} instead?

Erasure coding: Move cellSize inside ECSchema
---------------------------------------------

Key: HDFS-8854
URL: https://issues.apache.org/jira/browse/HDFS-8854
Project: Hadoop HDFS
Issue Type: Sub-task
Reporter: Walter Su
Assignee: Walter Su

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
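A minimal sketch of the proposed abstraction: a policy object pairing the codec schema with the striping cell size (field and accessor names here are assumptions, not a final API):
{code}
public final class ErasureCodingPolicy {
  private final ECSchema schema;  // codec schema, e.g. RS 6+3
  private final int cellSize;     // striping cell size in bytes

  public ErasureCodingPolicy(ECSchema schema, int cellSize) {
    this.schema = schema;
    this.cellSize = cellSize;
  }

  public ECSchema getSchema() { return schema; }
  public int getCellSize() { return cellSize; }
}
{code}
With something of this shape, {{BlockInfoStriped}} could hold a single reference to the policy rather than a schema plus a separate cellSize field, which is exactly the memory and argument-list win quoted above.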