[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-01-24 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837255#comment-15837255
 ] 

Chen Zhang commented on HDFS-11303:
---

Got it, thanks a lot for your explanation, Stack.




> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.
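
To make the failure mode concrete, here is a minimal, self-contained sketch of 
such a retry loop (my own simplification for illustration; this is not the 
actual DFSInputStream code). The exit condition depends on the futures list 
draining, but a future that completes exceptionally is never removed from it, 
so once every DN read has failed, an unbounded version of this loop can never 
terminate:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HedgedReadHangSketch {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newCachedThreadPool();
    CompletionService<byte[]> hedged = new ExecutorCompletionService<>(pool);
    List<Future<byte[]>> futures = new ArrayList<>();

    // All "datanode reads" fail, as in the reported scenario.
    for (int dn = 0; dn < 3; dn++) {
      futures.add(hedged.submit(() -> {
        throw new IOException("read failed");
      }));
    }

    // The real loop runs while 'futures' is non-empty; this sketch bounds the
    // spins so it terminates instead of literally hanging.
    int spins = 0;
    while (!futures.isEmpty() && spins++ < 5) {
      Future<byte[]> done = hedged.poll(100, TimeUnit.MILLISECONDS);
      if (done == null) {
        continue; // nothing completed in this interval; keep waiting
      }
      try {
        done.get(); // a successful read would return data here
        return;
      } catch (ExecutionException e) {
        // BUG: the failed future is not removed from 'futures', so the list
        // stays non-empty and an unbounded loop would spin forever.
      }
    }
    System.out.println("futures never drained: " + futures.size()); // prints 3
    pool.shutdownNow();
  }
}
{code}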






[jira] [Commented] (HDFS-10927) Lease Recovery: File not getting closed on HDFS when block write operation fails

2017-02-16 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871310#comment-15871310
 ] 

Chen Zhang commented on HDFS-10927:
---

Hey guys, I just found an issue like this.
An HBase RegionServer got a DiskFull exception while writing WAL files, and the 
client failed after several retries. When the Master tried to use recoverLease 
to recover these files, we got almost the same logs as [~ngoswami] attached:
{quote}
java.io.IOException: File length mismatched.  The length of 
/home/work/ssd11/hdfs//datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
 is 174432256 but r=ReplicaBeingWritten, blk_1073970099_229357, RBW
  getNumBytes() = 174437376
  getBytesOnDisk()  = 174429752
  getVisibleLength()= 174429752
  getVolume()   = /home/work/ssd11/hdfs/x/datanode/current
  getBlockFile()= 
/home/work/ssd11/hdfs/x/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
  bytesAcked=174429752
  bytesOnDisk=174429752
{quote}

It's caused by an exception from out.write() in receivePacket() of 
BlockReceiver. 
receivePacket() first updates numBytes in the replicaInfo, then writes the data 
to disk, and updates bytesOnDisk last; a DiskFull exception between those steps 
leaves the lengths inconsistent.
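
As a hedged illustration of that ordering (a self-contained toy with stand-in 
types; this is not the real BlockReceiver), note how an exception from the 
write leaves the two length fields permanently out of step, matching the 
mismatch in the logs above:

{code:java}
import java.io.IOException;
import java.io.OutputStream;

public class ReceivePacketSketch {
  // Stand-in for the replica's in-memory bookkeeping (toy type, not Hadoop's).
  static class Replica {
    long numBytes;     // corresponds to getNumBytes()
    long bytesOnDisk;  // corresponds to getBytesOnDisk()
  }

  static void receivePacket(Replica r, OutputStream out, byte[] packet)
      throws IOException {
    r.numBytes += packet.length;    // 1. replica length advanced first
    out.write(packet);              // 2. disk full -> IOException thrown here
    r.bytesOnDisk += packet.length; // 3. never reached on failure
  }

  public static void main(String[] args) {
    Replica r = new Replica();
    OutputStream fullDisk = new OutputStream() {
      @Override public void write(int b) throws IOException {
        throw new IOException("No space left on device"); // simulated DiskFull
      }
    };
    try {
      receivePacket(r, fullDisk, new byte[4096]);
    } catch (IOException e) {
      // Same shape as the logs above: numBytes is ahead of bytesOnDisk.
      System.out.println("numBytes=" + r.numBytes
          + " bytesOnDisk=" + r.bytesOnDisk);
    }
  }
}
{code}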


> Lease Recovery: File not getting closed on HDFS when block write operation 
> fails
> 
>
> Key: HDFS-10927
> URL: https://issues.apache.org/jira/browse/HDFS-10927
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.7.1
>Reporter: Nitin Goswami
>
> HDFS was unable to close a file when a block write operation failed because 
> disk usage was too high.
> Scenario:
> HBase was writing WAL logs on HDFS and the disk usage was too high at that 
> time. While writing these WAL logs, one of the block write operations failed 
> with the following exception:
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.BufferedInputStream.fill(Unknown Source)
> at java.io.BufferedInputStream.read1(Unknown Source)
> at java.io.BufferedInputStream.read(Unknown Source)
> at java.io.DataInputStream.read(Unknown Source)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
> at java.lang.Thread.run(Unknown Source)
> After this exception, HBase tried to close/rollover the WAL file, but that 
> call also failed and the WAL file couldn't be closed. After this, HBase 
> closed the region server.
> After some time, Lease Recovery was triggered for this file and the following 
> exceptions started occurring:
> 2016-09-13 11:51:11,743 WARN 
> org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to 
> obtain replica info for block 
> (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from 
> datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
> java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < 
> getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
>   getNumBytes() = 45524696
>   getBytesOnDisk()  = 45483527
>   getVisibleLength()= 45511557
>   getVolume()   = /opt/reflex/data/yarn/datanode/current
>   getBlockFile()= 
> 

[jira] [Comment Edited] (HDFS-10927) Lease Recovery: File not getting closed on HDFS when block write operation fails

2017-02-16 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871310#comment-15871310
 ] 

Chen Zhang edited comment on HDFS-10927 at 2/17/17 7:25 AM:


Hey guys, I just found an issue like this.
An HBase RegionServer got a DiskFull exception while writing WAL files, and the 
client failed after several retries. When the Master tried to use recoverLease 
to recover these files, we got almost the same logs as [~ngoswami] attached:
{quote}
java.io.IOException: File length mismatched.  The length of 
/home/work/ssd11/hdfs//datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
 is 174432256 but r=ReplicaBeingWritten, blk_1073970099_229357, RBW
  getNumBytes() = 174437376
  getBytesOnDisk()  = 174429752
  getVisibleLength()= 174429752
  getVolume()   = /home/work/ssd11/hdfs/x/datanode/current
  getBlockFile()= 
/home/work/ssd11/hdfs/x/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
  bytesAcked=174429752
  bytesOnDisk=174429752
{quote}

In my case, it's caused by an exception from out.write() in receivePacket() 
of BlockReceiver. 
receivePacket() first updates numBytes in the replicaInfo, then writes the data 
to disk, and updates bytesOnDisk last; a DiskFull exception between those steps 
leaves the lengths inconsistent.



was (Author: zhangchen):
Hey guys, I just found an issue like this.
HBase RegionServer got an DiskFull exception while writing WAL files, and 
client failed after several times retry. When Master trying to use recoverLease 
to recover these file, we got almost same logs as [~ngoswami] attached
{quote}
java.io.IOException: File length mismatched.  The length of 
/home/work/ssd11/hdfs//datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
 is 174432256 but r=ReplicaBeingWritten, blk_1073970099_229357, RBW
  getNumBytes() = 174437376
  getBytesOnDisk()  = 174429752
  getVisibleLength()= 174429752
  getVolume()   = /home/work/ssd11/hdfs/x/datanode/current
  getBlockFile()= 
/home/work/ssd11/hdfs/x/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099
  bytesAcked=174429752
  bytesOnDisk=174429752
{quote}

It's caused by the exception while out.write() in receivePacket() of 
BlockReceiver. 
receivePacket() first update numbytes in replicaInfo, then write data to disk, 
and update bytesOnDisk at last, the DiskFull exception makes the length not 
consistent.


> Lease Recovery: File not getting closed on HDFS when block write operation 
> fails
> 
>
> Key: HDFS-10927
> URL: https://issues.apache.org/jira/browse/HDFS-10927
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.7.1
>Reporter: Nitin Goswami
>
> HDFS was unable to close a file when a block write operation failed because 
> disk usage was too high.
> Scenario:
> HBase was writing WAL logs on HDFS and the disk usage was too high at that 
> time. While writing these WAL logs, one of the block write operations failed 
> with the following exception:
> 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Exception for 
> BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.BufferedInputStream.fill(Unknown Source)
> at java.io.BufferedInputStream.read1(Unknown Source)
> at java.io.BufferedInputStream.read(Unknown Source)
> at java.io.DataInputStream.read(Unknown Source)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
> at 
> 

[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-01-16 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824141#comment-15824141
 ] 

Chen Zhang commented on HDFS-11303:
---

Hi Andrew,

I'm a newbie to the Apache community; HDFS-11303 is the first issue I've 
submitted.
Last week, I received a mail saying you updated this issue; thanks a lot for 
your attention to it!
But the issue now seems frozen to me, and I can't perform any operation on it. 
Could you help point out the current status of this issue, and what I should 
do next?

Thanks a lot

Best
Chen




> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-01-09 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813527#comment-15813527
 ] 

Chen Zhang commented on HDFS-11303:
---

Stack, thanks for your comments. Yes, the test is just for verification: it 
hangs without the fix.

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
> Attachments: HDFS-11303-001.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-01-08 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-11303:
--
Attachment: HDFS-11303-001.patch

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
> Attachments: HDFS-11303-001.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Created] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-01-08 Thread Chen Zhang (JIRA)
Chen Zhang created HDFS-11303:
-

 Summary: Hedged read might hang infinitely if read data from all 
DN failed 
 Key: HDFS-11303
 URL: https://issues.apache.org/jira/browse/HDFS-11303
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
Affects Versions: 3.0.0-alpha1
Reporter: Chen Zhang


Hedged read reads from one DN first; if that read times out, it then reads from 
other DNs simultaneously.
If reads from all DNs fail, this bug leaves the future list non-empty (the first 
timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-08-20 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134583#comment-16134583
 ] 

Chen Zhang commented on HDFS-11303:
---

Thanks a lot, [~jzhuge], for pushing this issue forward.
I'm sorry for leaving this issue aside; I've been really busy with work these days.

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch, HDFS-11303-002.patch, HDFS-11303.003.patch, 
> HDFS-11303.004.patch, HDFS-11303.005.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-06-12 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046574#comment-16046574
 ] 

Chen Zhang commented on HDFS-11303:
---

[~jzhuge] thanks for your help, I've fixed the checkstyle error.
The failed unit tests are not related to this patch.

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch, HDFS-11303-002.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Commented] (HDFS-8693) refreshNamenodes does not support adding a new standby to a running DN

2017-06-25 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062512#comment-16062512
 ] 

Chen Zhang commented on HDFS-8693:
--

Hi [~vinayrpet] [~ajithshetty], any progress on this issue?
Support for adding a new standby is very useful when operating large clusters. 
When one of the machines running a namenode is down and we have to add another 
new standby, restarting thousands of datanodes takes a very long time. If the 
active namenode crashes during this window, the whole cluster becomes unavailable.

> refreshNamenodes does not support adding a new standby to a running DN
> --
>
> Key: HDFS-8693
> URL: https://issues.apache.org/jira/browse/HDFS-8693
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ha
>Affects Versions: 2.6.0
>Reporter: Jian Fang
>Assignee: Ajith S
>Priority: Critical
> Attachments: HDFS-8693.02.patch, HDFS-8693.1.patch
>
>
> I tried to run the following command on a Hadoop 2.6.0 cluster with HA 
> support 
> $ hdfs dfsadmin -refreshNamenodes datanode-host:port
> to refresh name nodes on data nodes after I replaced one name node with a new 
> one so that I don't need to restart the data nodes. However, I got the 
> following error:
> refreshNamenodes: HA does not currently support adding a new standby to a 
> running DN. Please do a rolling restart of DNs to reconfigure the list of NNs.
> I checked the 2.6.0 code and the error was thrown by the following code 
> snippet, which led me to this JIRA.
> void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
>   Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
>   for (BPServiceActor actor : bpServices) {
>     oldAddrs.add(actor.getNNSocketAddress());
>   }
>   Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
>   if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
>     // Keep things simple for now -- we can implement this at a later date.
>     throw new IOException("HA does not currently support adding a new standby "
>         + "to a running DN. Please do a rolling restart of DNs to reconfigure "
>         + "the list of NNs.");
>   }
> }
> Looks like the refreshNameNodes command is an incomplete feature. 
> Unfortunately, a new name node on a replacement instance is critical for 
> auto-provisioning a hadoop cluster with HDFS HA support. Without this support, 
> the HA feature cannot really be used. I also observed that the new standby 
> name node on the replacement instance could get stuck in safe mode because no 
> data nodes check in with it. Even with a rolling restart, it may take quite 
> some time to restart all data nodes if we have a big cluster, for example 
> with 4000 data nodes; besides, restarting DNs is far too intrusive and is 
> not a preferable operation in production. It also increases the chance of a 
> double failure because the standby name node is not really ready for a 
> failover in the case that the current active name node fails. 






[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-06-10 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-11303:
--
Attachment: HDFS-11303-002.patch

fix checkstyle error

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-06-10 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-11303:
--
Status: In Progress  (was: Patch Available)

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-06-12 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-11303:
--
Status: Open  (was: Patch Available)

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-06-12 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-11303:
--
Status: Patch Available  (was: Open)

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch, HDFS-11303-002.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-06-12 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-11303:
--
Attachment: HDFS-11303-002.patch

fix checkstyle issues

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch, HDFS-11303-002.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed

2017-06-10 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-11303:
--
Status: Patch Available  (was: In Progress)

Re-submit the patch to kick off test-patch again.

> Hedged read might hang infinitely if read data from all DN failed 
> --
>
> Key: HDFS-11303
> URL: https://issues.apache.org/jira/browse/HDFS-11303
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, 
> HDFS-11303-002.patch
>
>
> Hedged read reads from one DN first; if that read times out, it then reads 
> from other DNs simultaneously.
> If reads from all DNs fail, this bug leaves the future list non-empty (the 
> first timed-out request is left in the list), and the loop hangs infinitely.






[jira] [Commented] (HDFS-12936) java.lang.OutOfMemoryError: unable to create new native thread

2017-12-18 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294704#comment-16294704
 ] 

Chen Zhang commented on HDFS-12936:
---

Hey [~1028344...@qq.com], this error usually means your system limits are not 
set appropriately.
There are lots of system limits that may cause this issue, such as 
max-threads-per-process, max-open-files-per-process, etc.
This [answer on stackoverflow | 
https://stackoverflow.com/questions/34452302/how-to-increase-maximum-number-of-jvm-threads-linux-64bit]
 is a great guideline for finding out which limit is not set 
appropriately; hope it helps.
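
If it helps, here is a throwaway probe (my own sketch, not from this issue) 
that shows the same failure: it starts sleeping daemon threads until 
Thread.start() is refused by the OS, which is exactly what DataXceiverServer 
hits in the logs below. Run it only on a disposable test box, ideally with 
deliberately lowered limits (e.g. a small max-user-processes), since it 
intentionally exhausts the per-process thread quota:

{code:java}
public class ThreadCeilingProbe {
  public static void main(String[] args) {
    int started = 0;
    try {
      while (true) {
        Thread t = new Thread(() -> {
          try {
            Thread.sleep(Long.MAX_VALUE); // park forever; daemon dies on exit
          } catch (InterruptedException ignored) {
          }
        });
        t.setDaemon(true);
        t.start(); // throws OutOfMemoryError once the OS refuses more threads
        started++;
      }
    } catch (OutOfMemoryError e) {
      // "unable to create new native thread" -- the heap is fine; the
      // process simply hit a thread/process limit.
      System.out.println("threads started before failure: " + started);
    }
  }
}
{code}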

> java.lang.OutOfMemoryError: unable to create new native thread
> --
>
> Key: HDFS-12936
> URL: https://issues.apache.org/jira/browse/HDFS-12936
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0
> Environment: CDH5.12
> hadoop2.6
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> I configured max user processes to 65535 for every user, and the datanode 
> memory is 8G.
> When a lot of data was being written, the datanode was shut down.
> But I can see the memory usage is only < 1000M.
> Please see https://pan.baidu.com/s/1o7BE0cy
> *DataNode shutdown error log:*  
> {code:java}
> 2017-12-17 23:58:14,422 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> PacketResponder: 
> BP-1437036909-192.168.17.36-1509097205664:blk_1074725940_987917, 
> type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 2017-12-17 23:58:31,425 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. 
> Will retry in 30 seconds.
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:714)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154)
>   at java.lang.Thread.run(Thread.java:745)
> 2017-12-17 23:59:01,426 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. 
> Will retry in 30 seconds.
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:714)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154)
>   at java.lang.Thread.run(Thread.java:745)
> 2017-12-17 23:59:05,520 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. 
> Will retry in 30 seconds.
> java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:714)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154)
>   at java.lang.Thread.run(Thread.java:745)
> 2017-12-17 23:59:31,429 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Receiving BP-1437036909-192.168.17.36-1509097205664:blk_1074725951_987928 
> src: /192.168.17.54:40478 dest: /192.168.17.48:50010
> {code}






[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2018-07-01 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-13709:
--
Status: Patch Available  (was: Open)

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, say there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is corrupt now, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error; if the DN reports the bad block as soon as it gets an EIO, we can 
> detect this case ASAP and try to avoid data loss.
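
A hedged sketch of the proposed flow (all names here are illustrative 
stand-ins, not the actual DataNode or DatanodeProtocol API): on an I/O error 
from the local disk during a transfer, report the replica as bad instead of 
only failing the transfer, so the NN stops counting it as healthy.

{code:java}
import java.io.IOException;

public class TransferReportSketch {
  interface BadBlockReporter {          // stand-in for the NN reporting path
    void reportBadBlock(String blockId);
  }

  static void transferBlock(String blockId, BadBlockReporter reporter)
      throws IOException {
    try {
      readReplicaFromDisk(blockId);     // the transfer reads the local replica
    } catch (IOException e) {
      reporter.reportBadBlock(blockId); // proposed: report ASAP on EIO
      throw e;                          // still fail this transfer
    }
  }

  static void readReplicaFromDisk(String blockId) throws IOException {
    throw new IOException("Input/output error"); // simulated EIO from bad track
  }

  public static void main(String[] args) {
    try {
      transferBlock("blk_1", id -> System.out.println("reported bad: " + id));
    } catch (IOException expected) {
      // The transfer fails, but the NN now knows the replica is corrupt.
    }
  }
}
{code}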






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2017-12-28 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Description: 
Sometimes we need to run NNBench for scaling tests after making some 
improvements to the NameNode, so we have to deploy a new HDFS cluster and a new 
Yarn cluster.
If NNBench supported multi-cluster access, we would only need to deploy a new 
HDFS test cluster and add it to the existing YARN cluster, which would make 
scaling tests easier.

Even more, if we want to do some A-B tests, we have to run NNBench on different 
HDFS clusters, so this patch will be helpful.

  was:
Sometimes we need to run NNBench for some scaling tests after made some 
improvements on NameNode, so we have to deploy a new HDFS cluster and a new 
Yarn cluster.
If NNBench support multi-cluster access, we only need to deploy a new HDFS test 
cluster and add it to existing YARN cluster, it'll make the scaling test easier.

Even more, if we want to make some A-B test, we have to run NNBench on 
different HDFS clusters, this patch will be helpful.


> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-12967-001.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access

2017-12-28 Thread Chen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305877#comment-16305877
 ] 

Chen Zhang commented on HDFS-12967:
---

Thanks for your response and suggestions, [~jojochuang]. 

bq. what about adding a new command parameter so that this support is 
visible
I think it doesn't need a new command parameter; just using a path with a 
prefix like hdfs://some-cluster/user/foo will work, as in the sketch below. 
And it's reasonable to add some explanation to the help text.
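
A sketch of the mechanism this relies on (my own illustration with a 
hypothetical cluster name, not the patch itself): resolving the FileSystem 
from the fully qualified path instead of from fs.defaultFS is what lets the 
benchmark target a cluster other than the one the YARN job runs on.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiClusterPathSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical fully qualified base dir; "test-cluster" is illustrative.
    Path baseDir = new Path("hdfs://test-cluster/benchmarks/NNBench");
    // Resolved from the path's own scheme/authority rather than fs.defaultFS.
    FileSystem fs = baseDir.getFileSystem(new Configuration());
    System.out.println("operating on " + fs.getUri());
  }
}
{code}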

bq.  You should also consider adding tests in TestNNBench
Thanks for the reminder, I'll add a test soon.

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
> Attachments: HDFS-12967-001.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Created] (HDFS-12967) NNBench should support multi-cluster access

2017-12-27 Thread Chen Zhang (JIRA)
Chen Zhang created HDFS-12967:
-

 Summary: NNBench should support multi-cluster access
 Key: HDFS-12967
 URL: https://issues.apache.org/jira/browse/HDFS-12967
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: benchmarks
Reporter: Chen Zhang


Sometimes we need to run NNBench for scaling tests after making some 
improvements to the NameNode, so we have to deploy a new HDFS cluster and a new 
Yarn cluster.
If NNBench supported multi-cluster access, we would only need to deploy a new 
HDFS test cluster and add it to the existing YARN cluster, which would make 
scaling tests easier.

Even more, if we want to make some A-B tests, we have to run NNBench on 
different HDFS clusters, so this patch will be helpful.






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2017-12-27 Thread Chen Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Attachment: HDFS-12967-001.patch

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
> Attachments: HDFS-12967-001.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to make some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Created] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2018-06-29 Thread Chen Zhang (JIRA)
Chen Zhang created HDFS-13709:
-

 Summary: Report bad block to NN when transfer block encounter EIO 
exception
 Key: HDFS-13709
 URL: https://issues.apache.org/jira/browse/HDFS-13709
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: Chen Zhang
Assignee: Chen Zhang









[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2018-06-29 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-13709:
--
Attachment: HDFS-13709.patch

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, say there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is corrupt now, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error; if the DN reports the bad block as soon as it gets an EIO, we can 
> detect this case ASAP and try to avoid data loss.






[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2018-06-29 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-13709:
--
Description: 
In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes a 
bad disk track may cause data loss.

For example, say there are 3 replicas on 3 machines A/B/C. If a bad track occurs 
in A's replica data, and someday B and C crash at the same time, the NN will try 
to replicate the data from A but fail. The block is corrupt now, but no one 
knows, because the NN thinks there is at least 1 healthy replica and keeps 
trying to replicate it.

When reading a replica that has data on a bad track, the OS returns an EIO 
error; if the DN reports the bad block as soon as it gets an EIO, we can detect 
this case ASAP and try to avoid data loss.

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, say there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is corrupt now, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error; if the DN reports the bad block as soon as it gets an EIO, we can 
> detect this case ASAP and try to avoid data loss.






[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2018-06-29 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-13709:
--
Description: 
In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes a 
bad disk track may cause data loss.

For example, say there are 3 replicas on 3 machines A/B/C. If a bad track occurs 
in A's replica data, and someday B and C crash at the same time, the NN will try 
to replicate the data from A but fail. The block is corrupt now, but no one 
knows, because the NN thinks there is at least 1 healthy replica and keeps 
trying to replicate it.

When reading a replica that has data on a bad track, the OS returns an EIO 
error; if the DN reports the bad block as soon as it gets an EIO, we can detect 
this case ASAP and try to avoid data loss.

  was:
In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
disk bad track may cause data loss.

For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs on 
A's replica data, and someday B and C crushed at the same time, NN will try to 
replicate data from A but failed, this block is corrupt now but no one knows, 
because NN think there is at least 1 healthy replica and it keep trying to 
replicate it.

When reading a replica which hav data on bad track, OS will return an EIO 
error, if DN reports the bad block as soon as it got an EIO,  we can find this 
case ASAP and try to avoid data loss


> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, say there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is corrupt now, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error; if the DN reports the bad block as soon as it gets an EIO, we can 
> detect this case ASAP and try to avoid data loss.






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Status: Open  (was: Patch Available)

Submitted the wrong version as patch-002, which would cause the compilation 
to fail.

Uploaded patch-003.

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, 
> HDFS-12967-003.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790168#comment-16790168
 ] 

Chen Zhang commented on HDFS-12967:
---

[~jojochuang], would you please help review this patch again? I am sorry 
that it has been delayed for so long.

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Status: Patch Available  (was: Open)

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Attachment: (was: HDFS-12967-002.patch)

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Attachment: HDFS-12967-002.patch

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790147#comment-16790147
 ] 

Chen Zhang commented on HDFS-12967:
---

{quote}Did you try {{TestDFSIO}} instead. Should be the right tool for scale 
testing.
{quote}
Thanks [~shv] for your advice. We used TestDFSIO to measure cluster 
throughput and performance; I think it's more suitable for data-access 
performance testing. NNBench, as its name implies, is specially designed for 
stress testing NameNode operations.
{quote}Also NNBench has {{-baseDir}} option.
{quote}
My patch just makes some minor changes to this option so that NNBench supports 
multiple clusters.

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Attachment: HDFS-12967-003.patch

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, 
> HDFS-12967-003.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.






[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Status: Patch Available  (was: Open)

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, 
> HDFS-12967-003.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements to the NameNode, so we have to deploy a new HDFS cluster and a 
> new Yarn cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Even more, if we want to do some A-B tests, we have to run NNBench on 
> different HDFS clusters, so this patch will be helpful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Status: Patch Available  (was: Open)

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Status: In Progress  (was: Patch Available)

Added a unit test and updated the help message.

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work stopped] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-12967 stopped by Chen Zhang.
-
> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-03-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Attachment: HDFS-12967-002.patch

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-3246) pRead equivalent for direct read path

2019-02-13 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang reassigned HDFS-3246:


Assignee: Chen Zhang  (was: Henry Robinson)

> pRead equivalent for direct read path
> -
>
> Key: HDFS-3246
> URL: https://issues.apache.org/jira/browse/HDFS-3246
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client, performance
>Affects Versions: 3.0.0-alpha1
>Reporter: Henry Robinson
>Assignee: Chen Zhang
>Priority: Major
>
> There is no pread equivalent in ByteBufferReadable. We should consider adding 
> one. It would be relatively easy to implement for the distributed case 
> (certainly compared to HDFS-2834), since DFSInputStream does most of the 
> heavy lifting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-3246) pRead equivalent for direct read path

2019-02-13 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767916#comment-16767916
 ] 

Chen Zhang commented on HDFS-3246:
--

Hi [~henryr], are you still working on this issue?

This issue is very old and hasn't been updated for a long time, so I assigned 
it to myself directly; hope you don't mind.

Thanks

> pRead equivalent for direct read path
> -
>
> Key: HDFS-3246
> URL: https://issues.apache.org/jira/browse/HDFS-3246
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client, performance
>Affects Versions: 3.0.0-alpha1
>Reporter: Henry Robinson
>Assignee: Chen Zhang
>Priority: Major
>
> There is no pread equivalent in ByteBufferReadable. We should consider adding 
> one. It would be relatively easy to implement for the distributed case 
> (certainly compared to HDFS-2834), since DFSInputStream does most of the 
> heavy lifting.
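
For context, a minimal sketch of what such an API could look like; the 
interface name and method shape below are illustrative assumptions, not a 
committed design:
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical positioned-read counterpart to ByteBufferReadable.
public interface ByteBufferPositionedReadable {
  /**
   * Reads up to buf.remaining() bytes into buf starting at the given file
   * position, without moving the stream's current offset; returns the number
   * of bytes read, or -1 at EOF (mirroring PositionedReadable).
   */
  int read(long position, ByteBuffer buf) throws IOException;
}
{code}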



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access

2019-06-26 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873174#comment-16873174
 ] 

Chen Zhang commented on HDFS-12967:
---

Hi [~jojochuang], thanks for the review. I've fixed the checkstyle errors; 
could you help review the patch again? Thanks

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, 
> HDFS-12967-003.patch, HDFS-12967-004.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14576) Avoid block report retry and slow down namenode startup

2019-07-16 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886081#comment-16886081
 ] 

Chen Zhang commented on HDFS-14576:
---

Hi [~hexiaoqiao], we've met a similar problem in our production environment: 
thousands of DataNodes reporting at almost the same time usually drive the 
NameNode into full GC. Our solution was to throttle the maximum number of 
concurrent FBRs (e.g. 10): the NameNode rejects any extra FBR by throwing an 
exception, and if a DataNode receives that exception on its first FBR, it 
gracefully waits for a random period (within a given range) before retrying.
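
To illustrate the idea, here is a minimal sketch of the DataNode-side backoff; 
the class, method, and constant names are made up, and the NameNode's 
rejection is modeled as a RuntimeException for brevity (this is not our actual 
internal patch):
{code:java}
import java.util.concurrent.ThreadLocalRandom;

public class FbrBackoff {
  // Spread rejected first FBRs over, say, a 5-minute window.
  private static final long MAX_BACKOFF_MS = 5 * 60 * 1000L;

  /** Retry the first full block report with a random wait between attempts. */
  static void sendFirstFbrWithBackoff(Runnable sendReport)
      throws InterruptedException {
    while (true) {
      try {
        sendReport.run(); // NN throws if too many FBRs are already in flight
        return;
      } catch (RuntimeException tooManyReports) {
        long waitMs = ThreadLocalRandom.current().nextLong(MAX_BACKOFF_MS);
        Thread.sleep(waitMs); // random jitter avoids a retry stampede
      }
    }
  }
}
{code}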

This solution works very well for us, so I wanted to contribute the code to 
the community, but while porting the commit I found that the latest version 
already supports this; it's implemented via the block report lease, see 
HDFS-7923.

We also tried HDFS-6763 and HDFS-7097, which you mentioned, but I think the 
block report throttling strategy helps much more during NameNode restart.
{quote}but in later CDH versions several patches have been backported that made 
the initial block report problem largely disappear. Unfortunately I don't have 
the list of Jiras and their relative impact
{quote}
FYI, [~sodonnell]

> Avoid block report retry and slow down namenode startup
> ---
>
> Key: HDFS-14576
> URL: https://issues.apache.org/jira/browse/HDFS-14576
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
>
> During namenode startup, the load will be very high since it has to process 
> every datanode's block report one by one. If there are hundreds of datanode 
> block reports pending, the issue becomes even more serious, although 
> #processFirstBlockReport handles them a lot more efficiently than ordinary 
> block reports. Some datanodes will then retry their block reports, which 
> lengthens restart times. I think we should filter out block report requests 
> (arriving via datanode block report retries) that have already been 
> processed and return directly, shortening restart time. Note that this 
> proposal may only matter for large clusters.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-06-26 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Status: Patch Available  (was: In Progress)

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, 
> HDFS-12967-003.patch, HDFS-12967-004.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-06-26 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Status: In Progress  (was: Patch Available)

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, 
> HDFS-12967-003.patch, HDFS-12967-004.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access

2019-06-26 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-12967:
--
Attachment: HDFS-12967-004.patch

> NNBench should support multi-cluster access
> ---
>
> Key: HDFS-12967
> URL: https://issues.apache.org/jira/browse/HDFS-12967
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, 
> HDFS-12967-003.patch, HDFS-12967-004.patch
>
>
> Sometimes we need to run NNBench for scaling tests after making some 
> improvements on the NameNode, so we have to deploy a new HDFS cluster and a 
> new YARN cluster.
> If NNBench supported multi-cluster access, we would only need to deploy a new 
> HDFS test cluster and add it to the existing YARN cluster, which would make 
> scaling tests easier.
> Moreover, if we want to do A-B tests, we have to run NNBench against 
> different HDFS clusters, so this patch will be helpful there as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky

2019-08-12 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905212#comment-16905212
 ] 

Chen Zhang commented on HDFS-14654:
---

[~John Smith] sure; if the set operation is synchronized, then the get 
operation should also be synchronized.
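
A minimal sketch of the point, with made-up names (not the actual test code): 
the synchronized getter is what establishes the happens-before edge with the 
synchronized setter, so readers don't see stale values.
{code:java}
public class SharedState {
  private long value;

  public synchronized void setValue(long v) {
    value = v;
  }

  // Must also be synchronized (or the field made volatile) for visibility;
  // an unsynchronized getter may return a stale value on another thread.
  public synchronized long getValue() {
    return value;
  }
}
{code}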

> RBF: TestRouterRpc tests are flaky
> --
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-12 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905228#comment-16905228
 ] 

Chen Zhang commented on HDFS-14609:
---

Hi [~crh] [~tasanuma], I'm working on this Jira. I found that branch 
HDFS-13891 was rebased to a point after HDFS-14074 on trunk, which was 
committed after HADOOP-16314, so the UTs still fail even when we run them on 
branch HDFS-13891.

I've tried to rebase HDFS-13891 onto an older commit (from 12 Oct 2018), but 
there are too many conflicts.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Changes are needed appropriately in RBF, mainly fixing the broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-12 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905228#comment-16905228
 ] 

Chen Zhang edited comment on HDFS-14609 at 8/12/19 2:01 PM:


Hi [~crh] [~tasanuma], I'm working on this Jira. I found that branch 
HDFS-13891 was rebased to a point after HDFS-14074 on trunk, which was 
committed after HADOOP-16314, so the UTs still fail even when we run them on 
branch HDFS-13891.

I've tried to rebase HDFS-13891 onto an older commit (from 12 Oct 2018), but 
there are too many conflicts.


was (Author: zhangchen):
Hi [~crh] [~tasanuma], I'm working on this Jira. I found that the branch 
HDFS-13891 was rebased after HDFS-14074 on trunk, which is committed after 
HADOOP-16314, so the UT still fail even when we run it on the branch HDFS-13891.

I've tried to rebase the HDFS-13891 to some older commit(at 12 Oct 2018), but 
there is too many conflicts.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Changes are needed appropriately in RBF, mainly fixing the broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-12 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905349#comment-16905349
 ] 

Chen Zhang commented on HDFS-14609:
---

Update: I switched to branch HDFS-13891 and reverted HADOOP-16314 and 
HADOOP-16354; now TestRouterWithSecureStartup succeeds but 
TestRouterHttpDelegationToken still fails.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Changes are needed appropriately in RBF, mainly fixing the broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-31 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897276#comment-16897276
 ] 

Chen Zhang commented on HDFS-14657:
---

Thanks [~sodonnell] for your analysis. I'm working on a complete solution on 
the trunk code this week.
{quote}However that is probably solvable, either by making the iterator keyed, 
and reopening it after acquiring the lock (or if it throws 
concurrentModificationException) at the correct position,
{quote}
Yes, that is the same solution I used in the next patch. The new patch is 
almost complete and I'm working on performance testing now. FBR processing is 
much faster on trunk, so you are right; I'll update the default value to a 
larger one based on the test results.

> Refine NameSystem lock usage during processing FBR
> --
>
> Key: HDFS-14657
> URL: https://issues.apache.org/jira/browse/HDFS-14657
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14657-001.patch, HDFS-14657.002.patch
>
>
> Disks with 12TB capacity are very common today, which means the FBR size is 
> much larger than before. The Namenode holds the NameSystemLock while 
> processing the block report for each storage, which might take quite a long 
> time.
> In our production environment, processing a large FBR usually causes longer 
> RPC queue times, which impacts client latency, so we did some simple work on 
> refining the lock usage, which improved the p99 latency significantly.
> In our solution, the BlockManager releases the NameSystem write lock and 
> requests it again every 5000 blocks (by default) while processing an FBR; 
> with the fair lock, all pending RPC requests can be processed before the 
> BlockManager re-acquires the write lock.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14680) StorageInfoDefragmenter should handle exceptions gently

2019-07-31 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897201#comment-16897201
 ] 

Chen Zhang commented on HDFS-14680:
---

{quote}But We have this thing in CDH5 for a long time (at least 3 years) and 
not seeing this causing NN shutdown.

This is actually not true. It would shut down NN upon RuntimeException 
(unchecked exceptions) only.
{quote}
Agreed, so it doesn't look worth much attention. Should we just leave the 
Jira here or resolve it as "not a problem" for now?

> StorageInfoDefragmenter should handle exceptions gently
> ---
>
> Key: HDFS-14680
> URL: https://issues.apache.org/jira/browse/HDFS-14680
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Priority: Major
>
> StorageInfoDefragmenter is responsible for FoldedTreeSet compaction, but it 
> terminates the NameNode on any exception; isn't that too radical?
> I mean, even critical threads like HeartbeatManager don't terminate the 
> NameNode when they encounter exceptions, so StorageInfoDefragmenter should 
> not do so either.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14680) StorageInfoDefragmenter should handle exceptions gently

2019-07-31 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897201#comment-16897201
 ] 

Chen Zhang edited comment on HDFS-14680 at 7/31/19 1:52 PM:


{quote}But We have this thing in CDH5 for a long time (at least 3 years) and 
not seeing this causing NN shutdown.

This is actually not true. It would shut down NN upon RuntimeException 
(unchecked exceptions) only.
{quote}
Agreed, so it doesn't look worth much attention. Should we just leave the 
Jira here or resolve it as "not a problem" for now?


was (Author: zhangchen):
{quote}But We have this thing in CDH5 for a long time (at least 3 years) and 
not seeing this causing NN shutdown.

This is actually not true. It would shut down NN upon RuntimeException 
(unchecked exceptions) only.
{quote}
Agree. so it looks not worth much attention, so should we just leave the Jira 
here or resolve it as "not a problem" first?

> StorageInfoDefragmenter should handle exceptions gently
> ---
>
> Key: HDFS-14680
> URL: https://issues.apache.org/jira/browse/HDFS-14680
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Priority: Major
>
> StorageInfoDefragmenter is responsible for FoldedTreeSet compaction, but it 
> terminates the NameNode on any exception; isn't that too radical?
> I mean, even critical threads like HeartbeatManager don't terminate the 
> NameNode when they encounter exceptions, so StorageInfoDefragmenter should 
> not do so either.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14680) StorageInfoDefragmenter should handle exceptions gently

2019-07-31 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896739#comment-16896739
 ] 

Chen Zhang edited comment on HDFS-14680 at 7/31/19 1:46 PM:


No, I haven't encountered this issue myself.

I'm working on HDFS-14657, which relates to HDFS-9260. While reading the code 
for HDFS-9260, I found this design too aggressive: StorageInfoDefragmenter 
should not shut down the NameNode on any exception, because it's not a 
critical thread. I mean, it should at least retry a few times before shutting 
down the NameNode, or perhaps keep running no matter what exception happens, 
like HeartbeatManager.

We're upgrading our production cluster from 2.6 to 3.1 and I don't want this 
to happen to our NameNode, so this is just a proposal for discussion.
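
For discussion, a minimal sketch of the retry-before-terminate idea; the names 
are made up, and only ExitUtil is real Hadoop API:
{code:java}
import org.apache.hadoop.util.ExitUtil;

public class DefragmenterLoop {
  private static final int MAX_CONSECUTIVE_FAILURES = 3;

  void run(Runnable scanAndCompact) {
    int failures = 0;
    while (!Thread.currentThread().isInterrupted()) {
      try {
        scanAndCompact.run();
        failures = 0; // a healthy pass resets the counter
      } catch (Exception e) {
        if (++failures >= MAX_CONSECUTIVE_FAILURES) {
          // Only after repeated failures give up the whole NameNode.
          ExitUtil.terminate(1, e);
        }
      }
    }
  }
}
{code}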


was (Author: zhangchen):
No I don't encounter this issue.

I'm working on HDFS-14657 and it relates with HDFS-9620, when I reading code of 
HDFS-9620, I found this design is too aggressive, StorageInfoDefragmenter 
should not shutdown NameNode on any exception, because it's not a critical 
thread, I'mean it should at least retry some times before shutdown NameNode, or 
maybe it can choose keep running no matter what exception happens, like 
HeartBeatManager.

We're upgrading our production cluster from 2.6 to 3.1, I don't want this 
happen to our NameNode, so it's just a proposal for discussion.

> StorageInfoDefragmenter should handle exceptions gently
> ---
>
> Key: HDFS-14680
> URL: https://issues.apache.org/jira/browse/HDFS-14680
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Priority: Major
>
> StorageInfoDefragmenter is responsible for FoldedTreeSet compaction, but it 
> terminates the NameNode on any exception; isn't that too radical?
> I mean, even critical threads like HeartbeatManager don't terminate the 
> NameNode when they encounter exceptions, so StorageInfoDefragmenter should 
> not do so either.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR

2019-07-31 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896917#comment-16896917
 ] 

Chen Zhang edited comment on HDFS-14657 at 7/31/19 3:20 PM:


Thanks [~shv], but sorry, I can't see any problem with this change on the 2.6 
version.
{quote}I believe when you release the lock while iterating over the storage 
blocks, the iterator may find itself in an isolated chain of the list after 
reacquiring the lock
{quote}
It won't happen, because processReport doesn't iterate over the storage blocks 
in 2.6; the whole FBR procedure (for each storage) can be simplified like this:
 # Insert a delimiter at the head of this storage's block list (the triplets; 
it's actually a doubly linked list, so I'll refer to it as the block list for 
simplicity).
 # Start a loop, iterating through the block report:
 ## Get a block from the report.
 ## Use the block to look up the stored BlockInfo object in the BlockMap.
 ## Check the status of the block, and add it to the corresponding set (toAdd, 
toUc, toInvalidate, toCorrupt).
 ## Move the block to the head of the block list (which places it before the 
delimiter).
 # Start a loop iterating through the block list; the blocks found after the 
delimiter are added to the toRemove set.

My proposal in this Jira is to release and re-acquire the NN lock between 
steps 2.3 and 2.4. This solution won't affect the correctness of the block 
report procedure, for the following reasons:
 # In the end, all the reported blocks will have been moved before the 
delimiter.
 # If any other thread gets the NN lock before step 2.4 and adds some new 
blocks, they will be added at the head of the list.
 # If any other thread gets the NN lock before step 2.4 and removes some 
blocks, it won't affect the loop in step 2. (Please note that the delimiter 
can't be removed by other threads.)
 # All the blocks left after the delimiter should indeed be removed.

For the reasons described above, the problem you mentioned also won't happen:
{quote}you may remove replicas that were not supposed to be removed
{quote}
 

I agree with you that things are tricky here, but this change is quite 
simple, and I think we can still make its impact clear.
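
A minimal sketch of the lock-yield pattern itself, with made-up names (the 
real change operates on the FSNamesystem lock inside BlockManager, not a 
standalone lock):
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FbrLockYield {
  // Fair mode so waiting RPC handlers get the lock before we re-acquire it.
  private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock(true);
  private static final int BLOCKS_PER_LOCK_HOLD = 5000;

  void processReport(Iterable<Long> reportedBlockIds) {
    int processed = 0;
    nsLock.writeLock().lock();
    try {
      for (long blockId : reportedBlockIds) {
        processOneBlock(blockId);
        if (++processed % BLOCKS_PER_LOCK_HOLD == 0) {
          nsLock.writeLock().unlock(); // let queued RPCs run
          nsLock.writeLock().lock();   // then continue where we left off
        }
      }
    } finally {
      nsLock.writeLock().unlock();
    }
  }

  private void processOneBlock(long blockId) {
    // check replica state, move the block before the delimiter, etc.
  }
}
{code}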


was (Author: zhangchen):
Thanks [~shv], but sorry I can't see any problem of this change on 2.6 version.
{quote}I believe when you release the lock while iterating over the storage 
blocks, the iterator may find itself in an isolated chain of the list after 
reacquiring the lock
{quote}
It won't happen, because processReport don't iterate the storage blocks at 2.6, 
the whole FBR procedure(for each storage) can be simplified like this: 
| # Insert a delimiter into the head of block list(triplets, it's actually a 
double linked list, so I'll ref it as the block list for simplification) of 
this storage.
 # Start a loop, iterate through block report
 ## Get a block from the report
 ## Using the block to get the stored BlockInfo object from BlockMap
 ## Check the status of the block, and add the block to corresponding 
set(toAdd, toUc, toInvalidate, toCorrupt)
 ## Move the block to the head of block list(which makes the block placed 
before delimiter)
 # Start a loop to iterate through block list, find the blocks after delimiter, 
add them to toRemove set.|

My proposal in this Jira is to release and re-acquire NN lock between 2.3 and 
2.4. This solution won't affect the correctness of block report procedure for 
the following reasons:
 # At last, all the reported block will be moved before delimiter.
 # If any other thread acquire the NN lock before 2.4 add adds some new blocks, 
they will be added in the head of list.
 # If any other thread acquire the NN lock before 2.4 and removes some blocks, 
it won't affect the loop at 2nd step. (Pls notice that the delimiter can't be 
remove by other threads)
 # All the blocks after delimiter should be removed

According to the reasons described above, the following problem you mentioned 
also won't happen:
{quote}you may remove replicas that were not supposed to be removed
{quote}
 

I agree with you that the  things are tricky here, but this change is quite 
simple and I think we still can make clear the impaction.

> Refine NameSystem lock usage during processing FBR
> --
>
> Key: HDFS-14657
> URL: https://issues.apache.org/jira/browse/HDFS-14657
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14657-001.patch, HDFS-14657.002.patch
>
>
> Disks with 12TB capacity are very common today, which means the FBR size is 
> much larger than before. The Namenode holds the NameSystemLock while 
> processing the block report for each storage, which might take quite a long 
> time.
> In our production environment, processing a large FBR usually causes longer 
> RPC queue times, which impacts client latency.

[jira] [Comment Edited] (HDFS-14677) TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in trunk

2019-07-29 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895746#comment-16895746
 ] 

Chen Zhang edited comment on HDFS-14677 at 7/30/19 3:56 AM:


Thanks [~ayushtkn] [~elgoiri] for your comments.

I've run the test 1000 times without the patch; it failed 3 times, and all 3 
failures were caused by a NullPointerException.

With the patch, it didn't fail.


was (Author: zhangchen):
Thanks [~ayushtkn] [~elgoiri] for your comments

I've run the test repeatedly for 1000 times without the patch, it failed 3 
times. After patch, it didn't fail.

> TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in 
> trunk
> ---
>
> Key: HDFS-14677
> URL: https://issues.apache.org/jira/browse/HDFS-14677
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14677.001.patch
>
>
> Stacktrace:
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testAddVolumesConcurrently(TestDataNodeHotSwapVolumes.java:615)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> see: 
> [https://builds.apache.org/job/PreCommit-HDFS-Build/27328/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/]
> and 
> [https://builds.apache.org/job/PreCommit-HDFS-Build/27312/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14677) TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in trunk

2019-07-29 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895746#comment-16895746
 ] 

Chen Zhang commented on HDFS-14677:
---

Thanks [~ayushtkn] [~elgoiri] for your comments.

I've run the test 1000 times without the patch; it failed 3 times. With the 
patch, it didn't fail.

> TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in 
> trunk
> ---
>
> Key: HDFS-14677
> URL: https://issues.apache.org/jira/browse/HDFS-14677
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14677.001.patch
>
>
> Stacktrace:
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testAddVolumesConcurrently(TestDataNodeHotSwapVolumes.java:615)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> see: 
> [https://builds.apache.org/job/PreCommit-HDFS-Build/27328/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/]
> and 
> [https://builds.apache.org/job/PreCommit-HDFS-Build/27312/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-12 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904918#comment-16904918
 ] 

Chen Zhang commented on HDFS-13709:
---

Thanks [~jojochuang] for mentioning me at HDFS-14706.
This Jira and HDFS-14706 both introduce reportBadBlock in different places; I 
agree with you that we need to reuse the bad-block handling logic.

I've added a method {{handleBadBlock}} in DataNode to handle bad blocks, using 
the following logic (a sketch follows the list):
 # If it's called by the scanner, reportBadBlock to the NN unconditionally.
 # If the exception comes from somewhere else (e.g. BlockSender), first 
determine from the exception type whether it indicates a bad block. If it 
does, markSuspectBlock when the blockScanner is enabled, or report to the NN 
when the scanner is disabled.
 # I left some scanner-specific logic in the 
{{VolumeScanner#ScanResultHandler.handle()}} method; I think it only applies 
to the scanner, not to every situation.
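
A minimal sketch of that dispatch, with simplified stand-in types (not the 
actual patch code):
{code:java}
import java.io.IOException;

public class BadBlockHandling {
  /** Stand-ins for the DataNode facilities the real method uses. */
  interface Actions {
    void reportBadBlockToNN() throws IOException;
    void markSuspectBlock();
    boolean isDiskError(Exception e); // e.g. an EIO surfacing as IOException
    boolean scannerEnabled();
  }

  static void handleBadBlock(Actions dn, Exception cause, boolean fromScanner)
      throws IOException {
    if (fromScanner) {
      dn.reportBadBlockToNN(); // scanner already verified it: always report
      return;
    }
    if (!dn.isDiskError(cause)) {
      return; // not a media error, nothing to report
    }
    if (dn.scannerEnabled()) {
      dn.markSuspectBlock(); // let the scanner confirm first
    } else {
      dn.reportBadBlockToNN(); // no scanner available: report directly
    }
  }
}
{code}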

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-13709.002.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is now corrupt, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS will return an 
> EIO error; if the DN reports the bad block as soon as it gets an EIO, we can 
> find this case ASAP and try to avoid data loss.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-12 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-13709:
--
Attachment: HDFS-13709.002.patch

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-13709.002.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is now corrupt, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS will return an 
> EIO error; if the DN reports the bad block as soon as it gets an EIO, we can 
> find this case ASAP and try to avoid data loss.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14735) File could only be replicated to 0 nodes instead of minReplication (=1)

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907776#comment-16907776
 ] 

Chen Zhang commented on HDFS-14735:
---

How much space is left in your cluster? You can enable debug logging to check 
why the allocation failed.
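
For example, the NameNode's replica placement decisions can be surfaced by 
raising this logger to DEBUG (a standard Hadoop logger name; adjust to your 
own log4j setup):
{code}
# log4j.properties on the NameNode -- logs why each datanode was rejected
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy=DEBUG
{code}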

> File could only be replicated to 0 nodes instead of minReplication (=1)
> ---
>
> Key: HDFS-14735
> URL: https://issues.apache.org/jira/browse/HDFS-14735
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tatyana Alexeyev
>Priority: Major
>
> Hello, I have an intermittent error when running my EMR Hadoop cluster:
> "Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /user/sphdadm/_sqoop/00501bd7b05e4182b5006b9d51 
> bafb7f_f405b2f3/_temporary/1/_temporary/attempt_1565136887564_20057_m_00_0/part-m-0.snappy
>  could only be replicated to 0 nodes instead of minReplication (=1). There 
> are 5 datanode(s) running and no node(s) are excluded in this operation."
> I am running Hadoop version 
> sphdadm@ip-10-6-15-108 hadoop]$ hadoop version
> Hadoop 2.8.5-amzn-4
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-14 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: HDFS-14609.002.patch

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Changes are needed appropriately in RBF, mainly fixing the broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907754#comment-16907754
 ] 

Chen Zhang commented on HDFS-14609:
---

Uploaded patch v2 to fix the checkstyle and whitespace errors.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Changes are needed appropriately in RBF, mainly fixing the broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907779#comment-16907779
 ] 

Chen Zhang commented on HDFS-13709:
---

Uploaded patch v4 to fix the checkstyle and asflicense errors; also fixed a 
failing UT.

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, 
> HDFS-13709.004.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is now corrupt, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS will return an 
> EIO error; if the DN reports the bad block as soon as it gets an EIO, we can 
> find this case ASAP and try to avoid data loss.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-14 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-13709:
--
Attachment: HDFS-13709.004.patch

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, 
> HDFS-13709.004.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the 
> NN will try to replicate the data from A but fail. The block is now corrupt, 
> but no one knows, because the NN thinks there is at least 1 healthy replica 
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS will return an 
> EIO error; if the DN reports the bad block as soon as it gets an EIO, we can 
> find this case ASAP and try to avoid data loss.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14654) RBF: TestRouterRpc tests are flaky

2019-08-14 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14654:
--
Attachment: HDFS-14654.004.patch

> RBF: TestRouterRpc tests are flaky
> --
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907796#comment-16907796
 ] 

Chen Zhang commented on HDFS-14654:
---

Uploaded patch v4 to fix the checkstyle error.

> RBF: TestRouterRpc tests are flaky
> --
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-16 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909230#comment-16909230
 ] 

Chen Zhang commented on HDFS-14609:
---

Hi [~crh], do you have time to help review the patch? Also cc [~aagrawal] and 
[~elgoiri]

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router-based federation security as part of HDFS-13532. We kept 
> it compatible with the way the namenode works. However, with HADOOP-16314 and 
> HADOOP-16354 in trunk, the auth filters seem to have been changed, causing 
> tests to fail.
> Changes are needed appropriately in RBF, mainly fixing the broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14742) RBF:TestRouterFaultTolerant tests are flaky

2019-08-16 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang reassigned HDFS-14742:
-

Assignee: Chen Zhang

> RBF:TestRouterFaultTolerant tests are flaky
> ---
>
> Key: HDFS-14742
> URL: https://issues.apache.org/jira/browse/HDFS-14742
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> [https://builds.apache.org/job/PreCommit-HDFS-Build/27516/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt]
> {code:java}
> [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.665 
> s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant
> [ERROR] 
> testWriteWithFailedSubcluster(org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant)
>   Time elapsed: 3.516 s  <<< FAILURE!
> java.lang.AssertionError: 
> Failed to run "Full tests": 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Cannot find 
> locations for /HASH_ALL-failsubcluster, because the default nameservice is 
> disabled to read or write
>   at 
> org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.lookupLocation(MountTableResolver.java:425)
>   at 
> org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:391)
>   at 
> org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:388)
>   at 
> com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4876)
>   at 
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3528)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2277)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3952)
>   at 
> com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871)
>   at 
> org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.getDestinationForPath(MountTableResolver.java:394)
>   at 
> org.apache.hadoop.hdfs.server.federation.resolver.MultipleDestinationMountTableResolver.getDestinationForPath(MultipleDestinationMountTableResolver.java:87)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getLocationsForPath(RouterRpcServer.java:1498)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getListing(RouterClientProtocol.java:734)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getListing(RouterRpcServer.java:827)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:732)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)
>   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1499)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1396)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>   at com.sun.proxy.$Proxy35.getListing(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>   at 
> 

[jira] [Commented] (HDFS-14706) Checksums are not checked if block meta file is less than 7 bytes

2019-08-16 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909207#comment-16909207
 ] 

Chen Zhang commented on HDFS-14706:
---

Thanks [~sodonnell]. [~jojochuang], what's your opinion?
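
For context, here is a minimal, self-contained sketch of the guard being 
discussed (hypothetical class and constant names, not the actual BlockSender 
code): fail fast when the meta file is shorter than the 7-byte header, instead 
of silently falling back to a NULL checksum type.

{code:java}
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class MetaHeaderCheck {
  // 2-byte version + 1-byte checksum type + 4-byte bytesPerChecksum.
  static final int HEADER_LEN = 7;

  public static void main(String[] args) throws IOException {
    File meta = new File(args[0]);
    if (meta.length() < HEADER_LEN) {
      // Proposed behavior: treat the replica as corrupt instead of
      // returning a NULL checksum type to the client.
      throw new IOException("Corrupt meta file " + meta + ": length "
          + meta.length() + " is shorter than the " + HEADER_LEN
          + "-byte header");
    }
    try (DataInputStream in = new DataInputStream(new FileInputStream(meta))) {
      short version = in.readShort();             // 2 bytes
      int checksumTypeId = in.readUnsignedByte(); // 1 byte
      int bytesPerChecksum = in.readInt();        // 4 bytes
      System.out.println("version=" + version + " type=" + checksumTypeId
          + " bytesPerChecksum=" + bytesPerChecksum);
    }
  }
}
{code}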

> Checksums are not checked if block meta file is less than 7 bytes
> -
>
> Key: HDFS-14706
> URL: https://issues.apache.org/jira/browse/HDFS-14706
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: HDFS-14706.001.patch, HDFS-14706.002.patch
>
>
> If a block and its meta file are corrupted in a certain way, the corruption 
> can go unnoticed by a client, causing it to return invalid data.
> The meta file is expected to always have a header of 7 bytes and then a 
> series of checksums depending on the length of the block.
> If the meta file gets corrupted such that it is between zero and 7 bytes in 
> length, then the header is incomplete. In BlockSender.java the logic checks 
> whether the meta file length is at least the size of the header, and if it 
> is not, it does not error out but instead returns a NULL checksum type to 
> the client.
> https://github.com/apache/hadoop/blob/b77761b0e37703beb2c033029e4c0d5ad1dce794/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java#L327-L357
> If the client receives a NULL checksum type, it will not validate checksums 
> at all, and even corrupted data will be returned to the reader. This means 
> the corruption will go unnoticed and HDFS will never repair it. Even the 
> Volume Scanner will not notice the corruption, as the checksums are silently 
> ignored.
> Additionally, if the meta file does have enough bytes that it attempts to 
> load the header, and the header is corrupted such that it is not valid, it 
> can cause the datanode Volume Scanner to exit, with an exception like the 
> following:
> {code}
> 2019-08-06 18:16:39,151 ERROR datanode.VolumeScanner: 
> VolumeScanner(/tmp/hadoop-sodonnell/dfs/data, 
> DS-7f103313-61ba-4d37-b63d-e8cf7d2ed5f7) exiting because of exception 
> java.lang.IllegalArgumentException: id=51 out of range [0, 5)
>   at 
> org.apache.hadoop.util.DataChecksum$Type.valueOf(DataChecksum.java:76)
>   at 
> org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:167)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:173)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:139)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:153)
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.loadLastPartialChunkChecksum(FsVolumeImpl.java:1140)
>   at 
> org.apache.hadoop.hdfs.server.datanode.FinalizedReplica.loadLastPartialChunkChecksum(FinalizedReplica.java:157)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getPartialChunkChecksumForFinalized(BlockSender.java:451)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:266)
>   at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:446)
>   at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558)
>   at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
> 2019-08-06 18:16:39,152 INFO datanode.VolumeScanner: 
> VolumeScanner(/tmp/hadoop-sodonnell/dfs/data, 
> DS-7f103313-61ba-4d37-b63d-e8cf7d2ed5f7) exiting.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky

2019-08-16 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909222#comment-16909222
 ] 

Chen Zhang commented on HDFS-14654:
---

[~elgoiri] this failure seems unrelated to this patch.

It also fails on another Jira: 
https://issues.apache.org/jira/browse/HDFS-14728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=16909046#comment-16909046

The test passes locally, so we may have found another flaky test; I'll track 
it in a separate Jira. Thanks.

> RBF: TestRouterRpc tests are flaky
> --
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, 
> HDFS-14654.003.patch, HDFS-14654.004.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14742) RBF:TestRouterFaultTolerant tests are flaky

2019-08-16 Thread Chen Zhang (JIRA)
Chen Zhang created HDFS-14742:
-

 Summary: RBF:TestRouterFaultTolerant tests are flaky
 Key: HDFS-14742
 URL: https://issues.apache.org/jira/browse/HDFS-14742
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Chen Zhang


[https://builds.apache.org/job/PreCommit-HDFS-Build/27516/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt]
{code:java}
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.665 s 
<<< FAILURE! - in 
org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant
[ERROR] 
testWriteWithFailedSubcluster(org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant)
  Time elapsed: 3.516 s  <<< FAILURE!
java.lang.AssertionError: 
Failed to run "Full tests": 
org.apache.hadoop.ipc.RemoteException(java.io.IOException): Cannot find 
locations for /HASH_ALL-failsubcluster, because the default nameservice is 
disabled to read or write
at 
org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.lookupLocation(MountTableResolver.java:425)
at 
org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:391)
at 
org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:388)
at 
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4876)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3528)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2277)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
at com.google.common.cache.LocalCache.get(LocalCache.java:3952)
at 
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871)
at 
org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.getDestinationForPath(MountTableResolver.java:394)
at 
org.apache.hadoop.hdfs.server.federation.resolver.MultipleDestinationMountTableResolver.getDestinationForPath(MultipleDestinationMountTableResolver.java:87)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getLocationsForPath(RouterRpcServer.java:1498)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getListing(RouterClientProtocol.java:734)
at 
org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getListing(RouterRpcServer.java:827)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:732)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
at org.apache.hadoop.ipc.Client.call(Client.java:1499)
at org.apache.hadoop.ipc.Client.call(Client.java:1396)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy35.getListing(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 

[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-12 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905724#comment-16905724
 ] 

Chen Zhang commented on HDFS-14609:
---

[~crh] Sure, I'll work on trunk to fix these tests.

I was wondering why both Eric and Takanobu saw these tests fail after reverting 
or switching to branch HDFS-13891, so I did some digging.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14728) RBF:GetDatanodeReport causes a large GC pressure on the NameNodes

2019-08-13 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906049#comment-16906049
 ] 

Chen Zhang commented on HDFS-14728:
---

NamenodeBeanMetrics already implements a dnCache that caches the DataNode 
report from the NameNode; I think we can reuse it for the GetDatanodeReport RPC.
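
As a rough sketch of the idea (hypothetical types and expiry; the real change 
would reuse the dnCache wiring in NamenodeBeanMetrics), a Guava loading cache 
lets concurrent callers share one report per expiry window:

{code:java}
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class DatanodeReportCache {
  /** Hypothetical stand-in for the DatanodeInfo[] report. */
  static class Report {
    final long fetchedAtMs = System.currentTimeMillis();
  }

  private final LoadingCache<String, Report> dnCache = CacheBuilder.newBuilder()
      .expireAfterWrite(10, TimeUnit.SECONDS) // bound staleness; tune as needed
      .build(new CacheLoader<String, Report>() {
        @Override
        public Report load(String nsId) {
          // One expensive getDatanodeReport RPC per nameservice per window.
          return fetchReportFromNamenode(nsId);
        }
      });

  public Report getDatanodeReport(String nsId) throws ExecutionException {
    return dnCache.get(nsId); // concurrent callers share one cached result
  }

  private Report fetchReportFromNamenode(String nsId) {
    return new Report(); // placeholder for the real NameNode RPC
  }
}
{code}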

> RBF:GetDatanodeReport causes a large GC pressure on the NameNodes
> -
>
> Key: HDFS-14728
> URL: https://issues.apache.org/jira/browse/HDFS-14728
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: xuzq
>Priority: Major
>
> When a cluster contains millions of DNs, *GetDatanodeReport* is pretty 
> expensive, and it causes large GC pressure on the NameNode.
> When multiple NSs share those millions of DNs through federation and the 
> router listens to the NSs, the problem becomes more serious: all the NSs 
> will GC at the same time.
> RBF should cache the datanode report information and have an option to 
> disable the cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14654) RBF: TestRouterRpc tests are flaky

2019-08-11 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14654:
--
Attachment: HDFS-14654.002.patch

> RBF: TestRouterRpc tests are flaky
> --
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14711) RBF: RBFMetrics throws NullPointerException if stateStore disabled

2019-08-11 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904681#comment-16904681
 ] 

Chen Zhang commented on HDFS-14711:
---

Hi [~ayushtkn], I agree with you, we need to add some NULL checks. So this Jira 
looks like a duplicate of HDFS-14656.
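
As a minimal sketch of the NULL check (hypothetical names, not the RBFMetrics 
code), the metrics getters should degrade gracefully instead of letting an NPE 
escape into the JMX servlet:

{code:java}
public class MetricsNullGuard {
  private final Object membershipStore; // null when the state store is disabled

  public MetricsNullGuard(Object membershipStore) {
    this.membershipStore = membershipStore;
  }

  public long getFilesTotal() {
    if (membershipStore == null) {
      return 0; // neutral value; the bean stays queryable without a state store
    }
    return queryFilesTotal();
  }

  private long queryFilesTotal() {
    return 0; // placeholder for the real state-store query
  }
}
{code}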

> RBF: RBFMetrics throws NullPointerException if stateStore disabled
> --
>
> Key: HDFS-14711
> URL: https://issues.apache.org/jira/browse/HDFS-14711
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14711.001.patch
>
>
> In the current implementation, if {{stateStore}} initialization fails, only an 
> error message is logged. Actually, RBFMetrics can't work normally in that state.
> {code:java}
> 2019-08-08 22:43:58,024 [qtp812446698-28] ERROR jmx.JMXJsonServlet 
> (JMXJsonServlet.java:writeAttribute(345)) - getting attribute FilesTotal of 
> Hadoop:service=NameNode,name=FSNamesystem-2 threw an exception
> javax.management.RuntimeMBeanException: java.lang.NullPointerException
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852)
> at 
> com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651)
> at 
> com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
> at 
> org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338)
> at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:316)
> at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:210)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at 
> org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilter.doFilter(ProxyUserAuthenticationFilter.java:104)
> at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:51)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:539)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> 

[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky

2019-08-11 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904680#comment-16904680
 ] 

Chen Zhang commented on HDFS-14654:
---

Thanks [~elgoiri] for your review. I've added some explanatory javadocs in 
patch v2; not sure if that's enough, please help review it again.

> RBF: TestRouterRpc tests are flaky
> --
>
> Key: HDFS-14654
> URL: https://issues.apache.org/jira/browse/HDFS-14654
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, error.log
>
>
> They sometimes pass and sometimes fail.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14714) RBF: implement getReplicatedBlockStats interface

2019-08-19 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14714:
--
Status: Patch Available  (was: Open)

> RBF: implement getReplicatedBlockStats interface
> 
>
> Key: HDFS-14714
> URL: https://issues.apache.org/jira/browse/HDFS-14714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14714.001.patch
>
>
> It's not implemented now; we sometimes need this interface for cluster monitoring.
> {code:java}
> // current implementation
> public ReplicatedBlockStats getReplicatedBlockStats() throws IOException {
>   rpcServer.checkOperation(NameNode.OperationCategory.READ, false);
>   return null;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14714) RBF: implement getReplicatedBlockStats interface

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910546#comment-16910546
 ] 

Chen Zhang commented on HDFS-14714:
---

Thanks [~ayushtkn] for your suggestion; uploaded patch v1.
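
For reviewers skimming the thread, here is one natural shape for this on the 
Router side (a sketch with hypothetical names, not necessarily what the patch 
does): fan the call out to every nameservice concurrently, then sum the 
counters.

{code:java}
import java.util.Collection;

public class BlockStatsAggregation {
  /** Hypothetical stand-in for ReplicatedBlockStats counters. */
  static class Stats {
    long lowRedundancyBlocks;
    long corruptBlocks;
    long missingBlocks;
  }

  // Stands in for the per-namespace results an invokeConcurrent-style
  // helper would return, one Stats per downstream nameservice.
  static Stats merge(Collection<Stats> perNamespace) {
    Stats total = new Stats();
    for (Stats s : perNamespace) {
      total.lowRedundancyBlocks += s.lowRedundancyBlocks;
      total.corruptBlocks += s.corruptBlocks;
      total.missingBlocks += s.missingBlocks;
    }
    return total; // what the Router would hand back to the client
  }
}
{code}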

> RBF: implement getReplicatedBlockStats interface
> 
>
> Key: HDFS-14714
> URL: https://issues.apache.org/jira/browse/HDFS-14714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14714.001.patch
>
>
> It's not implemented now; we sometimes need this interface for cluster monitoring.
> {code:java}
> // current implementation
> public ReplicatedBlockStats getReplicatedBlockStats() throws IOException {
>   rpcServer.checkOperation(NameNode.OperationCategory.READ, false);
>   return null;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14714) RBF: implement getReplicatedBlockStats interface

2019-08-19 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14714:
--
Attachment: HDFS-14714.001.patch

> RBF: implement getReplicatedBlockStats interface
> 
>
> Key: HDFS-14714
> URL: https://issues.apache.org/jira/browse/HDFS-14714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14714.001.patch
>
>
> It's not implemented now; we sometimes need this interface for cluster monitoring.
> {code:java}
> // current implementation
> public ReplicatedBlockStats getReplicatedBlockStats() throws IOException {
>   rpcServer.checkOperation(NameNode.OperationCategory.READ, false);
>   return null;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14752) backport HDFS-13709 to branch-2

2019-08-19 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14752:
--
Status: Patch Available  (was: Open)

uploaded the patch v1

> backport HDFS-13709 to branch-2
> ---
>
> Key: HDFS-14752
> URL: https://issues.apache.org/jira/browse/HDFS-14752
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14752.branch-2.001.patch
>
>
> backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO 
> exception) to branch-2



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910906#comment-16910906
 ] 

Chen Zhang commented on HDFS-13709:
---

Thanks [~jojochuang] for reviewing this patch and merging it.

I'll provide a branch-2 patch later. Btw, I have a few questions about this:
 # In which cases do we need to backport a patch to branch-2? Usually bugfixes 
and some critical improvements?
 # Some people open a new Jira for the branch-2 backport, while others upload a 
new patch in the same Jira; which is the better practice?
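
For anyone skimming the thread, the core idea of this Jira as a minimal sketch 
(hypothetical interfaces, not the committed DataNode code): if a block transfer 
fails with an EIO-style read error, report the replica to the NN as bad right 
away instead of only logging the failure.

{code:java}
import java.io.IOException;

public class TransferBadBlockSketch {
  interface NamenodeClient {
    void reportBadBlock(String blockId);
  }

  static void transferBlock(String blockId, NamenodeClient nn) {
    try {
      readReplicaFromDisk(blockId);
    } catch (IOException e) {
      // A bad disk track surfaces as EIO -> IOException; telling the NN now
      // lets it re-replicate from the remaining healthy copies.
      nn.reportBadBlock(blockId);
    }
  }

  static void readReplicaFromDisk(String blockId) throws IOException {
    throw new IOException("Input/output error"); // simulate EIO
  }
}
{code}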

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, 
> HDFS-13709.004.patch, HDFS-13709.005.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the NN 
> will try to replicate the data from A but fail. The block is now corrupt, but 
> no one knows, because the NN thinks there is at least 1 healthy replica and 
> keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error. If the DN reports the bad block as soon as it gets an EIO, we can catch 
> this case ASAP and try to avoid data loss.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911008#comment-16911008
 ] 

Chen Zhang commented on HDFS-13709:
---

Created a new Jira HDFS-14752 to track the branch-2 backport

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, 
> HDFS-13709.004.patch, HDFS-13709.005.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs in A's replica data, and someday B and C crash at the same time, the NN 
> will try to replicate the data from A but fail. The block is now corrupt, but 
> no one knows, because the NN thinks there is at least 1 healthy replica and 
> keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error. If the DN reports the bad block as soon as it gets an EIO, we can catch 
> this case ASAP and try to avoid data loss.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14752) backport HDFS-13709 to branch-2

2019-08-19 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14752:
--
Attachment: HDFS-14752.branch-2.001.patch

> backport HDFS-13709 to branch-2
> ---
>
> Key: HDFS-14752
> URL: https://issues.apache.org/jira/browse/HDFS-14752
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14752.branch-2.001.patch
>
>
> backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO 
> exception) to branch-2



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14735) File could only be replicated to 0 nodes instead of minReplication (=1)

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910897#comment-16910897
 ] 

Chen Zhang commented on HDFS-14735:
---

Sorry for the delayed response, [~talexey]. You can grep for the keyword "is not 
chosen" in the NameNode's log; it will tell you why the nodes couldn't be 
allocated.

> File could only be replicated to 0 nodes instead of minReplication (=1)
> ---
>
> Key: HDFS-14735
> URL: https://issues.apache.org/jira/browse/HDFS-14735
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tatyana Alexeyev
>Priority: Major
>
> Hello, I have an intermittent error when running my EMR Hadoop cluster:
> "Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
> /user/sphdadm/_sqoop/00501bd7b05e4182b5006b9d51 
> bafb7f_f405b2f3/_temporary/1/_temporary/attempt_1565136887564_20057_m_00_0/part-m-0.snappy
>  could only be replicated to 0 nodes instead of minReplication (=1). There 
> are 5 datanode(s) running and no node(s) are excluded in this operation."
> I am running Hadoop version 
> sphdadm@ip-10-6-15-108 hadoop]$ hadoop version
> Hadoop 2.8.5-amzn-4
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14752) backport HDFS-13709 to branch-2

2019-08-19 Thread Chen Zhang (Jira)
Chen Zhang created HDFS-14752:
-

 Summary: backport HDFS-13709 to branch-2
 Key: HDFS-14752
 URL: https://issues.apache.org/jira/browse/HDFS-14752
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chen Zhang


backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO 
exception) to branch-2



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14752) backport HDFS-13709 to branch-2

2019-08-19 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang reassigned HDFS-14752:
-

Assignee: Chen Zhang

> backport HDFS-13709 to branch-2
> ---
>
> Key: HDFS-14752
> URL: https://issues.apache.org/jira/browse/HDFS-14752
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO 
> exception) to branch-2



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910922#comment-16910922
 ] 

Chen Zhang commented on HDFS-14609:
---

Thanks [~crh] for your comments.
{quote}It's not clear why hadoop-hdfs changes are needed in this context.
{quote}
HDFS-14434 ignores the user.name query parameter in secure WebHDFS, but the 
PseudoAuthenticationHandler can still leverage this parameter to pass the 
Kerberos authentication, as you mentioned before:
{quote} we may need to modify the test to inject an appropriate no auth filter 
and bypass auth to maintain the rationale behind the test
{quote}
If we want to bypass the Kerberos authentication, we have to use the user.name 
parameter, and now the only way to do that is to send the request through a URL 
directly instead of through {{WebHdfsFileSystem}}. So we have to do some work 
to process the request and response. I wanted to reuse the logic in 
{{WebHdfsFileSystem}}, but many of its interfaces can't be accessed from 
outside the package, so I had to expose them through {{WebHdfsTestUtil}}; 
that's why we need to modify the hadoop-hdfs project.

Does that make sense?
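
To illustrate, here is a minimal, self-contained sketch (hypothetical host, 
port, and path; not the test code itself) of sending a WebHDFS request over a 
raw http connection with the user.name parameter:

{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawWebHdfsRequest {
  public static void main(String[] args) throws Exception {
    // Hypothetical router endpoint; user.name is what the
    // PseudoAuthenticationHandler picks up.
    URL url = new URL("http://localhost:50071/webhdfs/v1/"
        + "?op=GETDELEGATIONTOKEN&user.name=router");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    int status = conn.getResponseCode();
    System.out.println("HTTP status: " + status); // assert 200, or 403 after cancel
    InputStream body = status < 400 ? conn.getInputStream() : conn.getErrorStream();
    if (body != null) {
      try (BufferedReader r = new BufferedReader(new InputStreamReader(body))) {
        for (String line; (line = r.readLine()) != null; ) {
          System.out.println(line); // JSON payload carrying the token
        }
      }
    }
  }
}
{code}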

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14714) RBF: implement getReplicatedBlockStats interface

2019-08-19 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14714:
--
Attachment: HDFS-14714.002.patch

> RBF: implement getReplicatedBlockStats interface
> 
>
> Key: HDFS-14714
> URL: https://issues.apache.org/jira/browse/HDFS-14714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14714.001.patch, HDFS-14714.002.patch
>
>
> It's not implemented now; we sometimes need this interface for cluster monitoring.
> {code:java}
> // current implementation
> public ReplicatedBlockStats getReplicatedBlockStats() throws IOException {
>   rpcServer.checkOperation(NameNode.OperationCategory.READ, false);
>   return null;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14714) RBF: implement getReplicatedBlockStats interface

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911018#comment-16911018
 ] 

Chen Zhang commented on HDFS-14714:
---

uploaded patch v2 to fix the checkstyle and whitespace errors

> RBF: implement getReplicatedBlockStats interface
> 
>
> Key: HDFS-14714
> URL: https://issues.apache.org/jira/browse/HDFS-14714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14714.001.patch, HDFS-14714.002.patch
>
>
> It's not implemented now; we sometimes need this interface for cluster monitoring.
> {code:java}
> // current implementation
> public ReplicatedBlockStats getReplicatedBlockStats() throws IOException {
>   rpcServer.checkOperation(NameNode.OperationCategory.READ, false);
>   return null;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-08-20 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Summary: Remove unused configuration dfs.web.authentication.filter   (was: 
Deprecate configuration dfs.web.authentication.filter )

> Remove unused configuration dfs.web.authentication.filter 
> --
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to 
> deprecate it to avoid misuse.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-08-20 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Attachment: HDFS-14730.001.patch

> Remove unused configuration dfs.web.authentication.filter 
> --
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14730.001.patch
>
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to 
> deprecate it to avoid misuse.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-08-20 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Status: Patch Available  (was: Open)

uploaded patch v1

> Remove unused configuration dfs.web.authentication.filter 
> --
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14730.001.patch
>
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to 
> deprecate it to avoid misuse.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-08-20 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911027#comment-16911027
 ] 

Chen Zhang commented on HDFS-14730:
---

{{TestRouterHttpDelegationToken}} still uses this configuration, but it 
actually has no effect. HDFS-14609 tries to fix the failing 
{{TestRouterHttpDelegationToken}} and stops using this configuration, so this 
Jira should be committed after HDFS-14609.

> Remove unused configuration dfs.web.authentication.filter 
> --
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14730.001.patch
>
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to 
> deprecate it to avoid misuse.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-20 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Status: Patch Available  (was: Open)

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-20 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Status: Open  (was: Patch Available)

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter

2019-08-20 Thread Chen Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14730:
--
Status: Open  (was: Patch Available)

> Remove unused configuration dfs.web.authentication.filter 
> --
>
> Key: HDFS-14730
> URL: https://issues.apache.org/jira/browse/HDFS-14730
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14730.001.patch
>
>
> After HADOOP-16314, this configuration is not used anywhere, so I propose to 
> deprecate it to avoid misuse.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14714) RBF: implement getReplicatedBlockStats interface

2019-08-20 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911108#comment-16911108
 ] 

Chen Zhang commented on HDFS-14714:
---

Hi [~ayushtkn], could you help review the patch? Thanks.

> RBF: implement getReplicatedBlockStats interface
> 
>
> Key: HDFS-14714
> URL: https://issues.apache.org/jira/browse/HDFS-14714
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14714.001.patch, HDFS-14714.002.patch
>
>
> It's not implemented now; we sometimes need this interface for cluster monitoring.
> {code:java}
> // current implementation
> public ReplicatedBlockStats getReplicatedBlockStats() throws IOException {
>   rpcServer.checkOperation(NameNode.OperationCategory.READ, false);
>   return null;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907486#comment-16907486
 ] 

Chen Zhang commented on HDFS-14609:
---

Thanks [~tasanuma] for providing the old revision of HDFS-13891, it's very 
helpful.

I've fixed these 2 tests, here are some details:
h3. TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal

HADOOP-16314 and HADOOP-16354 made some changes which break the test:
 # They added an AuthFilterInitializer, which uses 
{{hadoop.http.authentication.kerberos.*}} instead of 
{{dfs.web.authentication.kerberos.*}} to initialize Kerberos
 # {{hadoop.http.authentication.kerberos.principal}} has a default value, so 
even if we don't configure this key, the cluster will still start normally

h3. TestRouterHttpDelegationToken
 # HDFS-14434 ignores the user.name query parameter in secure WebHDFS, and the 
initial version of this test leveraged this parameter to bypass the Kerberos 
authentication, so after HDFS-14434 it no longer works. I added a set of 
methods that send requests over an http connection instead of 
{{WebHdfsFileSystem}} to keep it working.
 # HADOOP-16314 changed the configuration key of the authentication filter from 
{{dfs.web.authentication.filter}} to {{hadoop.http.filter.initializers}}, so I 
added a {{NoAuthFilterInitializer}} to initialize {{NoAuthFilter}} (see the 
sketch at the end of this comment)
 # For the case {{testGetDelegationToken()}}, the server address is set by 
WebHdfsFileSystem after it gets the response, and the original address is the 
address of the RouterRpcServer. Since we now send the request over an http 
connection directly, it's unnecessary to reset the address, so I removed this 
assert
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception 
is also generated by WebHdfsFileSystem and the logic is very complex; I think 
it's also unnecessary to keep this assert, so I use 403 detection instead.

 

In the trunk code, the config {{dfs.web.authentication.filter}} is not used 
anywhere; I propose to deprecate this config and will track it in another Jira.
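
For reference, here is a rough sketch of wiring a pass-through filter via 
{{hadoop.http.filter.initializers}} (simplified; the filter class name below is 
a hypothetical stand-in for the one in the patch):

{code:java}
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.http.FilterContainer;
import org.apache.hadoop.http.FilterInitializer;

public class NoAuthFilterInitializer extends FilterInitializer {
  // Hypothetical pass-through filter class; the real one lives in the patch.
  static final String NO_AUTH_FILTER_CLASS = "org.example.NoAuthFilter";

  @Override
  public void initFilter(FilterContainer container, Configuration conf) {
    // Register the filter on every endpoint; NoAuthFilter would simply
    // invoke the chain so requests skip Kerberos authentication.
    container.addFilter("NoAuthFilter", NO_AUTH_FILTER_CLASS,
        new HashMap<String, String>());
  }
}
{code}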

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907486#comment-16907486
 ] 

Chen Zhang edited comment on HDFS-14609 at 8/14/19 5:45 PM:


Thanks [~tasanuma] for providing the old revision of HDFS-13891, it's very 
helpful.

I've fixed these 2 tests, here are some details:
h3. TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal

HADOOP-16314 and HADOOP-16354 made some changes which break the test:
 # They added an AuthFilterInitializer, which uses 
{{hadoop.http.authentication.kerberos.*}} instead of 
{{dfs.web.authentication.kerberos.*}} to initialize Kerberos
 # {{hadoop.http.authentication.kerberos.principal}} has a default value, so 
even if we don't configure this key, the cluster will still start normally

h3. TestRouterHttpDelegationToken
 # HDFS-14434 ignores the user.name query parameter in secure WebHDFS, and the 
initial version of this test leveraged this parameter to bypass the Kerberos 
authentication, so after HDFS-14434 it no longer works. I added a set of 
methods that send requests over an http connection instead of 
{{WebHdfsFileSystem}} to keep it working.
 # HADOOP-16314 changed the configuration key of the authentication filter from 
{{dfs.web.authentication.filter}} to {{hadoop.http.filter.initializers}}, so I 
added a {{NoAuthFilterInitializer}} to initialize {{NoAuthFilter}}
 # For the case {{testGetDelegationToken()}}, the server address is set by 
WebHdfsFileSystem after it gets the response, and the original address is the 
address of the RouterRpcServer. Since we now send the request over an http 
connection directly, it's unnecessary to reset the address, so I removed this 
assert
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception 
is also generated by WebHdfsFileSystem and the logic is very complex; I think 
it's also unnecessary to keep this assert, so I use 403 detection instead.

 

In the trunk code, the config {{dfs.web.authentication.filter}} is not used 
anywhere; I propose to deprecate this config and will track it in another Jira.


was (Author: zhangchen):
Thanks [~tasanuma] for providing the old revision of HDFS-13891, it's very 
helpful.

I've fixed these 2 tests, here are some details:
h3. TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal

HADOOP-16314 and HADOOP-16354 made some changes which break the test:
 # They added an AuthFilterInitializer, which uses 
{{hadoop.http.authentication.kerberos.*}} instead of 
{{dfs.web.authentication.kerberos.*}} to initialize Kerberos
 # {{hadoop.http.authentication.kerberos.principal}} has a default value, so 
even if we don't configure this key, the cluster will still start normally

h3. TestRouterHttpDelegationToken
 # HDFS-14434 ignores the user.name query parameter in secure WebHDFS, and the 
initial version of this test leveraged this parameter to bypass the Kerberos 
authentication, so after HDFS-14434 it no longer works. I added a set of 
methods that send requests over an http connection instead of 
{{WebHdfsFileSystem}} to keep it working.
 # HADOOP-16314 changed the configuration key of the authentication filter from 
{{dfs.web.authentication.filter}} to {{hadoop.http.filter.initializers}}, so I 
added a {{NoAuthFilterInitializer}} to initialize {{NoAuthFilter}}
 # For the case {{testGetDelegationToken()}}, the server address is set by 
WebHdfsFileSystem after it gets the response, and the original address is the 
address of the RouterRpcServer. Since we now send the request over an http 
connection directly, it's unnecessary to reset the address, so I removed this 
assert
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception 
is also generated by WebHdfsFileSystem and the logic is very complex; I think 
it's also unnecessary to keep this assert, so I use 403 detection instead.

 

In the trunk code, the config {{dfs.web.authentication.filter}} is not used 
anywhere; I propose to deprecate this config and will track it in another Jira.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Comment Edited] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907486#comment-16907486
 ] 

Chen Zhang edited comment on HDFS-14609 at 8/14/19 5:46 PM:


Thanks [~tasanuma] for providing the old revision of HDFS-13891, it's very 
helpful.

I've fixed these 2 tests, here are some details:
h3. TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal

HADOOP-16314 and HADOOP-16354 made some changes which break the test:
 # They added an AuthFilterInitializer, which uses 
{{hadoop.http.authentication.kerberos.*}} instead of 
{{dfs.web.authentication.kerberos.*}} to initialize Kerberos
 # {{hadoop.http.authentication.kerberos.principal}} has a default value, so 
even if we don't configure this key, the cluster will still start normally

h3. TestRouterHttpDelegationToken
 # HDFS-14434 ignores the user.name query parameter in secure WebHDFS, and the 
initial version of this test leveraged this parameter to bypass the Kerberos 
authentication, so after HDFS-14434 it no longer works. I added a set of 
methods that send requests over an http connection instead of 
{{WebHdfsFileSystem}} to keep it working.
 # HADOOP-16314 changed the configuration key of the authentication filter from 
{{dfs.web.authentication.filter}} to {{hadoop.http.filter.initializers}}, so I 
added a {{NoAuthFilterInitializer}} to initialize {{NoAuthFilter}}
 # For the case {{testGetDelegationToken()}}, the server address is set by 
WebHdfsFileSystem after it gets the response, and the original address is the 
address of the RouterRpcServer. Since we now send the request over an http 
connection directly, it's unnecessary to reset the address, so I removed this 
assert
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception 
is also generated by WebHdfsFileSystem and the logic is very complex; I think 
it's also unnecessary to keep this assert, so I use 403 detection instead.

 

In the trunk code, the config {{dfs.web.authentication.filter}} is not used 
anywhere; I propose to deprecate this config and will track it in another Jira.


was (Author: zhangchen):
Thanks [~tasanuma] for providing the old revision of HDFS-13891, it's very 
helpful.

I've fixed these 2 tests, here are some details:
h3. TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal

HADOOP-16314 and HADOOP-16354 made some changes which break the test:
 # They added an AuthFilterInitializer, which uses 
{{hadoop.http.authentication.kerberos.*}} instead of 
{{dfs.web.authentication.kerberos.*}} to initialize Kerberos
 # {{hadoop.http.authentication.kerberos.principal}} has a default value, so 
even if we don't configure this key, the cluster will still start normally

h3. TestRouterHttpDelegationToken
 # HDFS-14434 ignores the user.name query parameter in secure WebHDFS, and the 
initial version of this test leveraged this parameter to bypass the Kerberos 
authentication, so after HDFS-14434 it no longer works. I added a set of 
methods that send requests over an http connection instead of 
{{WebHdfsFileSystem}} to keep it working.
 # HADOOP-16314 changed the configuration key of the authentication filter from 
{{dfs.web.authentication.filter}} to {{hadoop.http.filter.initializers}}, so I 
added a {{NoAuthFilterInitializer}} to initialize {{NoAuthFilter}}
 # For the case {{testGetDelegationToken()}}, the server address is set by 
WebHdfsFileSystem after it gets the response, and the original address is the 
address of the RouterRpcServer. Since we now send the request over an http 
connection directly, it's unnecessary to reset the address, so I removed this 
assert
 # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception 
is also generated by WebHdfsFileSystem and the logic is very complex; I think 
it's also unnecessary to keep this assert, so I use 403 detection instead.

 

In the trunk code, the config {{dfs.web.authentication.filter}} is not used 
anywhere; I propose to deprecate this config and will track it in another Jira.

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter

2019-08-14 Thread Chen Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhang updated HDFS-14609:
--
Attachment: HDFS-14609.001.patch

> RBF: Security should use common AuthenticationFilter
> 
>
> Key: HDFS-14609
> URL: https://issues.apache.org/jira/browse/HDFS-14609
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: CR Hota
>Assignee: Chen Zhang
>Priority: Major
> Attachments: HDFS-14609.001.patch
>
>
> We worked on router based federation security as part of HDFS-13532. We kept 
> it compatible with the way namenode works. However with HADOOP-16314 and 
> HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to 
> fail.
> Changes are needed appropriately in RBF, mainly fixing broken tests.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


