[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837255#comment-15837255 ] Chen Zhang commented on HDFS-11303: --- Got it, thanks a lot for your explanation, Stack. > Hedged read might hang infinitely if read data from all DN failed > -- > > Key: HDFS-11303 > URL: https://issues.apache.org/jira/browse/HDFS-11303 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 3.0.0-alpha1 >Reporter: Chen Zhang >Assignee: Chen Zhang > Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch > > > Hedged read reads from one DN first; on timeout, it reads from the other DNs simultaneously. > If the reads from all DNs fail, this bug leaves the future list non-empty (the first timed-out request remains in the list) and the read loop hangs infinitely -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
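The hang described here can be sketched with plain JDK concurrency primitives. This is a simplified stand-in for the hedged-read polling loop, not the actual DFSInputStream code; names and structure are illustrative. If the loop removes futures only on success, a request that fails on every DN stays in the list and the loop never exits; removing the future in a finally block, whether it succeeded or threw, lets the loop terminate.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class HedgedReadHangSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<byte[]> cs = new ExecutorCompletionService<>(pool);
        List<Future<byte[]>> futures = new ArrayList<>();

        // Simulate a hedged request whose reads fail on every DN.
        futures.add(cs.submit(() -> { throw new IOException("read from all DNs failed"); }));

        // Fixed loop: the future is removed whether it succeeded or threw.
        // The buggy version removed it only on success, so a request that
        // failed everywhere stayed in the list and the loop spun forever.
        while (!futures.isEmpty()) {
            Future<byte[]> done = cs.poll(10, TimeUnit.MILLISECONDS);
            if (done == null) continue;       // nothing finished yet, keep polling
            try {
                done.get();
                System.out.println("read ok");
            } catch (ExecutionException e) {
                System.out.println("read failed: " + e.getCause().getMessage());
            } finally {
                futures.remove(done);         // the essential removal
            }
        }
        System.out.println("loop exited");
        pool.shutdown();
    }
}
```

Without the `finally` removal, the failed future is never taken out of `futures` and the `while` condition holds forever, which matches the reported infinite hang.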
[jira] [Commented] (HDFS-10927) Lease Recovery: File not getting closed on HDFS when block write operation fails
[ https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871310#comment-15871310 ] Chen Zhang commented on HDFS-10927: --- Hey guys, I just found an issue like this. The HBase RegionServer got a DiskFull exception while writing WAL files, and the client failed after several retries. When the Master tried to use recoverLease to recover these files, we got almost the same logs as [~ngoswami] attached {quote} java.io.IOException: File length mismatched. The length of /home/work/ssd11/hdfs//datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099 is 174432256 but r=ReplicaBeingWritten, blk_1073970099_229357, RBW getNumBytes() = 174437376 getBytesOnDisk() = 174429752 getVisibleLength()= 174429752 getVolume() = /home/work/ssd11/hdfs/x/datanode/current getBlockFile()= /home/work/ssd11/hdfs/x/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099 bytesAcked=174429752 bytesOnDisk=174429752 {quote} It's caused by an exception from out.write() in BlockReceiver.receivePacket(): receivePacket() first updates numBytes in the replicaInfo, then writes the data to disk, and updates bytesOnDisk last, so the DiskFull exception leaves the lengths inconsistent. > Lease Recovery: File not getting closed on HDFS when block write operation > fails > > > Key: HDFS-10927 > URL: https://issues.apache.org/jira/browse/HDFS-10927 > Project: Hadoop HDFS > Issue Type: Bug > Components: fs >Affects Versions: 2.7.1 >Reporter: Nitin Goswami > > HDFS was unable to close a file when a block write operation failed because of > too high disk usage. > Scenario: > HBase was writing WAL logs on HDFS and the disk usage was too high at that > time.
While writing these WAL logs, one of the block write operations failed > with the following exception: > 2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Exception for > BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899 > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/192.168.194.144:50010 remote=/192.168.192.162:43105] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.BufferedInputStream.fill(Unknown Source) > at java.io.BufferedInputStream.read1(Unknown Source) > at java.io.BufferedInputStream.read(Unknown Source) > at java.io.DataInputStream.read(Unknown Source) > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199) > at > org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) > at > org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) > at > org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251) > at java.lang.Thread.run(Unknown Source) > After this exception, HBase tried to close/rollover the WAL file, but that > call also failed
and the WAL file couldn't be closed. After this, HBase closed the > region server. > After some time, Lease Recovery was triggered for this file and the following > exceptions started occurring: > 2016-09-13 11:51:11,743 WARN > org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to > obtain replica info for block > (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from > datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null]) > java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < > getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW > getNumBytes() = 45524696 > getBytesOnDisk() = 45483527 > getVisibleLength()= 45511557 > getVolume() = /opt/reflex/data/yarn/datanode/current > getBlockFile()= >
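The update ordering blamed in the comment above can be condensed into a toy sketch. The field names are hypothetical stand-ins for the counters on ReplicaBeingWritten; the real logic lives in BlockReceiver.receivePacket(). Because the expected length is advanced before the disk write and bytesOnDisk only after it succeeds, a DiskFull failure mid-write strands the two lengths apart:

```java
import java.io.IOException;

public class ReplicaLengthSketch {
    // Stand-ins for ReplicaBeingWritten.getNumBytes() / getBytesOnDisk()
    static long numBytes = 0;
    static long bytesOnDisk = 0;

    static void receivePacket(int packetLen, boolean diskFull) throws IOException {
        numBytes += packetLen;                    // 1. expected length advanced first
        if (diskFull) {
            throw new IOException("No space left on device"); // 2. out.write() fails
        }
        bytesOnDisk += packetLen;                 // 3. never reached on failure
    }

    public static void main(String[] args) {
        try { receivePacket(65536, false); } catch (IOException ignored) { }
        try { receivePacket(65536, true);  } catch (IOException ignored) { }
        // After the failed packet the two counters disagree, matching the
        // "File length mismatched" error seen during lease recovery.
        System.out.println("numBytes=" + numBytes + " bytesOnDisk=" + bytesOnDisk);
    }
}
```

Lease recovery then compares these lengths and refuses the replica, which is exactly the mismatch quoted in the logs.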
[jira] [Comment Edited] (HDFS-10927) Lease Recovery: File not getting closed on HDFS when block write operation fails
[ https://issues.apache.org/jira/browse/HDFS-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871310#comment-15871310 ] Chen Zhang edited comment on HDFS-10927 at 2/17/17 7:25 AM: Hey guys, I just found an issue like this. The HBase RegionServer got a DiskFull exception while writing WAL files, and the client failed after several retries. When the Master tried to use recoverLease to recover these files, we got almost the same logs as [~ngoswami] attached {quote} java.io.IOException: File length mismatched. The length of /home/work/ssd11/hdfs//datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099 is 174432256 but r=ReplicaBeingWritten, blk_1073970099_229357, RBW getNumBytes() = 174437376 getBytesOnDisk() = 174429752 getVisibleLength()= 174429752 getVolume() = /home/work/ssd11/hdfs/x/datanode/current getBlockFile()= /home/work/ssd11/hdfs/x/datanode/current/BP-228094273-10.136.5.10-1486630815208/current/rbw/blk_1073970099 bytesAcked=174429752 bytesOnDisk=174429752 {quote} In my case, it's caused by an exception from out.write() in BlockReceiver.receivePacket(): receivePacket() first updates numBytes in the replicaInfo, then writes the data to disk, and updates bytesOnDisk last, so the DiskFull exception leaves the lengths inconsistent.
[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824141#comment-15824141 ] Chen Zhang commented on HDFS-11303: --- Hi Andrew, I'm a newbie to the Apache community; HDFS-11303 is the first issue I've submitted. Last week I received a mail saying you had updated this issue, thanks a lot for your attention! But the issue seems frozen to me and I can't do any operation on it now. Could you help point out the current status of this issue and what I should do next? Thanks a lot. Best, Chen
[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813527#comment-15813527 ] Chen Zhang commented on HDFS-11303: --- Stack, thanks for your comments. Yes, the test is just to verify the fix: it hangs without it.
[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-11303: -- Attachment: HDFS-11303-001.patch
[jira] [Created] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
Chen Zhang created HDFS-11303: - Summary: Hedged read might hang infinitely if read data from all DN failed Key: HDFS-11303 URL: https://issues.apache.org/jira/browse/HDFS-11303 Project: Hadoop HDFS Issue Type: Bug Components: hdfs-client Affects Versions: 3.0.0-alpha1 Reporter: Chen Zhang Hedged read reads from one DN first; on timeout, it reads from the other DNs simultaneously. If the reads from all DNs fail, this bug leaves the future list non-empty (the first timed-out request remains in the list) and the read loop hangs infinitely
[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134583#comment-16134583 ] Chen Zhang commented on HDFS-11303: --- Thanks [~jzhuge] a lot for pushing this issue forward. I'm sorry for leaving this issue aside; I've been really busy with work these days. > Hedged read might hang infinitely if read data from all DN failed > -- > > Key: HDFS-11303 > URL: https://issues.apache.org/jira/browse/HDFS-11303 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 3.0.0-alpha1 >Reporter: Chen Zhang >Assignee: Chen Zhang > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: HDFS-11303-001.patch, HDFS-11303-001.patch, > HDFS-11303-002.patch, HDFS-11303-002.patch, HDFS-11303.003.patch, > HDFS-11303.004.patch, HDFS-11303.005.patch
[jira] [Commented] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046574#comment-16046574 ] Chen Zhang commented on HDFS-11303: --- [~jzhuge] thanks for your help, I've fixed the checkstyle error. The failed unit tests are not related to this patch.
[jira] [Commented] (HDFS-8693) refreshNamenodes does not support adding a new standby to a running DN
[ https://issues.apache.org/jira/browse/HDFS-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062512#comment-16062512 ] Chen Zhang commented on HDFS-8693: -- Hi [~vinayrpet] [~ajithshetty], any progress on this issue? Supporting adding a new standby is very useful for operating large clusters. When one of the machines running a NameNode is down and we have to add a new standby, restarting thousands of DataNodes takes a very long time, and if the active NameNode crashes during that window, the whole cluster becomes unavailable. > refreshNamenodes does not support adding a new standby to a running DN > -- > > Key: HDFS-8693 > URL: https://issues.apache.org/jira/browse/HDFS-8693 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, ha >Affects Versions: 2.6.0 >Reporter: Jian Fang >Assignee: Ajith S >Priority: Critical > Attachments: HDFS-8693.02.patch, HDFS-8693.1.patch > > > I tried to run the following command on a Hadoop 2.6.0 cluster with HA > support: > $ hdfs dfsadmin -refreshNamenodes datanode-host:port > to refresh name nodes on data nodes after I replaced one name node with a new > one, so that I don't need to restart the data nodes. However, I got the > following error: > refreshNamenodes: HA does not currently support adding a new standby to a > running DN. Please do a rolling restart of DNs to reconfigure the list of NNs. > I checked the 2.6.0 code and the error was thrown by the following code > snippet, which led me to this JIRA. > void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException { > Set<InetSocketAddress> oldAddrs = Sets.newHashSet(); > for (BPServiceActor actor : bpServices) > { oldAddrs.add(actor.getNNSocketAddress()); } > Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs); > if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) > { // Keep things simple for now -- we can implement this at a later date. > throw new IOException( "HA does not currently support adding a new standby to > a running DN. " + "Please do a rolling restart of DNs to reconfigure the list > of NNs."); } > } > Looks like the refreshNamenodes command is an incomplete feature. > Unfortunately, a new name node on a replacement instance is critical for auto-provisioning a Hadoop cluster with HDFS HA support. Without this support, the > HA feature cannot really be used. I also observed that the new standby > name node on the replacement instance could get stuck in safe mode because no > data nodes check in with it. Even with a rolling restart, it may take quite > some time to restart all data nodes on a big cluster, for example > with 4000 data nodes, let alone that restarting DNs is way too intrusive and > not a preferable operation in production. It also increases the chance of a > double failure, because the standby name node is not really ready for a > failover if the current active name node fails.
[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-11303: -- Attachment: HDFS-11303-002.patch fix checkstyle error
[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-11303: -- Status: In Progress (was: Patch Available)
[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-11303: -- Status: Open (was: Patch Available)
[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-11303: -- Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-11303: -- Attachment: HDFS-11303-002.patch fix checkstyle issues
[jira] [Updated] (HDFS-11303) Hedged read might hang infinitely if read data from all DN failed
[ https://issues.apache.org/jira/browse/HDFS-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-11303: -- Status: Patch Available (was: In Progress) Re-submit the patch to kick off test-patch again.
[jira] [Commented] (HDFS-12936) java.lang.OutOfMemoryError: unable to create new native thread
[ https://issues.apache.org/jira/browse/HDFS-12936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294704#comment-16294704 ] Chen Zhang commented on HDFS-12936: --- Hey [~1028344...@qq.com], this error usually means your system limits are not set appropriately. There are many system limits that may cause this issue, such as max threads per process, max open files per process, etc. This [answer on stackoverflow | https://stackoverflow.com/questions/34452302/how-to-increase-maximum-number-of-jvm-threads-linux-64bit] is a great guideline for finding out which limit is not set appropriately; hope it helps. > java.lang.OutOfMemoryError: unable to create new native thread > -- > > Key: HDFS-12936 > URL: https://issues.apache.org/jira/browse/HDFS-12936 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 2.6.0 > Environment: CDH5.12 > hadoop2.6 >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > I configured the max user processes limit to 65535 for every user, and the datanode > memory is 8G. > When a lot of data was being written, the datanode was shut down. > But I can see the memory use is only < 1000M. > Please see https://pan.baidu.com/s/1o7BE0cy > *DataNode shutdown error log:* > {code:java} > 2017-12-17 23:58:14,422 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > PacketResponder: > BP-1437036909-192.168.17.36-1509097205664:blk_1074725940_987917, > type=HAS_DOWNSTREAM_IN_PIPELINE terminating > 2017-12-17 23:58:31,425 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. > Will retry in 30 seconds. > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:714) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154) > at java.lang.Thread.run(Thread.java:745) > 2017-12-17 23:59:01,426 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. > Will retry in 30 seconds. > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:714) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154) > at java.lang.Thread.run(Thread.java:745) > 2017-12-17 23:59:05,520 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. > Will retry in 30 seconds. > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:714) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:154) > at java.lang.Thread.run(Thread.java:745) > 2017-12-17 23:59:31,429 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: > Receiving BP-1437036909-192.168.17.36-1509097205664:blk_1074725951_987928 > src: /192.168.17.54:40478 dest: /192.168.17.48:50010 > {code}
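The repeated log entries above come from the DataNode's accept loop retrying after a thread-creation failure. Below is a hypothetical, heavily simplified stand-in for that retry behavior (the real code is DataXceiverServer.run()): Thread.start throws OutOfMemoryError when the OS refuses another native thread, even with plenty of free heap, and the server logs and retries rather than crashing.

```java
public class AcceptLoopSketch {
    public static void main(String[] args) throws InterruptedException {
        int attempts = 0;
        while (attempts < 3) {
            attempts++;
            try {
                startXceiverThread(attempts == 1); // simulate the limit being hit once
                System.out.println("xceiver started");
            } catch (OutOfMemoryError e) {
                // Mirrors the log line: the DataNode backs off instead of dying.
                System.out.println("DataNode is out of memory. Will retry in 30 seconds.");
                Thread.sleep(10); // 30_000 in the real server
            }
        }
    }

    static void startXceiverThread(boolean simulateLimit) {
        if (simulateLimit) {
            // What Thread.start0 throws when the per-process thread limit is reached
            throw new OutOfMemoryError("unable to create new native thread");
        }
        new Thread(() -> { }).start();
    }
}
```

This is why raising the process limits (nproc, open files, max_map_count) fixes the symptom even though JVM heap usage stays under 1 GB.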
[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-13709: -- Status: Patch Available (was: Open) > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > a disk bad track may cause data loss. > For example, there are 3 replicas on 3 machines A/B/C. If a bad track occurs > in A's replica data, and someday B and C crash at the same time, the NN will > try to replicate the data from A but fail; this block is corrupt now but no one > knows, because the NN thinks there is at least 1 healthy replica and keeps > trying to replicate it. > When reading a replica that has data on a bad track, the OS returns an EIO > error; if the DN reports the bad block as soon as it gets an EIO, we can find > this case ASAP and try to avoid data loss
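The proposal above can be sketched as follows. This is not the patch's actual code: transferBlock and reportBadBlock are hypothetical stand-ins for the DataNode's transfer path and its bad-block reporting to the NN. The idea is simply that an EIO surfacing as an IOException during the replica read should trigger a report instead of only failing the transfer.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class EioReportSketch {
    // Stand-in for the DN's channel back to the NN.
    interface NameNodeReporter { void reportBadBlock(String blockId); }

    static long transferBlock(File replica, String blockId,
                              OutputStream out, NameNodeReporter nn) throws IOException {
        try (InputStream in = new FileInputStream(replica)) {
            byte[] buf = new byte[8192];
            long sent = 0;
            int n;
            while ((n = in.read(buf)) > 0) { out.write(buf, 0, n); sent += n; }
            return sent;
        } catch (IOException e) {
            // An EIO from a bad track surfaces as an IOException while reading.
            if (e.getMessage() != null && e.getMessage().contains("Input/output error")) {
                nn.reportBadBlock(blockId); // mark corrupt ASAP so the NN stops trusting this replica
            }
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Healthy-path demo with a temp file standing in for a replica.
        File f = File.createTempFile("blk_", ".data");
        try (FileOutputStream fos = new FileOutputStream(f)) { fos.write(new byte[1024]); }
        long sent = transferBlock(f, "blk_1", new ByteArrayOutputStream(),
                id -> System.out.println("reportBadBlock " + id));
        System.out.println("sent=" + sent);
        f.delete();
    }
}
```

With prompt reporting, the single-replica scenario in the description is caught the first time the bad track is read, rather than after B and C are already gone.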
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Description: Sometimes we need to run NNBench for some scaling tests after made some improvements on NameNode, so we have to deploy a new HDFS cluster and a new Yarn cluster. If NNBench support multi-cluster access, we only need to deploy a new HDFS test cluster and add it to existing YARN cluster, it'll make the scaling test easier. Even more, if we want to do some A-B test, we have to run NNBench on different HDFS clusters, this patch will be helpful. was: Sometimes we need to run NNBench for some scaling tests after made some improvements on NameNode, so we have to deploy a new HDFS cluster and a new Yarn cluster. If NNBench support multi-cluster access, we only need to deploy a new HDFS test cluster and add it to existing YARN cluster, it'll make the scaling test easier. Even more, if we want to make some A-B test, we have to run NNBench on different HDFS clusters, this patch will be helpful. > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang > Attachments: HDFS-12967-001.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16305877#comment-16305877 ] Chen Zhang commented on HDFS-12967: --- Thanks for your response and suggestion, [~jojochuang]. bq. what about adding a new command parameter so that this support is visible I don't think it needs a new command parameter; just using a path with a prefix like hdfs://some-cluster/user/foo will work. It's also reasonable to add some explanation in the help text. bq. You should also consider adding tests in TestNNBench Thanks for the reminder, I'll add a test soon. > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang > Attachments: HDFS-12967-001.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
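The comment above notes that no new NNBench flag is needed because a fully qualified path like hdfs://some-cluster/user/foo already names the target cluster in its authority component. The sketch below shows that extraction with standard URI parsing; the function name and defaulting behavior are illustrative, not NNBench code:

```python
from urllib.parse import urlparse

def target_cluster(base_dir, default="default"):
    """Return the cluster that a benchmark base directory points at.

    A fully qualified path (hdfs://some-cluster/user/foo) carries the
    cluster name in its authority part; a bare path like /benchmarks has
    no scheme/authority and falls back to the client's default filesystem.
    """
    parsed = urlparse(base_dir)
    if parsed.scheme == "hdfs" and parsed.netloc:
        return parsed.netloc
    return default
```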
[jira] [Created] (HDFS-12967) NNBench should support multi-cluster access
Chen Zhang created HDFS-12967: - Summary: NNBench should support multi-cluster access Key: HDFS-12967 URL: https://issues.apache.org/jira/browse/HDFS-12967 Project: Hadoop HDFS Issue Type: Improvement Components: benchmarks Reporter: Chen Zhang Sometimes we need to run NNBench for some scaling tests after made some improvements on NameNode, so we have to deploy a new HDFS cluster and a new Yarn cluster. If NNBench support multi-cluster access, we only need to deploy a new HDFS test cluster and add it to existing YARN cluster, it'll make the scaling test easier. Even more, if we want to make some A-B test, we have to run NNBench on different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Attachment: HDFS-12967-001.patch > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang > Attachments: HDFS-12967-001.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to make some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
Chen Zhang created HDFS-13709: - Summary: Report bad block to NN when transfer block encounter EIO exception Key: HDFS-13709 URL: https://issues.apache.org/jira/browse/HDFS-13709 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Reporter: Chen Zhang Assignee: Chen Zhang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-13709: -- Attachment: HDFS-13709.patch > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > disk bad track may cause data loss. > For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs > on A's replica data, and someday B and C crushed at the same time, NN will > try to replicate data from A but failed, this block is corrupt now but no one > knows, because NN think there is at least 1 healthy replica and it keep > trying to replicate it. > When reading a replica which hav data on bad track, OS will return an EIO > error, if DN reports the bad block as soon as it got an EIO, we can find > this case ASAP and try to avoid data loss -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-13709: -- Description: In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes disk bad track may cause data loss. For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs on A's replica data, and someday B and C crushed at the same time, NN will try to replicate data from A but failed, this block is corrupt now but no one knows, because NN think there is at least 1 healthy replica and it keep trying to replicate it. When reading a replica which hav data on bad track, OS will return an EIO error, if DN reports the bad block as soon as it got an EIO, we can find this case ASAP and try to avoid data loss > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > disk bad track may cause data loss. > For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs > on A's replica data, and someday B and C crushed at the same time, NN will > try to replicate data from A but failed, this block is corrupt now but no one > knows, because NN think there is at least 1 healthy replica and it keep > trying to replicate it. > When reading a replica which hav data on bad track, OS will return an EIO > error, if DN reports the bad block as soon as it got an EIO, we can find > this case ASAP and try to avoid data loss -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-13709: -- Description: In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes disk bad track may cause data loss. For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs on A's replica data, and someday B and C crushed at the same time, NN will try to replicate data from A but failed, this block is corrupt now but no one knows, because NN think there is at least 1 healthy replica and it keep trying to replicate it. When reading a replica which have data on bad track, OS will return an EIO error, if DN reports the bad block as soon as it got an EIO, we can find this case ASAP and try to avoid data loss was: In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes disk bad track may cause data loss. For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs on A's replica data, and someday B and C crushed at the same time, NN will try to replicate data from A but failed, this block is corrupt now but no one knows, because NN think there is at least 1 healthy replica and it keep trying to replicate it. When reading a replica which hav data on bad track, OS will return an EIO error, if DN reports the bad block as soon as it got an EIO, we can find this case ASAP and try to avoid data loss > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > disk bad track may cause data loss. 
> For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs > on A's replica data, and someday B and C crushed at the same time, NN will > try to replicate data from A but failed, this block is corrupt now but no one > knows, because NN think there is at least 1 healthy replica and it keep > trying to replicate it. > When reading a replica which have data on bad track, OS will return an EIO > error, if DN reports the bad block as soon as it got an EIO, we can find > this case ASAP and try to avoid data loss -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Status: Open (was: Patch Available) Submitted the wrong version for patch-002, which would cause the compilation to fail; updated patch-003. > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, > HDFS-12967-003.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790168#comment-16790168 ] Chen Zhang commented on HDFS-12967: --- [~jojochuang], would you please help me to review this patch again? I am sorry that it has been delayed for so long. > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Status: Patch Available (was: Open) > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Attachment: (was: HDFS-12967-002.patch) > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Attachment: HDFS-12967-002.patch > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790147#comment-16790147 ] Chen Zhang commented on HDFS-12967: --- {quote}Did you try {{TestDFSIO}} instead. Should be the right tool for scale testing. {quote} Thanks [~shv] for your advice. We used TestDFSIO to measure cluster throughput and performance; I think it's more suitable for data access performance testing. NNBench, as its name implies, is specially designed for stress testing NameNode operations. {quote}Also NNBench has {{-baseDir}} option. {quote} My patch just makes some minor changes to this option so that NNBench supports multi-cluster access. > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Attachment: HDFS-12967-003.patch > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, > HDFS-12967-003.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Status: Patch Available (was: Open) > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, > HDFS-12967-003.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Status: Patch Available (was: Open) > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Status: In Progress (was: Patch Available) Add UT and update the help message > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work stopped] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-12967 stopped by Chen Zhang. - > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Attachment: HDFS-12967-002.patch > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-3246) pRead equivalent for direct read path
[ https://issues.apache.org/jira/browse/HDFS-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang reassigned HDFS-3246: Assignee: Chen Zhang (was: Henry Robinson) > pRead equivalent for direct read path > - > > Key: HDFS-3246 > URL: https://issues.apache.org/jira/browse/HDFS-3246 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client, performance >Affects Versions: 3.0.0-alpha1 >Reporter: Henry Robinson >Assignee: Chen Zhang >Priority: Major > > There is no pread equivalent in ByteBufferReadable. We should consider adding > one. It would be relatively easy to implement for the distributed case > (certainly compared to HDFS-2834), since DFSInputStream does most of the > heavy lifting. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-3246) pRead equivalent for direct read path
[ https://issues.apache.org/jira/browse/HDFS-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767916#comment-16767916 ] Chen Zhang commented on HDFS-3246: -- Hi [~henryr], are you still working on this issue? It is very old and hasn't been updated for a long time, so I assigned it to myself directly; I hope you don't mind. Thanks > pRead equivalent for direct read path > - > > Key: HDFS-3246 > URL: https://issues.apache.org/jira/browse/HDFS-3246 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client, performance >Affects Versions: 3.0.0-alpha1 >Reporter: Henry Robinson >Assignee: Chen Zhang >Priority: Major > > There is no pread equivalent in ByteBufferReadable. We should consider adding > one. It would be relatively easy to implement for the distributed case > (certainly compared to HDFS-2834), since DFSInputStream does most of the > heavy lifting. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
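The semantics HDFS-3246 asks for on the direct (ByteBuffer) read path is a pread: read at a given offset without moving the stream position, so concurrent sequential reads are unaffected. The Hadoop API itself is Java, so the sketch below only illustrates those semantics using the POSIX-level os.pread:

```python
import os
import tempfile

def positional_read(fd, offset, length):
    """Read `length` bytes at `offset` without moving the file position.

    os.pread gives exactly the pread contract: the kernel reads at the
    given offset and the descriptor's current position is untouched.
    """
    return os.pread(fd, length, offset)
```

A caller can interleave positional reads with sequential reads on the same descriptor and the sequential cursor stays where it was.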
[jira] [Commented] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16873174#comment-16873174 ] Chen Zhang commented on HDFS-12967: --- Hi [~jojochuang], thanks for review. I've fixed these checkstyle errors, could you help to review the patch again? Thanks > NNBench should support multi-cluster access > --- > > Key: HDFS-12967 > URL: https://issues.apache.org/jira/browse/HDFS-12967 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12967-001.patch, HDFS-12967-002.patch, > HDFS-12967-003.patch, HDFS-12967-004.patch > > > Sometimes we need to run NNBench for some scaling tests after made some > improvements on NameNode, so we have to deploy a new HDFS cluster and a new > Yarn cluster. > If NNBench support multi-cluster access, we only need to deploy a new HDFS > test cluster and add it to existing YARN cluster, it'll make the scaling test > easier. > Even more, if we want to do some A-B test, we have to run NNBench on > different HDFS clusters, this patch will be helpful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14576) Avoid block report retry and slow down namenode startup
[ https://issues.apache.org/jira/browse/HDFS-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886081#comment-16886081 ] Chen Zhang commented on HDFS-14576: --- Hi [~hexiaoqiao], we've met a similar problem in our production environment: thousands of DataNodes reporting at almost the same time usually cause a full GC on the NameNode. Our solution is to throttle the maximum number of concurrent FBRs (e.g. 10): the NameNode rejects extra FBRs by throwing an exception, and if a DataNode receives the exception on its first FBR, it gracefully waits for a random period of time (in a given range) before retrying. This solution works very well for us, so I wanted to contribute the code to the community, but while porting this commit I found that the latest version already supports this; it's implemented by the BlockReport Lease, see HDFS-7923. We also tried HDFS-6763 and HDFS-7097 that you mentioned, but I think the block report throttle strategy is much more helpful for NameNode restart. {quote}but in later CDH versions several patches have been backported that made the initial block report problem largely disappear. Unfortunately I don't have the list of Jiras and their relative impact {quote} FYI, [~sodonnell] > Avoid block report retry and slow down namenode startup > --- > > Key: HDFS-14576 > URL: https://issues.apache.org/jira/browse/HDFS-14576 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Reporter: He Xiaoqiao >Assignee: He Xiaoqiao >Priority: Major > > During namenode startup, the load will be very high since it has to process > every datanodes blockreport one by one. If there are hundreds datanodes block > reports pending process, the issue will be more serious even > #processFirstBlockReport is processed a lot more efficiently than ordinary > block reports. Then some of datanode will retry blockreport and lengthens > restart times. 
I think we should filter out block report requests (from > datanode block report retries) which have already been processed and return directly, which would > shorten restart time. I want to state that this proposal may be obvious only > for large clusters.
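The throttle-plus-randomized-backoff scheme described in the comment above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual Hadoop code (upstream, HDFS-7923's block report lease plays the NameNode-side role): the NameNode admits at most a fixed number of concurrent FBRs, and a rejected DataNode waits a random period within a bounded range before retrying.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of FBR throttling: admit at most maxConcurrent reports. */
class FbrThrottle {
  private final AtomicInteger inFlight = new AtomicInteger();
  private final int maxConcurrent;

  FbrThrottle(int maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
  }

  /** NameNode side: admit or reject an incoming full block report. */
  boolean tryAdmit() {
    while (true) {
      int cur = inFlight.get();
      if (cur >= maxConcurrent) {
        return false; // reject; the DataNode backs off and retries
      }
      if (inFlight.compareAndSet(cur, cur + 1)) {
        return true;
      }
    }
  }

  /** Called when a report finishes processing. */
  void release() {
    inFlight.decrementAndGet();
  }

  /** DataNode side: random backoff in [minMs, maxMs) before retrying. */
  static long backoffMs(long minMs, long maxMs) {
    return ThreadLocalRandom.current().nextLong(minMs, maxMs);
  }
}
```

The randomized backoff is what spreads the retry storm out over time; without it, all rejected DataNodes would retry in lockstep and hit the limit again together.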
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Status: Patch Available (was: In Progress) > NNBench should support multi-cluster access
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Status: In Progress (was: Patch Available) > NNBench should support multi-cluster access
[jira] [Updated] (HDFS-12967) NNBench should support multi-cluster access
[ https://issues.apache.org/jira/browse/HDFS-12967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12967: -- Attachment: HDFS-12967-004.patch > NNBench should support multi-cluster access
[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky
[ https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905212#comment-16905212 ] Chen Zhang commented on HDFS-14654: --- [~John Smith] sure, if the set operation is synchronized, then the get operation should also be synchronized. > RBF: TestRouterRpc tests are flaky > -- > > Key: HDFS-14654 > URL: https://issues.apache.org/jira/browse/HDFS-14654 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, error.log > > > They sometimes pass and sometimes fail.
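The point in the comment above can be shown with a minimal generic example (this is not the RBF test code): if only the setter is synchronized, the Java memory model gives the reading thread no happens-before edge, so it may legally observe a stale value. Both accessors must synchronize on the same lock.

```java
/** Both accessors synchronize on the same monitor, so a reader that
 *  runs after a writer is guaranteed to see the written value. */
class SyncHolder {
  private long value; // long writes are not even atomic without sync

  synchronized void set(long v) {
    value = v;
  }

  // Without 'synchronized' here, the JMM would permit a stale (or torn)
  // read even though set() holds the lock.
  synchronized long get() {
    return value;
  }
}
```

Marking the field `volatile` would be an alternative when no compound read-modify-write is needed; pairing `synchronized` accessors is the conservative choice.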
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905228#comment-16905228 ] Chen Zhang commented on HDFS-14609: --- Hi [~crh] [~tasanuma], I'm working on this Jira. I found that the branch HDFS-13891 was rebased after HDFS-14074 landed on trunk, which was committed after HADOOP-16314, so the UT still fails even when we run it on the branch HDFS-13891. I've tried to rebase HDFS-13891 onto an older commit (from 12 Oct 2018), but there are too many conflicts. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > > We worked on router-based federation security as part of HDFS-13532. We kept > it compatible with the way the namenode works. However, with HADOOP-16314 and > HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests.
[jira] [Comment Edited] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905228#comment-16905228 ] Chen Zhang edited comment on HDFS-14609 at 8/12/19 2:01 PM: Hi [~crh] [~tasanuma], I'm working on this Jira. I found that the branch HDFS-13891 was rebased after HDFS-14074 landed on trunk, which was committed after HADOOP-16314, so the UT still fails even when we run it on the branch HDFS-13891. I've tried to rebase HDFS-13891 onto an older commit (from 12 Oct 2018), but there are too many conflicts. was (Author: zhangchen): Hi [~crh] [~tasanuma], I'm working on this Jira. I found that the branch HDFS-13891 was rebased after HDFS-14074 on trunk, which is committed after HADOOP-16314, so the UT still fail even when we run it on the branch HDFS-13891. I've tried to rebase the HDFS-13891 to some older commit(at 12 Oct 2018), but there is too many conflicts. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > > We worked on router-based federation security as part of HDFS-13532. We kept > it compatible with the way the namenode works. However, with HADOOP-16314 and > HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests.
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905349#comment-16905349 ] Chen Zhang commented on HDFS-14609: --- update: I switched to branch HDFS-13891 and reverted HADOOP-16314 and HADOOP-16354; TestRouterWithSecureStartup now succeeds, but TestRouterHttpDelegationToken still fails. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > > We worked on router-based federation security as part of HDFS-13532. We kept > it compatible with the way the namenode works. However, with HADOOP-16314 and > HADOOP-16354 in trunk, auth filters seem to have been changed, causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests.
[jira] [Commented] (HDFS-14657) Refine NameSystem lock usage during processing FBR
[ https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897276#comment-16897276 ] Chen Zhang commented on HDFS-14657: --- Thanks [~sodonnell] for your analysis. I'm working on a complete solution on the trunk code this week. {quote}However that is probably solvable, either by making the iterator keyed, and reopening it after acquiring the lock (or if it throws concurrentModificationException) at the correct position, {quote} Yes, this is the same solution I used in the next patch; the new patch is almost complete and I'm working on performance testing now. FBR processing is much faster on trunk, so you are right; I'll update the default value to a larger one according to the test results. > Refine NameSystem lock usage during processing FBR > -- > > Key: HDFS-14657 > URL: https://issues.apache.org/jira/browse/HDFS-14657 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14657-001.patch, HDFS-14657.002.patch > > > Disks with 12TB capacity are very common today, which means the FBR size is > much larger than before. The Namenode holds the NameSystem lock while processing the > block report for each storage, which might take quite a long time. > In our production environment, processing a large FBR usually causes longer > RPC queue times, which impacts client latency, so we did some simple work on > refining the lock usage, which improved the p99 latency significantly. > In our solution, the BlockManager releases the NameSystem write lock and requests > it again every 5000 blocks (by default) while processing an FBR; with the > fair lock, all pending RPC requests can be processed before the BlockManager > re-acquires the write lock.
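The release-and-reacquire idea in the description above can be sketched like this. This is a simplified stand-in, not the actual BlockManager/FSNamesystem code: the key is that the lock is created in fair mode, so when the processor drops and immediately re-requests the write lock, any RPC handlers already queued on the lock run first.

```java
import java.util.Iterator;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Processes a block report in chunks, yielding the write lock
 *  every BLOCKS_PER_LOCK blocks so queued RPCs can run in between. */
class ChunkedReportProcessor {
  private static final int BLOCKS_PER_LOCK = 5000;
  // Fair mode: waiting threads acquire in FIFO order, so a release
  // followed by an immediate re-request lets queued RPCs go first.
  private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock(true);

  int process(Iterator<Long> reportedBlocks) {
    int processed = 0;
    nsLock.writeLock().lock();
    try {
      while (reportedBlocks.hasNext()) {
        reportedBlocks.next(); // stand-in for per-block processing
        processed++;
        if (processed % BLOCKS_PER_LOCK == 0) {
          // Yield point: release and immediately re-request the lock.
          nsLock.writeLock().unlock();
          nsLock.writeLock().lock();
        }
      }
    } finally {
      nsLock.writeLock().unlock();
    }
    return processed;
  }
}
```

Note the trade-off the comments in this thread discuss: the in-memory state may change at each yield point, so the iteration has to tolerate concurrent additions and removals between chunks.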
[jira] [Commented] (HDFS-14680) StorageInfoDefragmenter should handle exceptions gently
[ https://issues.apache.org/jira/browse/HDFS-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897201#comment-16897201 ] Chen Zhang commented on HDFS-14680: --- {quote}But We have this thing in CDH5 for a long time (at least 3 years) and not seeing this causing NN shutdown. This is actually not true. It would shut down NN upon RuntimeException (unchecked exceptions) only. {quote} Agreed; it looks not worth much attention. Should we just leave the Jira here or resolve it as "not a problem" first? > StorageInfoDefragmenter should handle exceptions gently > --- > > Key: HDFS-14680 > URL: https://issues.apache.org/jira/browse/HDFS-14680 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Priority: Major > > StorageInfoDefragmenter is responsible for FoldedTreeSet compaction, but it > terminates the NameNode on any exception; is that too radical? > I mean, even critical threads like HeartbeatManager don't terminate the > NameNode when they encounter exceptions, so StorageInfoDefragmenter should not > do that either.
[jira] [Comment Edited] (HDFS-14680) StorageInfoDefragmenter should handle exceptions gently
[ https://issues.apache.org/jira/browse/HDFS-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897201#comment-16897201 ] Chen Zhang edited comment on HDFS-14680 at 7/31/19 1:52 PM: {quote}But We have this thing in CDH5 for a long time (at least 3 years) and not seeing this causing NN shutdown. This is actually not true. It would shut down NN upon RuntimeException (unchecked exceptions) only. {quote} Agreed; it looks not worth much attention. Should we just leave the Jira here or resolve it as "not a problem" first? was (Author: zhangchen): {quote}But We have this thing in CDH5 for a long time (at least 3 years) and not seeing this causing NN shutdown. This is actually not true. It would shut down NN upon RuntimeException (unchecked exceptions) only. {quote} Agree. so it looks not worth much attention, so should we just leave the Jira here or resolve it as "not a problem" first? > StorageInfoDefragmenter should handle exceptions gently > --- > > Key: HDFS-14680 > URL: https://issues.apache.org/jira/browse/HDFS-14680 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Priority: Major > > StorageInfoDefragmenter is responsible for FoldedTreeSet compaction, but it > terminates the NameNode on any exception; is that too radical? > I mean, even critical threads like HeartbeatManager don't terminate the > NameNode when they encounter exceptions, so StorageInfoDefragmenter should not > do that either.
[jira] [Comment Edited] (HDFS-14680) StorageInfoDefragmenter should handle exceptions gently
[ https://issues.apache.org/jira/browse/HDFS-14680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896739#comment-16896739 ] Chen Zhang edited comment on HDFS-14680 at 7/31/19 1:46 PM: No, I didn't encounter this issue. I'm working on HDFS-14657 and it relates to HDFS-9260; when I was reading the code of HDFS-9260, I found this design too aggressive. StorageInfoDefragmenter should not shut down the NameNode on any exception, because it's not a critical thread. I mean, it should at least retry a few times before shutting down the NameNode, or maybe it can choose to keep running no matter what exception happens, like HeartbeatManager. We're upgrading our production cluster from 2.6 to 3.1, and I don't want this to happen to our NameNode, so it's just a proposal for discussion. was (Author: zhangchen): No I don't encounter this issue. I'm working on HDFS-14657 and it relates with HDFS-9620, when I reading code of HDFS-9620, I found this design is too aggressive, StorageInfoDefragmenter should not shutdown NameNode on any exception, because it's not a critical thread, I'mean it should at least retry some times before shutdown NameNode, or maybe it can choose keep running no matter what exception happens, like HeartBeatManager. We're upgrading our production cluster from 2.6 to 3.1, I don't want this happen to our NameNode, so it's just a proposal for discussion. > StorageInfoDefragmenter should handle exceptions gently > --- > > Key: HDFS-14680 > URL: https://issues.apache.org/jira/browse/HDFS-14680 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Priority: Major > > StorageInfoDefragmenter is responsible for FoldedTreeSet compaction, but it > terminates the NameNode on any exception; is that too radical? > I mean, even critical threads like HeartbeatManager don't terminate the > NameNode when they encounter exceptions, so StorageInfoDefragmenter should not > do that either. 
[jira] [Comment Edited] (HDFS-14657) Refine NameSystem lock usage during processing FBR
[ https://issues.apache.org/jira/browse/HDFS-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896917#comment-16896917 ] Chen Zhang edited comment on HDFS-14657 at 7/31/19 3:20 PM: Thanks [~shv], but sorry, I can't see any problem with this change on the 2.6 version. {quote}I believe when you release the lock while iterating over the storage blocks, the iterator may find itself in an isolated chain of the list after reacquiring the lock {quote} It won't happen, because processReport doesn't iterate the storage blocks in 2.6; the whole FBR procedure (for each storage) can be simplified like this: | # Insert a delimiter at the head of the block list (triplets; it's actually a doubly linked list, so I'll refer to it as the block list for simplicity) of this storage. # Start a loop, iterating through the block report ## Get a block from the report ## Use the block to get the stored BlockInfo object from the BlockMap ## Check the status of the block, and add the block to the corresponding set (toAdd, toUc, toInvalidate, toCorrupt) ## Move the block to the head of the block list (which places the block before the delimiter) # Start a loop to iterate through the block list, find the blocks after the delimiter, and add them to the toRemove set.| My proposal in this Jira is to release and re-acquire the NN lock between step 2.3 and step 2.4. This solution won't affect the correctness of the block report procedure for the following reasons: # In the end, all the reported blocks will have been moved before the delimiter. # If any other thread gets the NN lock before 2.4 and adds some new blocks, they will be added at the head of the list. # If any other thread gets the NN lock before 2.4 and removes some blocks, it won't affect the loop in the 2nd step. 
(Please notice that the delimiter can't be removed by other threads) # All the blocks after the delimiter should be removed. According to the reasons described above, the following problem you mentioned also won't happen: {quote}you may remove replicas that were not supposed to be removed {quote} I agree with you that things are tricky here, but this change is quite simple and I think we can still make the impact clear. was (Author: zhangchen): Thanks [~shv], but sorry I can't see any problem of this change on 2.6 version. {quote}I believe when you release the lock while iterating over the storage blocks, the iterator may find itself in an isolated chain of the list after reacquiring the lock {quote} It won't happen, because processReport don't iterate the storage blocks at 2.6, the whole FBR procedure(for each storage) can be simplified like this: | # Insert a delimiter into the head of block list(triplets, it's actually a double linked list, so I'll ref it as the block list for simplification) of this storage. # Start a loop, iterate through block report ## Get a block from the report ## Using the block to get the stored BlockInfo object from BlockMap ## Check the status of the block, and add the block to corresponding set(toAdd, toUc, toInvalidate, toCorrupt) ## Move the block to the head of block list(which makes the block placed before delimiter) # Start a loop to iterate through block list, find the blocks after delimiter, add them to toRemove set.| My proposal in this Jira is to release and re-acquire NN lock between 2.3 and 2.4. This solution won't affect the correctness of block report procedure for the following reasons: # At last, all the reported block will be moved before delimiter. # If any other thread acquire the NN lock before 2.4 add adds some new blocks, they will be added in the head of list. # If any other thread acquire the NN lock before 2.4 and removes some blocks, it won't affect the loop at 2nd step. 
(Pls notice that the delimiter can't be remove by other threads) # All the blocks after delimiter should be removed According to the reasons described above, the following problem you mentioned also won't happen: {quote}you may remove replicas that were not supposed to be removed {quote} I agree with you that the things are tricky here, but this change is quite simple and I think we still can make clear the impaction. > Refine NameSystem lock usage during processing FBR > -- > > Key: HDFS-14657 > URL: https://issues.apache.org/jira/browse/HDFS-14657 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14657-001.patch, HDFS-14657.002.patch > > > The disk with 12TB capacity is very normal today, which means the FBR size is > much larger than before, Namenode holds the NameSystemLock during processing > block report for each storage, which might take quite a long time. > On our production environment, processing large FBR usually cause a longer
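The delimiter technique in the steps above can be modeled with a toy list. This is an illustration only: `java.util.LinkedList` stands in for the BlockInfo triplets structure, strings stand in for blocks, and all names are invented. Reported blocks are moved to the head (before the delimiter); whatever remains after the delimiter was not reported and belongs in the toRemove set. Blocks added concurrently also land at the head, so they can never be removed by mistake.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

/** Toy model of the delimiter-based "find unreported blocks" scan. */
class DelimiterScan {
  static final String DELIMITER = "<<delimiter>>";

  static Set<String> findUnreported(LinkedList<String> storedBlocks,
                                    List<String> report) {
    storedBlocks.addFirst(DELIMITER);            // step 1: delimiter at head
    for (String b : report) {                    // step 2: walk the report
      if (storedBlocks.remove(b)) {              // stand-in for BlockMap lookup
        storedBlocks.addFirst(b);                // move before the delimiter
      }
    }
    // step 3: everything still after the delimiter was not reported
    int idx = storedBlocks.indexOf(DELIMITER);
    Set<String> toRemove =
        new HashSet<>(storedBlocks.subList(idx + 1, storedBlocks.size()));
    storedBlocks.remove(DELIMITER);              // clean up the sentinel
    return toRemove;
  }
}
```

The correctness argument from the comment maps directly onto this model: releasing a lock between the lookup and the move only interleaves additions/removals at the head or in the tail, and neither crosses the delimiter.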
[jira] [Comment Edited] (HDFS-14677) TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in trunk
[ https://issues.apache.org/jira/browse/HDFS-14677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895746#comment-16895746 ] Chen Zhang edited comment on HDFS-14677 at 7/30/19 3:56 AM: Thanks [~ayushtkn] [~elgoiri] for your comments. I ran the test 1000 times without the patch; it failed 3 times, and all 3 failures were caused by a NullPointerException. With the patch, it didn't fail. was (Author: zhangchen): Thanks [~ayushtkn] [~elgoiri] for your comments I've run the test repeatedly for 1000 times without the patch, it failed 3 times. After patch, it didn't fail. > TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in > trunk > --- > > Key: HDFS-14677 > URL: https://issues.apache.org/jira/browse/HDFS-14677 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14677.001.patch > > > Stacktrace: > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testAddVolumesConcurrently(TestDataNodeHotSwapVolumes.java:615) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {code} > see: > [https://builds.apache.org/job/PreCommit-HDFS-Build/27328/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/] > and > [https://builds.apache.org/job/PreCommit-HDFS-Build/27312/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14677) TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in trunk
[ https://issues.apache.org/jira/browse/HDFS-14677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895746#comment-16895746 ] Chen Zhang commented on HDFS-14677: --- Thanks [~ayushtkn] [~elgoiri] for your comments I've run the test repeatedly for 1000 times without the patch, it failed 3 times. After patch, it didn't fail. > TestDataNodeHotSwapVolumes#testAddVolumesConcurrently fails intermittently in > trunk > --- > > Key: HDFS-14677 > URL: https://issues.apache.org/jira/browse/HDFS-14677 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14677.001.patch > > > Stacktrace: > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes.testAddVolumesConcurrently(TestDataNodeHotSwapVolumes.java:615) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:748) > {code} > see: > 
[https://builds.apache.org/job/PreCommit-HDFS-Build/27328/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/] > and > [https://builds.apache.org/job/PreCommit-HDFS-Build/27312/testReport/org.apache.hadoop.hdfs.server.datanode/TestDataNodeHotSwapVolumes/testAddVolumesConcurrently/] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904918#comment-16904918 ] Chen Zhang commented on HDFS-13709: --- Thanks [~jojochuang] for mentioning me at HDFS-14706. This Jira and HDFS-14706 both introduce reportBadBlock in different places; I agree with you that we need to reuse the logic for handling bad blocks. I've added a method {{handleBadBlock}} in DataNode to handle bad blocks, using the following logic: # If it's called by the scanner, then reportBadBlock to the NN in all cases # If the exception comes from elsewhere (e.g. BlockSender), first identify whether it's a bad block according to the type of the exception. If it's a bad block, then try to markSuspectBlock if the blockScanner is enabled, or report to the NN if the scanner is disabled # I left some specific logic in the {{VolumeScanner#ScanResultHandler.handle()}} method; I think it is only related to the scanner, not all situations > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-13709.002.patch, HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > bad disk tracks may cause data loss. > For example, there are 3 replicas on 3 machines A/B/C; if a bad track occurs > in A's replica data, and someday B and C crash at the same time, the NN will > try to replicate the data from A but fail. This block is corrupt now, but no one > knows, because the NN thinks there is at least 1 healthy replica and keeps > trying to replicate it. 
> When reading a replica which has data on a bad track, the OS will return an EIO > error; if the DN reports the bad block as soon as it gets an EIO, we can find > this case ASAP and try to avoid data loss
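The dispatch logic described in the comment might look roughly like this. The method name, signature, and the disk-error check below are assumptions for illustration, not the actual patch: scanner-detected blocks are always reported; for other callers we first decide from the exception whether it looks like an on-disk error (e.g. an EIO surfacing as an IOException), then prefer marking the block suspect so the VolumeScanner can verify it, falling back to a direct report when the scanner is disabled.

```java
import java.io.IOException;

/** Sketch of centralized bad-block handling on the DataNode side. */
class BadBlockHandler {
  private final boolean scannerEnabled;

  BadBlockHandler(boolean scannerEnabled) {
    this.scannerEnabled = scannerEnabled;
  }

  /** Returns the action taken, as a string for illustration. */
  String handleBadBlock(long blockId, IOException cause, boolean fromScanner) {
    if (fromScanner) {
      // Rule 1: the scanner has already verified the block is bad.
      return "reportBadBlock";
    }
    if (!looksLikeDiskError(cause)) {
      // e.g. a network error during transfer: not the disk's fault.
      return "ignore";
    }
    // Rule 2: let the scanner verify if it is running; otherwise
    // report straight to the NameNode.
    return scannerEnabled ? "markSuspectBlock" : "reportBadBlock";
  }

  private static boolean looksLikeDiskError(IOException e) {
    // Crude stand-in: real code would inspect the exception type, not
    // the message text of the EIO the OS bubbles up.
    String m = e.getMessage();
    return m != null && m.contains("Input/output error");
  }
}
```

Keeping all three rules behind one method is what avoids the duplication between this Jira and HDFS-14706 that the comment describes.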
[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-13709: -- Attachment: HDFS-13709.002.patch > Report bad block to NN when transfer block encounter EIO exception
[jira] [Commented] (HDFS-14735) File could only be replicated to 0 nodes instead of minReplication (=1)
[ https://issues.apache.org/jira/browse/HDFS-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907776#comment-16907776 ] Chen Zhang commented on HDFS-14735: --- How much space is left in your cluster? You can enable the debug log to check why the allocation failed. > File could only be replicated to 0 nodes instead of minReplication (=1) > --- > > Key: HDFS-14735 > URL: https://issues.apache.org/jira/browse/HDFS-14735 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Tatyana Alexeyev >Priority: Major > > Hello, I have an intermittent error when running my EMR Hadoop Cluster: > "Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File > /user/sphdadm/_sqoop/00501bd7b05e4182b5006b9d51 > bafb7f_f405b2f3/_temporary/1/_temporary/attempt_1565136887564_20057_m_00_0/part-m-0.snappy > could only be replicated to 0 nodes instead of minReplication (=1). There > are 5 datanode(s) running and no node(s) are excluded in this operation." > I am running Hadoop version > sphdadm@ip-10-6-15-108 hadoop]$ hadoop version > Hadoop 2.8.5-amzn-4 >
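For reference, a common way to get that detail is to raise the block placement policy logger to DEBUG in the NameNode's log4j.properties (logger names as in the Hadoop 2.x code base; verify against your version before relying on them). The NameNode then logs why each DataNode was skipped during target selection, e.g. insufficient remaining space or too many transfers in progress:

```
# NameNode log4j.properties fragment (restart or reload logging to apply)
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy=DEBUG
```

Common causes of "replicated to 0 nodes" with healthy DataNodes include near-full disks (reserved space counts against usable capacity) and exhausted transfer threads.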
[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14609: -- Attachment: HDFS-14609.002.patch > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907754#comment-16907754 ] Chen Zhang commented on HDFS-14609: --- Uploaded patch v2 to fix the checkstyle and whitespace errors. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch > > > We worked on router-based federation security as part of HDFS-13532. We kept > it compatible with the way the namenode works. However, with HADOOP-16314 and > HDFS-16354 in trunk, the auth filters seem to have changed, causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests.
[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907779#comment-16907779 ] Chen Zhang commented on HDFS-13709: --- Uploaded patch v4 to fix the checkstyle and asflicense errors; also fixed a failing unit test. > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, > HDFS-13709.004.patch, HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > a bad disk track may cause data loss. > For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track occurs > on A's replica data, and someday B and C crash at the same time, the NN will > try to replicate data from A but fail; the block is now corrupt, but no one > knows, because the NN thinks there is at least 1 healthy replica and keeps > trying to replicate it. > When reading a replica that has data on a bad track, the OS will return an EIO > error. If the DN reports the bad block as soon as it gets an EIO, we can detect > this case ASAP and try to avoid data loss.
[jira] [Updated] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-13709: -- Attachment: HDFS-13709.004.patch > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, > HDFS-13709.004.patch, HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > a bad disk track may cause data loss. > For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track occurs > on A's replica data, and someday B and C crash at the same time, the NN will > try to replicate data from A but fail; the block is now corrupt, but no one > knows, because the NN thinks there is at least 1 healthy replica and keeps > trying to replicate it. > When reading a replica that has data on a bad track, the OS will return an EIO > error. If the DN reports the bad block as soon as it gets an EIO, we can detect > this case ASAP and try to avoid data loss.
[jira] [Updated] (HDFS-14654) RBF: TestRouterRpc tests are flaky
[ https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14654: -- Attachment: HDFS-14654.004.patch > RBF: TestRouterRpc tests are flaky > -- > > Key: HDFS-14654 > URL: https://issues.apache.org/jira/browse/HDFS-14654 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, > HDFS-14654.003.patch, HDFS-14654.004.patch, error.log > > > They sometimes pass and sometimes fail. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky
[ https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907796#comment-16907796 ] Chen Zhang commented on HDFS-14654: --- Uploaded patch v4 to fix checkstyle error. > RBF: TestRouterRpc tests are flaky > -- > > Key: HDFS-14654 > URL: https://issues.apache.org/jira/browse/HDFS-14654 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, > HDFS-14654.003.patch, HDFS-14654.004.patch, error.log > > > They sometimes pass and sometimes fail. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909230#comment-16909230 ] Chen Zhang commented on HDFS-14609: --- Hi [~crh], do you have time to help review the patch? Also cc [~aagrawal] and [~elgoiri] > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-14742) RBF:TestRouterFaultTolerant tests are flaky
[ https://issues.apache.org/jira/browse/HDFS-14742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang reassigned HDFS-14742: - Assignee: Chen Zhang > RBF:TestRouterFaultTolerant tests are flaky > --- > > Key: HDFS-14742 > URL: https://issues.apache.org/jira/browse/HDFS-14742 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > > [https://builds.apache.org/job/PreCommit-HDFS-Build/27516/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt] > {code:java} > [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.665 > s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant > [ERROR] > testWriteWithFailedSubcluster(org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant) > Time elapsed: 3.516 s <<< FAILURE! > java.lang.AssertionError: > Failed to run "Full tests": > org.apache.hadoop.ipc.RemoteException(java.io.IOException): Cannot find > locations for /HASH_ALL-failsubcluster, because the default nameservice is > disabled to read or write > at > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.lookupLocation(MountTableResolver.java:425) > at > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:391) > at > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:388) > at > com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4876) > at > com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3528) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2277) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044) > at com.google.common.cache.LocalCache.get(LocalCache.java:3952) > at > 
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871) > at > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.getDestinationForPath(MountTableResolver.java:394) > at > org.apache.hadoop.hdfs.server.federation.resolver.MultipleDestinationMountTableResolver.getDestinationForPath(MultipleDestinationMountTableResolver.java:87) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getLocationsForPath(RouterRpcServer.java:1498) > at > org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getListing(RouterClientProtocol.java:734) > at > org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getListing(RouterRpcServer.java:827) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:732) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921) > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) > at org.apache.hadoop.ipc.Client.call(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1396) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy35.getListing(Unknown 
Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at >
[jira] [Commented] (HDFS-14706) Checksums are not checked if block meta file is less than 7 bytes
[ https://issues.apache.org/jira/browse/HDFS-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909207#comment-16909207 ] Chen Zhang commented on HDFS-14706: --- Thanks [~sodonnell]. What's your opinion, [~jojochuang]? > Checksums are not checked if block meta file is less than 7 bytes > - > > Key: HDFS-14706 > URL: https://issues.apache.org/jira/browse/HDFS-14706 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Attachments: HDFS-14706.001.patch, HDFS-14706.002.patch > > > If a block and its meta file are corrupted in a certain way, the corruption > can go unnoticed by a client, causing it to return invalid data. > The meta file is expected to always have a header of 7 bytes and then a > series of checksums depending on the length of the block. > If the metafile gets corrupted in such a way that it is longer than zero but > less than 7 bytes in length, then the header is incomplete. In > BlockSender.java the logic checks if the metafile length is at least the size > of the header and if it is not, it does not error, but instead returns a NULL > checksum type to the client. > https://github.com/apache/hadoop/blob/b77761b0e37703beb2c033029e4c0d5ad1dce794/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java#L327-L357 > If the client receives a NULL checksum type, it will not validate checksums > at all, and even corrupted data will be returned to the reader. This means > the corruption will go unnoticed and HDFS will never repair it. Even the Volume > Scanner will not notice the corruption as the checksums are silently ignored. 
> Additionally, if the meta file does have enough bytes that it attempts to load > the header, and the header is corrupted such that it is not valid, it can > cause the datanode Volume Scanner to exit, with an exception like the > following: > {code} > 2019-08-06 18:16:39,151 ERROR datanode.VolumeScanner: > VolumeScanner(/tmp/hadoop-sodonnell/dfs/data, > DS-7f103313-61ba-4d37-b63d-e8cf7d2ed5f7) exiting because of exception > java.lang.IllegalArgumentException: id=51 out of range [0, 5) > at > org.apache.hadoop.util.DataChecksum$Type.valueOf(DataChecksum.java:76) > at > org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:167) > at > org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:173) > at > org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:139) > at > org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:153) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.loadLastPartialChunkChecksum(FsVolumeImpl.java:1140) > at > org.apache.hadoop.hdfs.server.datanode.FinalizedReplica.loadLastPartialChunkChecksum(FinalizedReplica.java:157) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.getPartialChunkChecksumForFinalized(BlockSender.java:451) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:266) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:446) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633) > 2019-08-06 18:16:39,152 INFO datanode.VolumeScanner: > VolumeScanner(/tmp/hadoop-sodonnell/dfs/data, > DS-7f103313-61ba-4d37-b63d-e8cf7d2ed5f7) exiting. 
> {code}
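A minimal illustration of the fix direction the report implies: treat a meta file shorter than the 7-byte header, or one whose checksum-type id falls outside the valid range, as corruption instead of silently returning a NULL checksum. This is a hedged sketch, not the BlockSender code; the 7-byte layout (2-byte version, 1-byte checksum type, 4-byte bytesPerChecksum) and the [0, 5) id range follow the description and stack trace above, while the method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class MetaHeaderCheck {
    // Meta header: 2-byte version + 1-byte checksum type
    // + 4-byte bytesPerChecksum = 7 bytes total.
    static final int HEADER_LEN = 7;

    /** Parse a meta-file header, rejecting a truncated or invalid header
     *  instead of falling back to a "null" (no-op) checksum. */
    static int readChecksumType(byte[] metaFile) throws IOException {
        if (metaFile.length < HEADER_LEN) {
            throw new IOException("corrupt meta file: only "
                    + metaFile.length + " bytes, need " + HEADER_LEN);
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(metaFile));
        short version = in.readShort();     // layout version (unused here)
        int type = in.readUnsignedByte();   // checksum type id
        if (type >= 5) {                    // valid ids are in [0, 5)
            throw new IOException("corrupt meta header: checksum type id=" + type);
        }
        return type;
    }

    public static void main(String[] args) throws IOException {
        byte[] ok = {0, 1, 2, 0, 0, 2, 0};        // version 1, type id 2, bpc 512
        System.out.println(readChecksumType(ok));  // prints 2
        try {
            readChecksumType(new byte[]{0, 1, 2}); // truncated: only 3 bytes
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The key point is that both failure modes throw, so the corruption is surfaced to the caller (and eventually repaired) rather than being silently ignored.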
[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky
[ https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909222#comment-16909222 ] Chen Zhang commented on HDFS-14654: --- [~elgoiri] this failure seems unrelated to this patch; it also fails on another Jira: https://issues.apache.org/jira/browse/HDFS-14728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=16909046#comment-16909046 The test passes locally, so we may have found another flaky test; I'll try to track it in a separate Jira. Thanks. > RBF: TestRouterRpc tests are flaky > -- > > Key: HDFS-14654 > URL: https://issues.apache.org/jira/browse/HDFS-14654 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, > HDFS-14654.003.patch, HDFS-14654.004.patch, error.log > > > They sometimes pass and sometimes fail.
[jira] [Created] (HDFS-14742) RBF:TestRouterFaultTolerant tests are flaky
Chen Zhang created HDFS-14742: - Summary: RBF:TestRouterFaultTolerant tests are flaky Key: HDFS-14742 URL: https://issues.apache.org/jira/browse/HDFS-14742 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Chen Zhang [https://builds.apache.org/job/PreCommit-HDFS-Build/27516/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt] {code:java} [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.665 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant [ERROR] testWriteWithFailedSubcluster(org.apache.hadoop.hdfs.server.federation.router.TestRouterFaultTolerant) Time elapsed: 3.516 s <<< FAILURE! java.lang.AssertionError: Failed to run "Full tests": org.apache.hadoop.ipc.RemoteException(java.io.IOException): Cannot find locations for /HASH_ALL-failsubcluster, because the default nameservice is disabled to read or write at org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.lookupLocation(MountTableResolver.java:425) at org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:391) at org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver$1.call(MountTableResolver.java:388) at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4876) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3528) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2277) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044) at com.google.common.cache.LocalCache.get(LocalCache.java:3952) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871) at org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver.getDestinationForPath(MountTableResolver.java:394) at 
org.apache.hadoop.hdfs.server.federation.resolver.MultipleDestinationMountTableResolver.getDestinationForPath(MultipleDestinationMountTableResolver.java:87) at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getLocationsForPath(RouterRpcServer.java:1498) at org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getListing(RouterClientProtocol.java:734) at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getListing(RouterRpcServer.java:827) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:732) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1001) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:929) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1891) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2921) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553) at org.apache.hadoop.ipc.Client.call(Client.java:1499) at org.apache.hadoop.ipc.Client.call(Client.java:1396) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy35.getListing(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:678) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905724#comment-16905724 ] Chen Zhang commented on HDFS-14609: --- [~crh] Sure, I'll work on trunk to fix these tests. I was just wondering why both Eric and Takanobu saw these tests fail after reverting or switching to branch HDFS-13891, so I did some digging. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > > We worked on router-based federation security as part of HDFS-13532. We kept > it compatible with the way the namenode works. However, with HADOOP-16314 and > HDFS-16354 in trunk, the auth filters seem to have changed, causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests.
[jira] [Commented] (HDFS-14728) RBF:GetDatanodeReport causes a large GC pressure on the NameNodes
[ https://issues.apache.org/jira/browse/HDFS-14728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906049#comment-16906049 ] Chen Zhang commented on HDFS-14728: --- NamenodeBeanMetrics already implements a dnCache to cache the DataNode report from the NameNode; I think we can reuse it for the GetDatanodeReport RPC. > RBF:GetDatanodeReport causes a large GC pressure on the NameNodes > - > > Key: HDFS-14728 > URL: https://issues.apache.org/jira/browse/HDFS-14728 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Reporter: xuzq >Priority: Major > > When a cluster contains millions of DNs, *GetDatanodeReport* is pretty > expensive, and it causes large GC pressure on the NameNode. > When multiple NSs share those millions of DNs via federation and the router listens > to all the NSs, the problem becomes more serious: > all the NSs will GC at the same time. > RBF should cache the datanode report information and have an option to > disable the cache.
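The caching idea in the comment above boils down to a small time-bounded cache: serve the previous datanode report while it is fresh, and only issue the expensive NameNode RPC when the entry expires. The sketch below is illustrative only; the TTL value and the `Supplier` loader stand in for whatever the real NamenodeBeanMetrics dnCache does.

```java
import java.util.List;
import java.util.function.Supplier;

/** Time-bounded cache for an expensive call such as getDatanodeReport. */
public class ReportCache<T> {
    private final Supplier<T> loader;   // e.g. the real NN RPC
    private final long ttlMillis;       // how long a report stays fresh
    private T cached;
    private long loadedAt = Long.MIN_VALUE;

    public ReportCache(Supplier<T> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    public synchronized T get() {
        long now = System.currentTimeMillis();
        if (cached == null || now - loadedAt > ttlMillis) {
            cached = loader.get();      // only this path touches the NN
            loadedAt = now;
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] rpcCalls = {0};
        ReportCache<List<String>> cache = new ReportCache<>(() -> {
            rpcCalls[0]++;              // count real RPCs
            return List.of("dn-1", "dn-2");
        }, 10_000);
        cache.get();
        cache.get();                    // served from cache, no second RPC
        System.out.println(rpcCalls[0]); // prints 1
    }
}
```

With a router-side cache like this, many concurrent GetDatanodeReport callers collapse into one NN call per TTL window, which is exactly what relieves the simultaneous-GC pressure described in the issue; the "option to disable the cache" amounts to a TTL of zero.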
[jira] [Updated] (HDFS-14654) RBF: TestRouterRpc tests are flaky
[ https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14654: -- Attachment: HDFS-14654.002.patch > RBF: TestRouterRpc tests are flaky > -- > > Key: HDFS-14654 > URL: https://issues.apache.org/jira/browse/HDFS-14654 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, error.log > > > They sometimes pass and sometimes fail. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14711) RBF: RBFMetrics throws NullPointerException if stateStore disabled
[ https://issues.apache.org/jira/browse/HDFS-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904681#comment-16904681 ] Chen Zhang commented on HDFS-14711: --- Hi [~ayushtkn], I agree with you, we need to add some null checks. So this Jira looks like a duplicate of HDFS-14656. > RBF: RBFMetrics throws NullPointerException if stateStore disabled > -- > > Key: HDFS-14711 > URL: https://issues.apache.org/jira/browse/HDFS-14711 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14711.001.patch > > > In the current implementation, if {{stateStore}} fails to initialize, we only log an > error message, but RBFMetrics can't actually work normally in that state. > {code:java} > 2019-08-08 22:43:58,024 [qtp812446698-28] ERROR jmx.JMXJsonServlet > (JMXJsonServlet.java:writeAttribute(345)) - getting attribute FilesTotal of > Hadoop:service=NameNode,name=FSNamesystem-2 threw an exception > javax.management.RuntimeMBeanException: java.lang.NullPointerException > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852) > at > com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651) > at > com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678) > at > org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338) > at org.apache.hadoop.jmx.JMXJsonServlet.listBeans(JMXJsonServlet.java:316) > at org.apache.hadoop.jmx.JMXJsonServlet.doGet(JMXJsonServlet.java:210) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644) > at > org.apache.hadoop.security.authentication.server.ProxyUserAuthenticationFilter.doFilter(ProxyUserAuthenticationFilter.java:104) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592) > at org.apache.hadoop.hdfs.web.AuthFilter.doFilter(AuthFilter.java:51) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:539) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) > at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at >
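The null-check idea discussed in the comment above amounts to guarding each metrics getter against the possibly-uninitialized store and returning a neutral default instead of throwing. A hedged sketch of that pattern, with field and method names that are illustrative rather than the actual RBFMetrics API:

```java
/** Metrics bean that tolerates a disabled or uninitialized state store. */
public class SafeMetrics {
    /** Hypothetical stand-in for the router state store. */
    interface StateStore { long getFilesTotal(); }

    private final StateStore stateStore;   // may be null if disabled

    SafeMetrics(StateStore stateStore) {
        this.stateStore = stateStore;
    }

    /** Guarded getter: a default value instead of a NullPointerException
     *  surfacing through the JMX servlet as in the log above. */
    public long getFilesTotal() {
        if (stateStore == null) {
            return 0;   // store disabled; report a neutral default
        }
        return stateStore.getFilesTotal();
    }

    public static void main(String[] args) {
        SafeMetrics disabled = new SafeMetrics(null);
        System.out.println(disabled.getFilesTotal()); // prints 0, no NPE
        SafeMetrics enabled = new SafeMetrics(() -> 42L);
        System.out.println(enabled.getFilesTotal());  // prints 42
    }
}
```

An alternative design, also consistent with the discussion, is to refuse to register the MBean at all when the store is disabled; either way the JMX page stops throwing.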
[jira] [Commented] (HDFS-14654) RBF: TestRouterRpc tests are flaky
[ https://issues.apache.org/jira/browse/HDFS-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904680#comment-16904680 ] Chen Zhang commented on HDFS-14654: --- Thanks [~elgoiri] for your review. I've added some explanatory javadocs in patch v2; I'm not sure whether it's enough, please help review again. > RBF: TestRouterRpc tests are flaky > -- > > Key: HDFS-14654 > URL: https://issues.apache.org/jira/browse/HDFS-14654 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14654.001.patch, HDFS-14654.002.patch, error.log > > > They sometimes pass and sometimes fail.
[jira] [Updated] (HDFS-14714) RBF: implement getReplicatedBlockStats interface
[ https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14714: -- Status: Patch Available (was: Open) > RBF: implement getReplicatedBlockStats interface > > > Key: HDFS-14714 > URL: https://issues.apache.org/jira/browse/HDFS-14714 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14714.001.patch > > > It's not implemented now; we sometimes need this interface for cluster monitoring. > {code:java} > // current implementation > public ReplicatedBlockStats getReplicatedBlockStats() throws IOException { > rpcServer.checkOperation(NameNode.OperationCategory.READ, false); > return null; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14714) RBF: implement getReplicatedBlockStats interface
[ https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910546#comment-16910546 ] Chen Zhang commented on HDFS-14714: --- Thanks [~ayushtkn] for your suggestion, uploaded the patch v1 > RBF: implement getReplicatedBlockStats interface > > > Key: HDFS-14714 > URL: https://issues.apache.org/jira/browse/HDFS-14714 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14714.001.patch > > > It's not implemented now, we sometime need this interface for cluster monitor > {code:java} > // current implementation > public ReplicatedBlockStats getReplicatedBlockStats() throws IOException { > rpcServer.checkOperation(NameNode.OperationCategory.READ, false); > return null; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14714) RBF: implement getReplicatedBlockStats interface
[ https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14714: -- Attachment: HDFS-14714.001.patch > RBF: implement getReplicatedBlockStats interface > > > Key: HDFS-14714 > URL: https://issues.apache.org/jira/browse/HDFS-14714 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14714.001.patch > > > It's not implemented now, we sometime need this interface for cluster monitor > {code:java} > // current implementation > public ReplicatedBlockStats getReplicatedBlockStats() throws IOException { > rpcServer.checkOperation(NameNode.OperationCategory.READ, false); > return null; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14752) backport HDFS-13709 to branch-2
[ https://issues.apache.org/jira/browse/HDFS-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14752: -- Status: Patch Available (was: Open) uploaded the patch v1 > backport HDFS-13709 to branch-2 > --- > > Key: HDFS-14752 > URL: https://issues.apache.org/jira/browse/HDFS-14752 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14752.branch-2.001.patch > > > backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO > exception) to branch-2 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910906#comment-16910906 ] Chen Zhang commented on HDFS-13709: --- Thanks [~jojochuang] for reviewing this patch and merging it. I'll provide a branch-2 patch later. By the way, I have a few questions about this: # In which cases do we need to backport a patch to branch-2? Usually bugfixes and some critical improvements? # Some people open a new Jira for the branch-2 backport, while others upload a new patch in the same Jira; which is the better practice? > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, > HDFS-13709.004.patch, HDFS-13709.005.patch, HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > disk bad track may cause data loss. > For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs > on A's replica data, and someday B and C crushed at the same time, NN will > try to replicate data from A but failed, this block is corrupt now but no one > knows, because NN think there is at least 1 healthy replica and it keep > trying to replicate it. > When reading a replica which have data on bad track, OS will return an EIO > error, if DN reports the bad block as soon as it got an EIO, we can find > this case ASAP and try to avoid data loss -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
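The failure mode in the issue description can be sketched independently of the DataNode code: when a replica read fails with an I/O error, report the block immediately instead of only failing the transfer. This is a toy model under assumed names (transferBlock and BadBlockReporter are illustrative; the real change lives in the DataNode's block-transfer path and its bad-block reporting RPC to the NameNode):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy model of the HDFS-13709 idea: an I/O error (e.g. EIO from a bad disk
// track) during a block transfer triggers an immediate bad-block report.
public class EioReportSketch {
    interface BadBlockReporter { void report(String blockId); }

    static void transferBlock(String blockId, BadBlockReporter reporter) {
        try {
            // Simulate the OS surfacing EIO while reading replica data.
            throw new IOException("Input/output error");
        } catch (IOException e) {
            reporter.report(blockId); // surface the corrupt replica ASAP
        }
    }

    public static void main(String[] args) {
        List<String> reported = new ArrayList<>();
        transferBlock("blk_1001", reported::add);
        System.out.println(reported);
    }
}
```

The point of the pattern is that the report happens on the first EIO, before the other replicas have a chance to be lost.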
[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception
[ https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911008#comment-16911008 ] Chen Zhang commented on HDFS-13709: --- Created a new Jira HDFS-14752 to track the branch-2 backport > Report bad block to NN when transfer block encounter EIO exception > -- > > Key: HDFS-13709 > URL: https://issues.apache.org/jira/browse/HDFS-13709 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, > HDFS-13709.004.patch, HDFS-13709.005.patch, HDFS-13709.patch > > > In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes > disk bad track may cause data loss. > For example, there are 3 replicas on 3 machines A/B/C, if a bad track occurs > on A's replica data, and someday B and C crushed at the same time, NN will > try to replicate data from A but failed, this block is corrupt now but no one > knows, because NN think there is at least 1 healthy replica and it keep > trying to replicate it. > When reading a replica which have data on bad track, OS will return an EIO > error, if DN reports the bad block as soon as it got an EIO, we can find > this case ASAP and try to avoid data loss -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14752) backport HDFS-13709 to branch-2
[ https://issues.apache.org/jira/browse/HDFS-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14752: -- Attachment: HDFS-14752.branch-2.001.patch > backport HDFS-13709 to branch-2 > --- > > Key: HDFS-14752 > URL: https://issues.apache.org/jira/browse/HDFS-14752 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14752.branch-2.001.patch > > > backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO > exception) to branch-2 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14735) File could only be replicated to 0 nodes instead of minReplication (=1)
[ https://issues.apache.org/jira/browse/HDFS-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910897#comment-16910897 ] Chen Zhang commented on HDFS-14735: --- Sorry for the delayed response, [~talexey]. You can grep for the keyword "is not chosen" in the NameNode's log; it will tell you why the nodes can't be allocated. > File could only be replicated to 0 nodes instead of minReplication (=1) > --- > > Key: HDFS-14735 > URL: https://issues.apache.org/jira/browse/HDFS-14735 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Tatyana Alexeyev >Priority: Major > > Hello I have intermitent error when running my EMR Hadoop Cluster: > "Error: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File > /user/sphdadm/_sqoop/00501bd7b05e4182b5006b9d51 > bafb7f_f405b2f3/_temporary/1/_temporary/attempt_1565136887564_20057_m_00_0/part-m-0.snappy > could only be replicated to 0 nodes instead of minReplication (=1). There > are 5 datanode(s) running and no node(s) are excluded in this operation." > I am running Hadoop version > sphdadm@ip-10-6-15-108 hadoop]$ hadoop version > Hadoop 2.8.5-amzn-4 > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
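The grep suggestion can be reproduced in a few lines. The sample log lines below are an assumed format (on a real cluster the "is not chosen" messages come from the block placement policy's debug logging, and the text after the phrase gives the concrete rejection reason):

```java
import java.util.List;

// Filter log lines for block-placement rejection reasons. The two sample
// lines are illustrative assumptions standing in for a NameNode log file.
public class GrepNotChosen {
    public static void main(String[] args) {
        List<String> log = List.of(
            "INFO  BlockStateChange: chooseTarget: placing 1 replica",
            "DEBUG BlockPlacementPolicy: Datanode 10.0.0.1:50010 is not chosen since the node is too busy.");
        log.stream()
           .filter(line -> line.contains("is not chosen"))
           .forEach(System.out::println);
    }
}
```

On a real cluster the equivalent is to grep the NameNode log file directly for the same phrase.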
[jira] [Created] (HDFS-14752) backport HDFS-13709 to branch-2
Chen Zhang created HDFS-14752: - Summary: backport HDFS-13709 to branch-2 Key: HDFS-14752 URL: https://issues.apache.org/jira/browse/HDFS-14752 Project: Hadoop HDFS Issue Type: Bug Reporter: Chen Zhang backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO exception) to branch-2 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-14752) backport HDFS-13709 to branch-2
[ https://issues.apache.org/jira/browse/HDFS-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang reassigned HDFS-14752: - Assignee: Chen Zhang > backport HDFS-13709 to branch-2 > --- > > Key: HDFS-14752 > URL: https://issues.apache.org/jira/browse/HDFS-14752 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > > backport HDFS-13709 (Report bad block to NN when transfer block encounter EIO > exception) to branch-2 -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910922#comment-16910922 ] Chen Zhang commented on HDFS-14609: --- Thanks [~crh] for your comments. {quote}It's not clear why hadoop-hdfs changes are needed in this context. {quote} HDFS-14434 ignores the user.name query parameter in secure WebHDFS, but the PseudoAuthenticationHandler can still leverage this parameter to pass the Kerberos authentication, as you mentioned before: {quote} we may need to modify the test to inject an appropriate no auth filter and bypass auth to maintain the rationale behind the test {quote} If we want to bypass the Kerberos authentication, we have to use the user.name parameter, and now the only way to do that is to send the request through a URL directly, instead of through {{WebHdfsFileSystem}}. So we have to do some work to process the request and response ourselves. I wanted to reuse the logic in {{WebHdfsFileSystem}}, but many of its interfaces can't be accessed outside the package, so I had to expose them through {{WebHdfsTestUtil}}; that's why we need to modify the hadoop-hdfs project. Do you think that makes sense? > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
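For illustration, sending the request "through a URL directly" amounts to building the WebHDFS REST URL by hand and carrying the user.name query parameter, which only bypasses authentication when a pseudo/no-auth filter is installed. A minimal sketch; the host, port, and user values are placeholder assumptions (50071 is the Router's default HTTP port), while GETDELEGATIONTOKEN is a standard WebHDFS operation:

```java
// Build a raw WebHDFS request URL carrying user.name. Placeholder values
// only; the actual test wires this to the Router's HTTP address.
public class WebHdfsUrlSketch {
    static String delegationTokenUrl(String host, int port, String user) {
        return String.format(
            "http://%s:%d/webhdfs/v1/?op=GETDELEGATIONTOKEN&user.name=%s",
            host, port, user);
    }

    public static void main(String[] args) {
        System.out.println(delegationTokenUrl("localhost", 50071, "router"));
    }
}
```

The test then opens this URL with a plain HTTP connection and parses the JSON response itself, which is the work that WebHdfsFileSystem normally hides.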
[jira] [Updated] (HDFS-14714) RBF: implement getReplicatedBlockStats interface
[ https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14714: -- Attachment: HDFS-14714.002.patch > RBF: implement getReplicatedBlockStats interface > > > Key: HDFS-14714 > URL: https://issues.apache.org/jira/browse/HDFS-14714 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14714.001.patch, HDFS-14714.002.patch > > > It's not implemented now, we sometime need this interface for cluster monitor > {code:java} > // current implementation > public ReplicatedBlockStats getReplicatedBlockStats() throws IOException { > rpcServer.checkOperation(NameNode.OperationCategory.READ, false); > return null; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14714) RBF: implement getReplicatedBlockStats interface
[ https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911018#comment-16911018 ] Chen Zhang commented on HDFS-14714: --- uploaded patch v2 to fix checkstyle and whitespace error > RBF: implement getReplicatedBlockStats interface > > > Key: HDFS-14714 > URL: https://issues.apache.org/jira/browse/HDFS-14714 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14714.001.patch, HDFS-14714.002.patch > > > It's not implemented now, we sometime need this interface for cluster monitor > {code:java} > // current implementation > public ReplicatedBlockStats getReplicatedBlockStats() throws IOException { > rpcServer.checkOperation(NameNode.OperationCategory.READ, false); > return null; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter
[ https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14730: -- Summary: Remove unused configuration dfs.web.authentication.filter (was: Deprecate configuration dfs.web.authentication.filter ) > Remove unused configuration dfs.web.authentication.filter > -- > > Key: HDFS-14730 > URL: https://issues.apache.org/jira/browse/HDFS-14730 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > > After HADOOP-16314, this configuration is not used any where, so I propose to > deprecate it to avoid misuse. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter
[ https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14730: -- Attachment: HDFS-14730.001.patch > Remove unused configuration dfs.web.authentication.filter > -- > > Key: HDFS-14730 > URL: https://issues.apache.org/jira/browse/HDFS-14730 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14730.001.patch > > > After HADOOP-16314, this configuration is not used any where, so I propose to > deprecate it to avoid misuse. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter
[ https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14730: -- Status: Patch Available (was: Open) uploaded patch v1 > Remove unused configuration dfs.web.authentication.filter > -- > > Key: HDFS-14730 > URL: https://issues.apache.org/jira/browse/HDFS-14730 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14730.001.patch > > > After HADOOP-16314, this configuration is not used any where, so I propose to > deprecate it to avoid misuse. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter
[ https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911027#comment-16911027 ] Chen Zhang commented on HDFS-14730: --- {{TestRouterHttpDelegationToken}} is still using this configuration, but it actually doesn't work. HDFS-14609 tries to fix the failing test {{TestRouterHttpDelegationToken}} without using this configuration any more, so this Jira should be committed after HDFS-14609. > Remove unused configuration dfs.web.authentication.filter > -- > > Key: HDFS-14730 > URL: https://issues.apache.org/jira/browse/HDFS-14730 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14730.001.patch > > > After HADOOP-16314, this configuration is not used any where, so I propose to > deprecate it to avoid misuse. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
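For reference, after HADOOP-16314 authentication filters are wired through the common initializer key rather than dfs.web.authentication.filter. A sketch of the replacement configuration (the initializer class shown is the one in hadoop-common; whether it fits a given deployment depends on the auth setup):

```xml
<!-- Replaces the removed dfs.web.authentication.filter wiring. -->
<property>
  <name>hadoop.http.filter.initializers</name>
  <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
```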
[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14609: -- Status: Patch Available (was: Open) > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14609: -- Status: Open (was: Patch Available) > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch, HDFS-14609.002.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14730) Remove unused configuration dfs.web.authentication.filter
[ https://issues.apache.org/jira/browse/HDFS-14730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14730: -- Status: Open (was: Patch Available) > Remove unused configuration dfs.web.authentication.filter > -- > > Key: HDFS-14730 > URL: https://issues.apache.org/jira/browse/HDFS-14730 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14730.001.patch > > > After HADOOP-16314, this configuration is not used any where, so I propose to > deprecate it to avoid misuse. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14714) RBF: implement getReplicatedBlockStats interface
[ https://issues.apache.org/jira/browse/HDFS-14714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911108#comment-16911108 ] Chen Zhang commented on HDFS-14714: --- Hi [~ayushtkn], can you help to review the patch? Thanks. > RBF: implement getReplicatedBlockStats interface > > > Key: HDFS-14714 > URL: https://issues.apache.org/jira/browse/HDFS-14714 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Chen Zhang >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14714.001.patch, HDFS-14714.002.patch > > > It's not implemented now, we sometime need this interface for cluster monitor > {code:java} > // current implementation > public ReplicatedBlockStats getReplicatedBlockStats() throws IOException { > rpcServer.checkOperation(NameNode.OperationCategory.READ, false); > return null; > } > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907486#comment-16907486 ] Chen Zhang commented on HDFS-14609: --- Thanks [~tasanuma] for providing the old revision of HDFS-13891, it's very helpful. I've fixed these 2 tests; here are some details: h3. TestRouterWithSecureStartup#testStartupWithoutSpnegoPrincipal HADOOP-16314 and HADOOP-16354 made some changes which break the test: # Added an AuthFilterInitializer, which uses {{hadoop.http.authentication.kerberos.*}} instead of {{dfs.web.authentication.kerberos.*}} to initialize Kerberos # {{hadoop.http.authentication.kerberos.principal}} has a default value, so even if we don't configure this key, the cluster will still start normally h3. TestRouterHttpDelegationToken # HDFS-14434 ignores the user.name query parameter in secure WebHDFS, and the initial version of this test leveraged this parameter to bypass the Kerberos authentication, so after HDFS-14434 it no longer works. I added a set of methods that send requests over an HTTP connection instead of {{WebHdfsFileSystem}} to keep it working. # HADOOP-16314 changed the configuration key of the authentication filter from {{dfs.web.authentication.filter}} to {{hadoop.http.filter.initializers}}, so I added a {{NoAuthFilterInitializer}} to initialize {{NoAuthFilter}} # For the case {{testGetDelegationToken()}}, the server address is set by WebHdfsFileSystem after it gets the response; the original address is the address of the RouterRpcServer. Since we now send requests over an HTTP connection directly, it's unnecessary to reset the address, so I removed this assert # For the case {{testCancelDelegationToken()}}, the {{InvalidToken}} exception is also generated by WebHdfsFileSystem and the logic is very complex; I think it's also unnecessary to keep this assert, so I use a 403 detection instead. 
In the trunk code, the config {{dfs.web.authentication.filter}} is not used anywhere; I propose to deprecate this config, and I'll track that in another Jira. > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14609) RBF: Security should use common AuthenticationFilter
[ https://issues.apache.org/jira/browse/HDFS-14609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-14609: -- Attachment: HDFS-14609.001.patch > RBF: Security should use common AuthenticationFilter > > > Key: HDFS-14609 > URL: https://issues.apache.org/jira/browse/HDFS-14609 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: CR Hota >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-14609.001.patch > > > We worked on router based federation security as part of HDFS-13532. We kept > it compatible with the way namenode works. However with HADOOP-16314 and > HDFS-16354 in trunk, auth filters seems to have been changed causing tests to > fail. > Changes are needed appropriately in RBF, mainly fixing broken tests. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org