[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously
[ https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003507#comment-17003507 ] Xiaoqiao He commented on HDFS-14997: Thanks [~ayushtkn] for your reminder. I just check it and run {{TestFileChecksum}} at local times but not reproduce. Would like to offer some more information about hprof report since Jenkins log have less information to identify the problem. Please help to figure out link or log if I missing some important info. Thanks again. > BPServiceActor processes commands from NameNode asynchronously > -- > > Key: HDFS-14997 > URL: https://issues.apache.org/jira/browse/HDFS-14997 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, > HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch > > > There are two core functions, report(#sendHeartbeat, #blockReport, > #cacheReport) and #processCommand in #BPServiceActor main process flow. If > processCommand cost long time it will block send report flow. Meanwhile > processCommand could cost long time(over 1000s the worst case I meet) when IO > load of DataNode is very high. Since some IO operations are under > #datasetLock, So it has to wait to acquire #datasetLock long time when > process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat > will not send to NameNode in-time, and trigger other disasters. > I propose to improve #processCommand asynchronously and not block > #BPServiceActor to send heartbeat back to NameNode when meet high IO load. > Notes: > 1. Lifeline could be one effective solution, however some old branches are > not support this feature. > 2. IO operations under #datasetLock is another issue, I think we should solve > it at another JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15003) RBF: Make Router support storage type quota.
[ https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003490#comment-17003490 ] Hadoop QA commented on HDFS-15003: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 30s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 22s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 52s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 23s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 13s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 18s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 6s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}100m 35s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 7m 34s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 40s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}176m 38s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestRedudantBlocks | | | hadoop.hdfs.TestFileChecksumCompositeCrc | | | hadoop.hdfs.server.namenode.ha.TestHAAppend | | | hadoop.hdfs.TestDecommissionWithStripedBackoffMonitor | | | hadoop.hdfs.TestDeadNodeDetection | | | hadoop.hdfs.TestDecommissionWithStriped | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | | | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | | hadoop.hdfs.qjournal.client.TestQuorumJournalManager | | | hadoop.hdfs.TestFileChecksum | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 | | JIRA Issue | HDFS-15003 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989475/HDFS-15003.007.patch | | Optional Tests | dupname
[jira] [Commented] (HDFS-15080) Fix the issue in reading persistent memory cache with an offset
[ https://issues.apache.org/jira/browse/HDFS-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003487#comment-17003487 ] Hadoop QA commented on HDFS-15080: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 29m 57s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} branch-3.1 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 52s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 15s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 57s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 26s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 35s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 30s{color} | {color:green} branch-3.1 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s{color} | {color:green} branch-3.1 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 7s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}100m 42s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}200m 10s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMXBean | | | hadoop.hdfs.TestLeaseRecovery2 | | | hadoop.hdfs.server.balancer.TestBalancerRPCDelay | | | hadoop.hdfs.server.diskbalancer.TestDiskBalancer | | | hadoop.hdfs.server.balancer.TestBalancer | | | hadoop.hdfs.server.namenode.ha.TestHAAppend | | | hadoop.hdfs.TestRollingUpgrade | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:70a0ef5d4a6 | | JIRA Issue | HDFS-15080 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989472/HDFS-15080-branch-3.1-000.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 1a586a201705 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | branch-3.1 / 83d676a | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_232 | | findbugs | v3.1.0-RC1 | | unit |
[jira] [Comment Edited] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously
[ https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003471#comment-17003471 ] Ayush Saxena edited comment on HDFS-14997 at 12/26/19 5:31 AM: --- Hi [~hexiaoqiao] [~elgoiri] Seems there are couple of tests failing due to heap issue. According to the hprof file, 60 percent of the memory is being occupied by the Datanode Object. Can you give a check, I suspect related. Ref :https://builds.apache.org/job/PreCommit-HDFS-Build/28549/testReport/org.apache.hadoop.hdfs/TestFileChecksum/testStripedFileChecksum3/ There was a similar failure in JENKINS result here too. was (Author: ayushtkn): Hi [~hexiaoqiao] [~elgoiri] Seems there are couple of tests failing due to heap issue. According to the hprof file, 60 percent of the memory is being occupied by the Datanode Object. Can you give a check, I suspect related. > BPServiceActor processes commands from NameNode asynchronously > -- > > Key: HDFS-14997 > URL: https://issues.apache.org/jira/browse/HDFS-14997 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, > HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch > > > There are two core functions, report(#sendHeartbeat, #blockReport, > #cacheReport) and #processCommand in #BPServiceActor main process flow. If > processCommand cost long time it will block send report flow. Meanwhile > processCommand could cost long time(over 1000s the worst case I meet) when IO > load of DataNode is very high. Since some IO operations are under > #datasetLock, So it has to wait to acquire #datasetLock long time when > process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat > will not send to NameNode in-time, and trigger other disasters. > I propose to improve #processCommand asynchronously and not block > #BPServiceActor to send heartbeat back to NameNode when meet high IO load. > Notes: > 1. Lifeline could be one effective solution, however some old branches are > not support this feature. > 2. IO operations under #datasetLock is another issue, I think we should solve > it at another JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously
[ https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003471#comment-17003471 ] Ayush Saxena commented on HDFS-14997: - Hi [~hexiaoqiao] [~elgoiri] Seems there are couple of tests failing due to heap issue. According to the hprof file, 60 percent of the memory is being occupied by the Datanode Object. Can you give a check, I suspect related. > BPServiceActor processes commands from NameNode asynchronously > -- > > Key: HDFS-14997 > URL: https://issues.apache.org/jira/browse/HDFS-14997 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, > HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch > > > There are two core functions, report(#sendHeartbeat, #blockReport, > #cacheReport) and #processCommand in #BPServiceActor main process flow. If > processCommand cost long time it will block send report flow. Meanwhile > processCommand could cost long time(over 1000s the worst case I meet) when IO > load of DataNode is very high. Since some IO operations are under > #datasetLock, So it has to wait to acquire #datasetLock long time when > process some of commands(such as #DNA_INVALIDATE). In such case, #heartbeat > will not send to NameNode in-time, and trigger other disasters. > I propose to improve #processCommand asynchronously and not block > #BPServiceActor to send heartbeat back to NameNode when meet high IO load. > Notes: > 1. Lifeline could be one effective solution, however some old branches are > not support this feature. > 2. IO operations under #datasetLock is another issue, I think we should solve > it at another JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15080) Fix the issue in reading persistent memory cache with an offset
[ https://issues.apache.org/jira/browse/HDFS-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feilong He updated HDFS-15080: -- Attachment: HDFS-15080-branch-3.2-000.patch > Fix the issue in reading persistent memory cache with an offset > --- > > Key: HDFS-15080 > URL: https://issues.apache.org/jira/browse/HDFS-15080 > Project: Hadoop HDFS > Issue Type: Bug > Components: caching, datanode >Reporter: Feilong He >Assignee: Feilong He >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-15080-000.patch, HDFS-15080-branch-3.1-000.patch, > HDFS-15080-branch-3.2-000.patch > > > Some applications can read a segment of pmem cache with an offset specified. > The previous implementation for pmem cache read with DirectByteBuffer didn't > cover this situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15003) RBF: Make Router support storage type quota.
[ https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003455#comment-17003455 ] Jinglun commented on HDFS-15003: Hi [~ayushtkn], thanks your comments ! Add description to HDFSCommands.md, upload v07. > RBF: Make Router support storage type quota. > > > Key: HDFS-15003 > URL: https://issues.apache.org/jira/browse/HDFS-15003 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15003.001.patch, HDFS-15003.002.patch, > HDFS-15003.003.patch, HDFS-15003.004.patch, HDFS-15003.005.patch, > HDFS-15003.006.patch, HDFS-15003.007.patch > > > Make Router support storage type quota. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15003) RBF: Make Router support storage type quota.
[ https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinglun updated HDFS-15003: --- Attachment: HDFS-15003.007.patch > RBF: Make Router support storage type quota. > > > Key: HDFS-15003 > URL: https://issues.apache.org/jira/browse/HDFS-15003 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Major > Attachments: HDFS-15003.001.patch, HDFS-15003.002.patch, > HDFS-15003.003.patch, HDFS-15003.004.patch, HDFS-15003.005.patch, > HDFS-15003.006.patch, HDFS-15003.007.patch > > > Make Router support storage type quota. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor
[ https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003453#comment-17003453 ] Xiaoqiao He commented on HDFS-15075: Jenkins not trigger here, any issues? > Remove process command timing from BPServiceActor > - > > Key: HDFS-15075 > URL: https://issues.apache.org/jira/browse/HDFS-15075 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Íñigo Goiri >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch > > > HDFS-14997 moved the command processing into async. > Right now, we are checking the time to add to a queue. > We should remove this one and maybe move the timing within the thread. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15080) Fix the issue in reading persistent memory cache with an offset
[ https://issues.apache.org/jira/browse/HDFS-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feilong He updated HDFS-15080: -- Attachment: HDFS-15080-branch-3.1-000.patch > Fix the issue in reading persistent memory cache with an offset > --- > > Key: HDFS-15080 > URL: https://issues.apache.org/jira/browse/HDFS-15080 > Project: Hadoop HDFS > Issue Type: Bug > Components: caching, datanode >Reporter: Feilong He >Assignee: Feilong He >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-15080-000.patch, HDFS-15080-branch-3.1-000.patch > > > Some applications can read a segment of pmem cache with an offset specified. > The previous implementation for pmem cache read with DirectByteBuffer didn't > cover this situation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003430#comment-17003430 ] Fei Hui commented on HDFS-15079: [~ayushtkn] {quote} The namenode Logic that you tend to add, that kind of logic is there in Namenode in form of RetryCache, It checks whether the call isn't a repeated one due to failover, if so, it doesn't execute it again rather sends the old response from the cache. {quote} Great. CallId is the client id i tend to add. CallId and clientId maybe can resolve the problem. For NN clientId is from client, but callId is from routerclient. Try to dig in more too. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.
[ https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003370#comment-17003370 ] Hadoop QA commented on HDFS-15074: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 57s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 30s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 39s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 29s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}103m 43s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}166m 2s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestFileChecksum | | | hadoop.hdfs.TestDeadNodeDetection | | | hadoop.hdfs.TestDatanodeRegistration | | | hadoop.hdfs.server.namenode.TestRedudantBlocks | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | | | hadoop.hdfs.TestFileChecksumCompositeCrc | | | hadoop.hdfs.TestRollingUpgrade | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 | | JIRA Issue | HDFS-15074 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12989464/HDFS-15074.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 8d59a8e92c65 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 300505c | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_232 | | findbugs | v3.1.0-RC1 | | unit |
[jira] [Commented] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.
[ https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003354#comment-17003354 ] hemanthboyina commented on HDFS-15074: -- thanks for the review [~surendrasingh] . updated the patch , please review . > DataNode.DataTransfer thread should catch all the expception and log it. > > > Key: HDFS-15074 > URL: https://issues.apache.org/jira/browse/HDFS-15074 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15074.001.patch, HDFS-15074.002.patch > > > Some time If this thread is throwing exception other than IOException, will > not be able to trash it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.
[ https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-15074: - Attachment: HDFS-15074.002.patch > DataNode.DataTransfer thread should catch all the expception and log it. > > > Key: HDFS-15074 > URL: https://issues.apache.org/jira/browse/HDFS-15074 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15074.001.patch, HDFS-15074.002.patch > > > Some time If this thread is throwing exception other than IOException, will > not be able to trash it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003325#comment-17003325 ] Xiaoqiao He commented on HDFS-15079: Thanks [~ayushtkn] for your detailed comments. RetryCache may be one solution, however we have to pass clientid and callid to router (it could need to improve protocol, I am not very sure, will check it later) and reset these two parameter then forward this rpc request to namenode. IIRC, we have discussed this solution for long time about data locality but no conclusion up to now. Another side, if we could do that, I am not sure if there are some security issue such as some one using fake clientid + callid and it could pollute next other rpc requests. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
[ https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003321#comment-17003321 ] Xiaoqiao He commented on HDFS-15051: v005 try to fix the above review comments. > RBF: Propose to revoke WRITE MountTableEntry privilege to super user only > - > > Key: HDFS-15051 > URL: https://issues.apache.org/jira/browse/HDFS-15051 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, > HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch > > > The current permission checker of #MountTableStoreImpl is not very restrict. > In some case, any user could add/update/remove MountTableEntry without the > expected permission checking. > The following code segment try to check permission when operate > MountTableEntry, however mountTable object is from Client/RouterAdmin > {{MountTable mountTable = request.getEntry();}}, and user could pass any mode > which could bypass the permission checker. > {code:java} > public void checkPermission(MountTable mountTable, FsAction access) > throws AccessControlException { > if (isSuperUser()) { > return; > } > FsPermission mode = mountTable.getMode(); > if (getUser().equals(mountTable.getOwnerName()) > && mode.getUserAction().implies(access)) { > return; > } > if (isMemberOfGroup(mountTable.getGroupName()) > && mode.getGroupAction().implies(access)) { > return; > } > if (!getUser().equals(mountTable.getOwnerName()) > && !isMemberOfGroup(mountTable.getGroupName()) > && mode.getOtherAction().implies(access)) { > return; > } > throw new AccessControlException( > "Permission denied while accessing mount table " > + mountTable.getSourcePath() > + ": user " + getUser() + " does not have " + access.toString() > + " permissions."); > } > {code} > I just propose revoke WRITE MountTableEntry privilege to super user only. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
[ https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He updated HDFS-15051: --- Attachment: HDFS-15051.005.patch > RBF: Propose to revoke WRITE MountTableEntry privilege to super user only > - > > Key: HDFS-15051 > URL: https://issues.apache.org/jira/browse/HDFS-15051 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, > HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch > > > The current permission checker of #MountTableStoreImpl is not very restrict. > In some case, any user could add/update/remove MountTableEntry without the > expected permission checking. > The following code segment try to check permission when operate > MountTableEntry, however mountTable object is from Client/RouterAdmin > {{MountTable mountTable = request.getEntry();}}, and user could pass any mode > which could bypass the permission checker. > {code:java} > public void checkPermission(MountTable mountTable, FsAction access) > throws AccessControlException { > if (isSuperUser()) { > return; > } > FsPermission mode = mountTable.getMode(); > if (getUser().equals(mountTable.getOwnerName()) > && mode.getUserAction().implies(access)) { > return; > } > if (isMemberOfGroup(mountTable.getGroupName()) > && mode.getGroupAction().implies(access)) { > return; > } > if (!getUser().equals(mountTable.getOwnerName()) > && !isMemberOfGroup(mountTable.getGroupName()) > && mode.getOtherAction().implies(access)) { > return; > } > throw new AccessControlException( > "Permission denied while accessing mount table " > + mountTable.getSourcePath() > + ": user " + getUser() + " does not have " + access.toString() > + " permissions."); > } > {code} > I just propose revoke WRITE MountTableEntry privilege to super user only. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor
[ https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003317#comment-17003317 ] Xiaoqiao He commented on HDFS-15075: Thanks [~elgoiri] for your reviews. v002 try to fix comments above, a. add a config key to instead of constant 2000ms, b. add metrics for some methods who have io operation in #datasetLock of #FsDatasetImpl. c. improve command process monitor metric to mutableRate. Please take another review if have times. Thanks. > Remove process command timing from BPServiceActor > - > > Key: HDFS-15075 > URL: https://issues.apache.org/jira/browse/HDFS-15075 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Íñigo Goiri >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch > > > HDFS-14997 moved the command processing into async. > Right now, we are checking the time to add to a queue. > We should remove this one and maybe move the timing within the thread. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.
[ https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003296#comment-17003296 ] Surendra Singh Lilhore commented on HDFS-15074: --- [~hemanthboyina], minor comment. Just log the exception like this {code:java} LOG.error("Failed to transfer block " + b, t); {code} > DataNode.DataTransfer thread should catch all the expception and log it. > > > Key: HDFS-15074 > URL: https://issues.apache.org/jira/browse/HDFS-15074 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15074.001.patch > > > Some time If this thread is throwing exception other than IOException, will > not be able to trash it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor
[ https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He updated HDFS-15075: --- Attachment: HDFS-15075.002.patch > Remove process command timing from BPServiceActor > - > > Key: HDFS-15075 > URL: https://issues.apache.org/jira/browse/HDFS-15075 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Íñigo Goiri >Assignee: Xiaoqiao He >Priority: Major > Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch > > > HDFS-14997 moved the command processing into async. > Right now, we are checking the time to add to a queue. > We should remove this one and maybe move the timing within the thread. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003288#comment-17003288 ] Ayush Saxena commented on HDFS-15079: - Thanx [~ferhui] for the UT. The namenode Logic that you tend to add, that kind of logic is there in Namenode in form of RetryCache, It checks whether the call isn't a repeated one due to failover, if so, it doesn't execute it again rather sends the old response from the cache. You can check {{ClientProtocol.java}} there is an annotation above Create method, you can read its description. That RetryCache logic doesn't seems to be working here, I guess since for it the clientId and callId should be same, but here, since the call is going from different router, not from the same client. So the NN doesn't consider it as a repeated call. Will try to dig in more, but on a quick look, I feel this is the problem. I guess we have a JIRA too as Pass Client CallerContext to NN. May be this issue is also because of that. Not Sure. [~elgoiri] [~hexiaoqiao] Can you also give a check once? > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003264#comment-17003264 ] Fei Hui commented on HDFS-15078: [~elgoiri] {quote} Can we try to do it as an exception handling instead of proactively checking? {quote} Sorry, didn't catch it. Before checking it, everything looks fine. Could you please give some ideas? [~ayushtkn] {quote} Router is supposed to just receive the call, and if it has received a valid call, it should in any case send to namenode. {quote} If connection between router and client is closed, result could not send to client. So maybe sending or not to namennode both are reasonable... Because the call failed for client. > RBF: Should check connection channel before sending rpc to namenode > --- > > Key: HDFS-15078 > URL: https://issues.apache.org/jira/browse/HDFS-15078 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch > > > dfsrouter logs show that > {quote} > 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from > 10.83.164.11:56908 Call#2 Retry#0: output error > 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 125 on caught an exception > java.nio.channels.ClosedChannelException > at > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461) > at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731) > at org.apache.hadoop.ipc.Server.access$2100(Server.java:134) > at > org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089) > at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161) > at > org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109) > at > org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229) > at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245) > {quote} > Maybe checking connection between client and router is better before > sendingrpc to namenode -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233 ] Fei Hui edited comment on HDFS-15079 at 12/25/19 11:40 AM: --- [~ayushtkn][~elgoiri]Upload an overwrite UT, similar to HDFS-15078 was (Author: ferhui): [~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078 > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233 ] Fei Hui commented on HDFS-15079: [~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078 > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fei Hui updated HDFS-15079: --- Attachment: UnexpectedOverWriteUT.patch > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > Attachments: UnexpectedOverWriteUT.patch > > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14993) checkDiskError doesn't work during datanode startup
[ https://issues.apache.org/jira/browse/HDFS-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003180#comment-17003180 ] Hadoop QA commented on HDFS-14993: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 42s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 31s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 28s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}110m 23s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 44s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}172m 40s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestFileChecksum | | | hadoop.hdfs.server.namenode.TestRedudantBlocks | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | | | hadoop.hdfs.qjournal.server.TestJournalNodeSync | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 | | JIRA Issue | HDFS-14993 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12987813/HDFS-14993.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 1a94f599ce14 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / df622cf | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_232 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28568/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28568/testReport/ | | Max. process+thread count | 3527 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | |
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003174#comment-17003174 ] Íñigo Goiri commented on HDFS-15078: Can we try to do it as an exception handling instead of proactively checking? > RBF: Should check connection channel before sending rpc to namenode > --- > > Key: HDFS-15078 > URL: https://issues.apache.org/jira/browse/HDFS-15078 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch > > > dfsrouter logs show that > {quote} > 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from > 10.83.164.11:56908 Call#2 Retry#0: output error > 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 125 on caught an exception > java.nio.channels.ClosedChannelException > at > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461) > at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731) > at org.apache.hadoop.ipc.Server.access$2100(Server.java:134) > at > org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089) > at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161) > at > org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109) > at > org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229) > at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245) > {quote} > Maybe checking connection between client and router is better before > sendingrpc to namenode -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153 ] Fei Hui edited comment on HDFS-15079 at 12/25/19 9:09 AM: -- [~elgoiri] HDFS-15078 has a test case, it's one case for this. [~hexiaoqiao] Client gets Exception, but the exception is not that router throws. client logs as follow {quote} java.io.EOFException: End of File Exception between local host is: "xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:1507) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy19.create(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy20.create(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135) at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228) at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008) {quote} I think maybe consistency is not guaranteed if do not resolve it on nn side. was (Author: ferhui): [~elgoiri] HDFS-15078 has a test case, it's one case for this. [~hexiaoqiao] Client gets Exception, but the exception is not that router throws. client logs as follow {quote} java.io.EOFException: End of File Exception between local host is: "xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; For more details see:
[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode
[ https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003154#comment-17003154 ] Ayush Saxena commented on HDFS-15078: - Thanx [~ferhui] for the details. TBH I have hard feeling for this fix, and I don't consider this as an Improvement too. This finds attention as a bug only for the scenario you told, otherwise I don't think there should be anything like router receiving the call and not sending to Namenode. Router is supposed to just receive the call, and if it has received a valid call, it should in any case send to namenode. For a client he is sending the request to the NN itself, Call vanishing in between at Router doesn't make sense to me. I would rather like to fix this problem as whole what HDFS-15079 tends to do, rather than just handling one possibility which can minimize the effect. Moreover, having checks at router for every call is also an added overhead for normal calls. We have lately faced perfomance issues recently too regarding calls taking non-trivial amount of time at the Router itself. Anyway, I am not blocking this in anyway, If others are Ok with this, I pose no objections. :) > RBF: Should check connection channel before sending rpc to namenode > --- > > Key: HDFS-15078 > URL: https://issues.apache.org/jira/browse/HDFS-15078 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch > > > dfsrouter logs show that > {quote} > 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from > 10.83.164.11:56908 Call#2 Retry#0: output error > 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 125 on caught an exception > java.nio.channels.ClosedChannelException > at > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461) > at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731) > at org.apache.hadoop.ipc.Server.access$2100(Server.java:134) > at > org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089) > at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161) > at > org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109) > at > org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229) > at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245) > {quote} > Maybe checking connection between client and router is better before > sendingrpc to namenode -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153 ] Fei Hui commented on HDFS-15079: [~elgoiri] HDFS-15078 has a test case, it's one case for this. [~hexiaoqiao] Client gets Exception, but the exception is not that router throws. client logs as follow {quote} java.io.EOFException: End of File Exception between local host is: "xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765) at org.apache.hadoop.ipc.Client.call(Client.java:1507) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy19.create(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy20.create(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872) at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135) at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228) at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008) {quote} I think maybe consistency is not guaranteed if resolve it on nn side. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003148#comment-17003148 ] Ayush Saxena commented on HDFS-15079: - Thanx [~ferhui] for the report, The issue is at Router side, and the proposed solution gets changes both at NN and client side, I don't think the proposed solution would get much of agreement, So, I didn't dig in much about this solution. I need to check how actually this is happening, in general case, if it was a namenode-client scenario, I guess the client tries 10 times({{ipc.client.connect.max.retries}}) and then failover to the next. May be this isn't happening? I think I am missing some context. It would be good you can put up a UT to repro this. May be then it could be easy for us to find a solution. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003139#comment-17003139 ] Xiaoqiao He commented on HDFS-15079: Move this JIRA to subtask of HDFS-14603. Thanks [~ferhui] for involves. {quote}Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and failovers to r1. {quote} I am interested in which and where exception throws? a. namenode(throw exception) -> router -> client? b. router(throw exception) -> client? In my experience we try to scheduler client request to the original router if it meet some retry exceptions unless this router set to offline. it may avoid this case. To be honest, this is not very graceful solution. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly
[ https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003138#comment-17003138 ] Íñigo Goiri commented on HDFS-15079: The solution might be a little involved in the NN side. Can we start with a test showing the issue? I'd like to go through the debugger. > RBF: Client maybe get an unexpected result with network anomaly > > > Key: HDFS-15079 > URL: https://issues.apache.org/jira/browse/HDFS-15079 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Affects Versions: 3.3.0 >Reporter: Fei Hui >Priority: Critical > > I find there is a critical problem on RBF, HDFS-15078 can resolve it on some > Scenarios, but i have no idea about the overall resolution. > The problem is that > Client with RBF(r0, r1) create a file HDFS file via r0, it gets Exception and > failovers to r1 > r0 has been send create rpc to namenode(1st create) > Client create a HDFS file via r1(2nd create) > Client writes the HDFS file and close it finally(3rd close) > Maybe namenode receiving the rpc in order as follow > 2nd create > 3rd close > 1st create > And overwrite is true by default, this would make the file had been written > an empty file. This is an critical problem > We had encountered this problem. There are many hive and spark jobs running > on our cluster, sometimes it occurs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org