[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003507#comment-17003507
 ] 

Xiaoqiao He commented on HDFS-14997:


Thanks [~ayushtkn] for your reminder. I just checked and ran 
{{TestFileChecksum}} locally several times but could not reproduce the failure. 
Could you offer some more information about the hprof report, since the Jenkins 
log has too little information to identify the problem? Please point me to the 
link or log if I am missing some important info. Thanks again.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. 
> If processCommand takes a long time, it blocks the report flow. And 
> processCommand can take a long time (over 1000s in the worst case I have 
> met) when the IO load of the DataNode is very high: since some IO operations 
> run under #datasetLock, processing some commands (such as #DNA_INVALIDATE) 
> has to wait a long time to acquire #datasetLock. In such a case, the 
> #heartbeat is not sent to the NameNode in time, which triggers other 
> disasters.
> I propose to make #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO 
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve that in another JIRA.
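
For illustration, here is a minimal sketch of the queue-and-worker pattern the 
description proposes: the actor thread only enqueues commands and returns 
immediately, while a dedicated daemon thread drains the queue and performs the 
slow, possibly #datasetLock-bound processing. Class and method names are 
illustrative, not taken from the actual HDFS-14997 patch.

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch only: decouple command processing from the
// heartbeat/report loop with a queue and a worker thread.
class CommandProcessingThread extends Thread {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

  CommandProcessingThread() {
    setDaemon(true);
    setName("Command processor");
  }

  // Called from the actor loop; returns immediately instead of blocking
  // on datasetLock-bound IO.
  void enqueue(Runnable command) throws InterruptedException {
    queue.put(command);
  }

  @Override
  public void run() {
    try {
      while (!isInterrupted()) {
        queue.take().run(); // the slow IO happens here, off the report path
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}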



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15003) RBF: Make Router support storage type quota.

2019-12-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003490#comment-17003490
 ] 

Hadoop QA commented on HDFS-15003:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
30s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 23s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
13s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 18s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
6s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}100m 35s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  7m 
34s{color} | {color:green} hadoop-hdfs-rbf in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
40s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}176m 38s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestRedudantBlocks |
|   | hadoop.hdfs.TestFileChecksumCompositeCrc |
|   | hadoop.hdfs.server.namenode.ha.TestHAAppend |
|   | hadoop.hdfs.TestDecommissionWithStripedBackoffMonitor |
|   | hadoop.hdfs.TestDeadNodeDetection |
|   | hadoop.hdfs.TestDecommissionWithStriped |
|   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.qjournal.client.TestQuorumJournalManager |
|   | hadoop.hdfs.TestFileChecksum |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15003 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12989475/HDFS-15003.007.patch |
| Optional Tests |  dupname  

[jira] [Commented] (HDFS-15080) Fix the issue in reading persistent memory cache with an offset

2019-12-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003487#comment-17003487
 ] 

Hadoop QA commented on HDFS-15080:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 29m 
57s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-3.1 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
52s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
15s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
57s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
26s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 35s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
30s{color} | {color:green} branch-3.1 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} branch-3.1 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  7s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}100m 42s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}200m 10s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMXBean |
|   | hadoop.hdfs.TestLeaseRecovery2 |
|   | hadoop.hdfs.server.balancer.TestBalancerRPCDelay |
|   | hadoop.hdfs.server.diskbalancer.TestDiskBalancer |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.namenode.ha.TestHAAppend |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:70a0ef5d4a6 |
| JIRA Issue | HDFS-15080 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12989472/HDFS-15080-branch-3.1-000.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 1a586a201705 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.1 / 83d676a |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | 

[jira] [Comment Edited] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003471#comment-17003471
 ] 

Ayush Saxena edited comment on HDFS-14997 at 12/26/19 5:31 AM:
---

Hi [~hexiaoqiao] [~elgoiri]

It seems a couple of tests are failing due to a heap issue. According to the 
hprof file, 60 percent of the memory is occupied by the DataNode object. Can 
you take a look? I suspect it is related.

Ref: 
https://builds.apache.org/job/PreCommit-HDFS-Build/28549/testReport/org.apache.hadoop.hdfs/TestFileChecksum/testStripedFileChecksum3/

There was a similar failure in the Jenkins result here too.


was (Author: ayushtkn):
Hi [~hexiaoqiao] [~elgoiri]

It seems a couple of tests are failing due to a heap issue. According to the 
hprof file, 60 percent of the memory is occupied by the DataNode object. Can 
you take a look? I suspect it is related.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. 
> If processCommand takes a long time, it blocks the report flow. And 
> processCommand can take a long time (over 1000s in the worst case I have 
> met) when the IO load of the DataNode is very high: since some IO operations 
> run under #datasetLock, processing some commands (such as #DNA_INVALIDATE) 
> has to wait a long time to acquire #datasetLock. In such a case, the 
> #heartbeat is not sent to the NameNode in time, which triggers other 
> disasters.
> I propose to make #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO 
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve that in another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2019-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003471#comment-17003471
 ] 

Ayush Saxena commented on HDFS-14997:
-

Hi [~hexiaoqiao] [~elgoiri]

It seems a couple of tests are failing due to a heap issue. According to the 
hprof file, 60 percent of the memory is occupied by the DataNode object. Can 
you take a look? I suspect it is related.

> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997.001.patch, HDFS-14997.002.patch, 
> HDFS-14997.003.patch, HDFS-14997.004.patch, HDFS-14997.005.patch
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. 
> If processCommand takes a long time, it blocks the report flow. And 
> processCommand can take a long time (over 1000s in the worst case I have 
> met) when the IO load of the DataNode is very high: since some IO operations 
> run under #datasetLock, processing some commands (such as #DNA_INVALIDATE) 
> has to wait a long time to acquire #datasetLock. In such a case, the 
> #heartbeat is not sent to the NameNode in time, which triggers other 
> disasters.
> I propose to make #processCommand asynchronous so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO 
> load.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are another issue; I think we should 
> solve that in another JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15080) Fix the issue in reading persistent memory cache with an offset

2019-12-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-15080:
--
Attachment: HDFS-15080-branch-3.2-000.patch

> Fix the issue in reading persistent memory cache with an offset
> ---
>
> Key: HDFS-15080
> URL: https://issues.apache.org/jira/browse/HDFS-15080
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: caching, datanode
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-15080-000.patch, HDFS-15080-branch-3.1-000.patch, 
> HDFS-15080-branch-3.2-000.patch
>
>
> Some applications can read a segment of pmem cache with an offset specified. 
> The previous implementation for pmem cache read with DirectByteBuffer didn't 
> cover this situation.
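
As a hedged illustration of the offset-read idea (not the actual patch), the 
usual NIO approach is to take an independent view of the shared cache buffer 
before positioning it, so concurrent readers do not disturb each other's 
position and limit. The class and method names here are assumptions for the 
example; only the ByteBuffer calls are standard.

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch, assuming the pmem cache is exposed as a ByteBuffer
// (e.g. a DirectByteBuffer over the mapped region).
class PmemOffsetReadSketch {
  static ByteBuffer sliceAt(ByteBuffer cacheBuffer, int offset, int length) {
    ByteBuffer view = cacheBuffer.duplicate(); // shares content, not position
    view.position(offset);                     // start of requested segment
    view.limit(offset + length);               // end of requested segment
    return view.slice();                       // independent buffer [0, length)
  }
}
{code}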



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15003) RBF: Make Router support storage type quota.

2019-12-25 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003455#comment-17003455
 ] 

Jinglun commented on HDFS-15003:


Hi [~ayushtkn], thanks for your comments! I added the description to 
HDFSCommands.md and uploaded v07.

> RBF: Make Router support storage type quota.
> 
>
> Key: HDFS-15003
> URL: https://issues.apache.org/jira/browse/HDFS-15003
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15003.001.patch, HDFS-15003.002.patch, 
> HDFS-15003.003.patch, HDFS-15003.004.patch, HDFS-15003.005.patch, 
> HDFS-15003.006.patch, HDFS-15003.007.patch
>
>
> Make Router support storage type quota.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15003) RBF: Make Router support storage type quota.

2019-12-25 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15003:
---
Attachment: HDFS-15003.007.patch

> RBF: Make Router support storage type quota.
> 
>
> Key: HDFS-15003
> URL: https://issues.apache.org/jira/browse/HDFS-15003
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15003.001.patch, HDFS-15003.002.patch, 
> HDFS-15003.003.patch, HDFS-15003.004.patch, HDFS-15003.005.patch, 
> HDFS-15003.006.patch, HDFS-15003.007.patch
>
>
> Make Router support storage type quota.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2019-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003453#comment-17003453
 ] 

Xiaoqiao He commented on HDFS-15075:


Jenkins did not trigger here; any issues?

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch
>
>
> HDFS-14997 moved the command processing into async.
> Right now, we are checking the time to add to a queue.
> We should remove this one and maybe move the timing within the thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15080) Fix the issue in reading persistent memory cache with an offset

2019-12-25 Thread Feilong He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feilong He updated HDFS-15080:
--
Attachment: HDFS-15080-branch-3.1-000.patch

> Fix the issue in reading persistent memory cache with an offset
> ---
>
> Key: HDFS-15080
> URL: https://issues.apache.org/jira/browse/HDFS-15080
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: caching, datanode
>Reporter: Feilong He
>Assignee: Feilong He
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-15080-000.patch, HDFS-15080-branch-3.1-000.patch
>
>
> Some applications can read a segment of pmem cache with an offset specified. 
> The previous implementation for pmem cache read with DirectByteBuffer didn't 
> cover this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003430#comment-17003430
 ] 

Fei Hui commented on HDFS-15079:


[~ayushtkn]
{quote}
The NameNode logic that you intend to add already exists in the NameNode in 
the form of the RetryCache. It checks whether the call is a repeated one due 
to failover; if so, it does not execute it again but instead sends the old 
response from the cache.
{quote}
Great. The callId is the client id I intended to add. CallId and clientId 
together may be able to resolve the problem.
For the NN, the clientId comes from the client, but the callId comes from the 
router client. I will try to dig in more too.

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
> I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, 
> and fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem: there are many Hive and Spark jobs running 
> on our cluster, and it occurs sometimes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.

2019-12-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003370#comment-17003370
 ] 

Hadoop QA commented on HDFS-15074:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
57s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 39s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 29s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}103m 43s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}166m  2s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestFileChecksum |
|   | hadoop.hdfs.TestDeadNodeDetection |
|   | hadoop.hdfs.TestDatanodeRegistration |
|   | hadoop.hdfs.server.namenode.TestRedudantBlocks |
|   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.TestFileChecksumCompositeCrc |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15074 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12989464/HDFS-15074.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 8d59a8e92c65 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 300505c |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | 

[jira] [Commented] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.

2019-12-25 Thread hemanthboyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003354#comment-17003354
 ] 

hemanthboyina commented on HDFS-15074:
--

Thanks for the review, [~surendrasingh].
I updated the patch; please review.

> DataNode.DataTransfer thread should catch all the expception and log it.
> 
>
> Key: HDFS-15074
> URL: https://issues.apache.org/jira/browse/HDFS-15074
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15074.001.patch, HDFS-15074.002.patch
>
>
> Sometimes, if this thread throws an exception other than IOException, we 
> will not be able to catch and log it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.

2019-12-25 Thread hemanthboyina (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemanthboyina updated HDFS-15074:
-
Attachment: HDFS-15074.002.patch

> DataNode.DataTransfer thread should catch all the expception and log it.
> 
>
> Key: HDFS-15074
> URL: https://issues.apache.org/jira/browse/HDFS-15074
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15074.001.patch, HDFS-15074.002.patch
>
>
> Sometimes, if this thread throws an exception other than IOException, we 
> will not be able to catch and log it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003325#comment-17003325
 ] 

Xiaoqiao He commented on HDFS-15079:


Thanks [~ayushtkn] for your detailed comments. The RetryCache may be one 
solution; however, we would have to pass the clientid and callid to the router 
(this may require a protocol improvement, I am not very sure and will check 
later), reset these two parameters, and then forward the RPC request to the 
namenode. IIRC, we have discussed this kind of solution for a long time in the 
context of data locality, with no conclusion up to now. On the other side, 
even if we could do that, I am not sure whether there are security issues, 
such as someone using a fake clientid + callid that could pollute other 
subsequent RPC requests.

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
> I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, 
> and fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem: there are many Hive and Spark jobs running 
> on our cluster, and it occurs sometimes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003321#comment-17003321
 ] 

Xiaoqiao He commented on HDFS-15051:


v005 tries to fix the above review comments.

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permissions when operating on a 
> MountTableEntry; however, the mountTable object comes from Client/RouterAdmin 
> ({{MountTable mountTable = request.getEntry();}}), so a user could pass any 
> mode and thereby bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to restrict the WRITE MountTableEntry privilege to the super user 
> only.
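
For illustration only, a minimal sketch of what the proposed restriction could 
look like, reusing the names from the snippet above; the exact enforcement 
point and surrounding class are assumptions, not taken from the actual patch.

{code:java}
// Hedged sketch: reject any WRITE on the mount table unless the caller is
// the super user, regardless of the client-supplied mode bits.
public void checkPermission(MountTable mountTable, FsAction access)
    throws AccessControlException {
  if (access.implies(FsAction.WRITE) && !isSuperUser()) {
    throw new AccessControlException(
        "Permission denied while accessing mount table "
        + mountTable.getSourcePath()
        + ": only the super user may modify mount table entries.");
  }
}
{code}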



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15051) RBF: Propose to revoke WRITE MountTableEntry privilege to super user only

2019-12-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15051:
---
Attachment: HDFS-15051.005.patch

> RBF: Propose to revoke WRITE MountTableEntry privilege to super user only
> -
>
> Key: HDFS-15051
> URL: https://issues.apache.org/jira/browse/HDFS-15051
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15051.001.patch, HDFS-15051.002.patch, 
> HDFS-15051.003.patch, HDFS-15051.004.patch, HDFS-15051.005.patch
>
>
> The current permission checker of #MountTableStoreImpl is not very strict. In 
> some cases, any user could add/update/remove a MountTableEntry without the 
> expected permission checking.
> The following code segment tries to check permissions when operating on a 
> MountTableEntry; however, the mountTable object comes from Client/RouterAdmin 
> ({{MountTable mountTable = request.getEntry();}}), so a user could pass any 
> mode and thereby bypass the permission checker.
> {code:java}
>   public void checkPermission(MountTable mountTable, FsAction access)
>   throws AccessControlException {
> if (isSuperUser()) {
>   return;
> }
> FsPermission mode = mountTable.getMode();
> if (getUser().equals(mountTable.getOwnerName())
> && mode.getUserAction().implies(access)) {
>   return;
> }
> if (isMemberOfGroup(mountTable.getGroupName())
> && mode.getGroupAction().implies(access)) {
>   return;
> }
> if (!getUser().equals(mountTable.getOwnerName())
> && !isMemberOfGroup(mountTable.getGroupName())
> && mode.getOtherAction().implies(access)) {
>   return;
> }
> throw new AccessControlException(
> "Permission denied while accessing mount table "
> + mountTable.getSourcePath()
> + ": user " + getUser() + " does not have " + access.toString()
> + " permissions.");
>   }
> {code}
> I propose to restrict the WRITE MountTableEntry privilege to the super user 
> only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15075) Remove process command timing from BPServiceActor

2019-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003317#comment-17003317
 ] 

Xiaoqiao He commented on HDFS-15075:


Thanks [~elgoiri] for your reviews.
v002 tries to fix the comments above:
a. add a config key instead of the constant 2000ms,
b. add metrics for some methods that perform IO operations under #datasetLock 
in #FsDatasetImpl,
c. change the command process monitor metric to a mutableRate (a sketch of 
this pattern follows below).
Please take another look if you have time. Thanks.
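
As a hedged illustration of item c (the class, field, and method names are 
assumed for the example, not taken from the patch), a metrics2 MutableRate 
aggregates per-command processing times into a running count and average:

{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableRate;
import org.apache.hadoop.util.Time;

// Illustrative sketch of timing command processing with a MutableRate.
@Metrics(name = "CommandProcessingSketch", context = "dfs")
class CommandProcessingSketch {
  @Metric("Time spent processing commands from the NameNode")
  MutableRate processCommandTime;

  void processWithTiming(Runnable command) {
    long start = Time.monotonicNow();
    try {
      command.run();
    } finally {
      // Each sample feeds the rate's running count and average.
      processCommandTime.add(Time.monotonicNow() - start);
    }
  }
}
{code}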

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch
>
>
> HDFS-14997 moved the command processing into async.
> Right now, we are checking the time to add to a queue.
> We should remove this one and maybe move the timing within the thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15074) DataNode.DataTransfer thread should catch all the expception and log it.

2019-12-25 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003296#comment-17003296
 ] 

Surendra Singh Lilhore commented on HDFS-15074:
---

[~hemanthboyina], one minor comment.

Just log the exception like this:
{code:java}
LOG.error("Failed to transfer block " + b, t); {code}
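
For context, a hedged sketch of how such a widened catch might sit in the 
DataTransfer thread's {{run()}} method. {{DataTransferSketch}} and 
{{transferBlock()}} are placeholders standing in for DataNode.DataTransfer and 
its transfer logic; only the logging line comes from the suggestion above.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only, not the actual HDFS-15074 patch: catch Throwable
// so that errors other than IOException are still logged before the thread
// dies.
class DataTransferSketch implements Runnable {
  private static final Logger LOG =
      LoggerFactory.getLogger(DataTransferSketch.class);
  private final String b; // identifies the block being transferred

  DataTransferSketch(String block) {
    this.b = block;
  }

  @Override
  public void run() {
    try {
      transferBlock(); // placeholder for the real transfer logic
    } catch (Throwable t) {
      LOG.error("Failed to transfer block " + b, t);
    }
  }

  private void transferBlock() { /* the actual IO would go here */ }
}
{code}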

> DataNode.DataTransfer thread should catch all the expception and log it.
> 
>
> Key: HDFS-15074
> URL: https://issues.apache.org/jira/browse/HDFS-15074
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15074.001.patch
>
>
> Sometimes, if this thread throws an exception other than IOException, we 
> will not be able to catch and log it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15075) Remove process command timing from BPServiceActor

2019-12-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15075:
---
Attachment: HDFS-15075.002.patch

> Remove process command timing from BPServiceActor
> -
>
> Key: HDFS-15075
> URL: https://issues.apache.org/jira/browse/HDFS-15075
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: Xiaoqiao He
>Priority: Major
> Attachments: HDFS-15075.001.patch, HDFS-15075.002.patch
>
>
> HDFS-14997 moved the command processing into async.
> Right now, we are checking the time to add to a queue.
> We should remove this one and maybe move the timing within the thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003288#comment-17003288
 ] 

Ayush Saxena commented on HDFS-15079:
-

Thanx [~ferhui] for the UT.
The NameNode logic that you intend to add already exists in the NameNode in 
the form of the RetryCache. It checks whether the call is a repeated one due 
to failover; if so, it does not execute it again but instead sends the old 
response from the cache. You can check {{ClientProtocol.java}}: there is an 
annotation above the create method, and you can read its description.
That RetryCache logic does not seem to be working here. I guess that is 
because the clientId and callId need to be the same, but here the call goes 
through a different router rather than from the same client, so the NN does 
not consider it a repeated call.
I will try to dig in more, but on a quick look, I feel this is the problem. I 
guess we also have a JIRA for passing the client CallerContext to the NN; 
maybe this issue is also because of that. Not sure.
[~elgoiri] [~hexiaoqiao] Can you also take a look?
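
To make the RetryCache behaviour concrete, here is a simplified, illustrative 
sketch of a response cache keyed by (clientId, callId). It is not the actual 
NameNode implementation (which also handles expiry and in-progress calls); it 
only shows why a retry arriving under a different (clientId, callId) pair, 
e.g. via a different router, misses the cache and is executed again.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Simplified retry-cache sketch: repeated calls with the same key return the
// cached response instead of re-executing the operation.
class RetryCacheSketch<R> {
  private final Map<String, R> cache = new ConcurrentHashMap<>();

  R invoke(String clientId, int callId, Supplier<R> operation) {
    String key = clientId + ":" + callId;
    // A retry that reuses the same (clientId, callId) gets the old response;
    // a call re-issued under a different identity is treated as new work.
    return cache.computeIfAbsent(key, k -> operation.get());
  }
}
{code}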


> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
> I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, 
> and fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem: there are many Hive and Spark jobs running 
> on our cluster, and it occurs sometimes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003264#comment-17003264
 ] 

Fei Hui commented on HDFS-15078:


[~elgoiri]
{quote}
Can we try to do it as an exception handling instead of proactively checking?
{quote}
Sorry, I didn't catch that. Before we check the connection, everything looks 
fine, so there is no exception to handle yet. Could you please share some 
ideas?

[~ayushtkn]
{quote}
The router is supposed to just receive the call, and if it has received a 
valid call, it should in any case send it to the namenode.
{quote}
If the connection between the router and the client is closed, the result 
cannot be sent to the client. So sending to the namenode or not may both be 
reasonable... because the call has already failed from the client's point of 
view.

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe it is better to check the connection between the client and the router 
> before sending the RPC to the namenode
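
Purely as an illustration of the proactive check being proposed; the 
router-side types and where the client channel would be obtained are 
assumptions for the example, not Hadoop's actual ipc.Server API.

{code:java}
import java.io.IOException;
import java.nio.channels.SocketChannel;

// Hypothetical sketch: verify the client channel is still usable before
// forwarding a possibly state-changing RPC to the namenode.
class ChannelCheckSketch {
  void forward(SocketChannel clientChannel, Runnable namenodeRpc)
      throws IOException {
    if (!clientChannel.isOpen() || !clientChannel.isConnected()) {
      // The client has gone away; its response could never be delivered,
      // so skip the downstream call instead of failing on the write back.
      throw new IOException("Client channel closed before forwarding");
    }
    namenodeRpc.run();
  }
}
{code}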



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233
 ] 

Fei Hui edited comment on HDFS-15079 at 12/25/19 11:40 AM:
---

[~ayushtkn] [~elgoiri] Uploaded an overwrite UT, similar to HDFS-15078



was (Author: ferhui):
[~ayushtkn][~elgoiri]Upload a overwrite UT, similar to HDFS-15078


> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
> I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, 
> and fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem: there are many Hive and Spark jobs running 
> on our cluster, and it occurs sometimes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003233#comment-17003233
 ] 

Fei Hui commented on HDFS-15079:


[~ayushtkn] [~elgoiri] Uploaded an overwrite UT, similar to HDFS-15078


> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
> I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, 
> and fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem: there are many Hive and Spark jobs running 
> on our cluster, and it occurs sometimes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-15079:
---
Attachment: UnexpectedOverWriteUT.patch

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
> Attachments: UnexpectedOverWriteUT.patch
>
>
> I found a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about an overall resolution.
> The problem is as follows:
> A client with RBF (r0, r1) creates an HDFS file via r0, gets an exception, 
> and fails over to r1
> r0 has already sent the create RPC to the namenode (1st create)
> The client creates the HDFS file via r1 (2nd create)
> The client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the RPCs in the following order:
> 2nd create
> 3rd close
> 1st create
> Since overwrite is true by default, this turns a file that has already been 
> written into an empty file. This is a critical problem.
> We have encountered this problem: there are many Hive and Spark jobs running 
> on our cluster, and it occurs sometimes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14993) checkDiskError doesn't work during datanode startup

2019-12-25 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003180#comment-17003180
 ] 

Hadoop QA commented on HDFS-14993:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
42s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 31s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 28s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
14s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}110m 23s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
44s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}172m 40s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestFileChecksum |
|   | hadoop.hdfs.server.namenode.TestRedudantBlocks |
|   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.qjournal.server.TestJournalNodeSync |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.5 Server=19.03.5 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-14993 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12987813/HDFS-14993.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 1a94f599ce14 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / df622cf |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28568/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28568/testReport/ |
| Max. process+thread count | 3527 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |

[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003174#comment-17003174
 ] 

Íñigo Goiri commented on HDFS-15078:


Can we try to do this via exception handling instead of proactively checking? 
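For illustration, the two styles in generic java.nio terms; a minimal, hedged 
sketch only (a raw {{SocketChannel}}, not the actual Server/Router response 
path, and the method names are hypothetical):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;

public class ResponseWriteStyles {

  // Proactive style: check the channel before the write. This puts a
  // check on the hot path of every call and is still racy, because the
  // client can disconnect between the check and the write.
  static void writeWithCheck(SocketChannel ch, ByteBuffer buf)
      throws IOException {
    if (!ch.isOpen()) {
      return; // client already gone, drop the response
    }
    ch.write(buf);
  }

  // Reactive style: just attempt the write and handle the failure.
  // No extra cost in the common case where the client is still there.
  static void writeWithHandling(SocketChannel ch, ByteBuffer buf)
      throws IOException {
    try {
      ch.write(buf);
    } catch (ClosedChannelException e) {
      // client already gone, drop the response instead of failing the call
    }
  }
}
{code}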

> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between the client and the router before 
> sending the rpc to the namenode is better



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153
 ] 

Fei Hui edited comment on HDFS-15079 at 12/25/19 9:09 AM:
--

[~elgoiri]
HDFS-15078 has a test case; it's one instance of this problem.

[~hexiaoqiao]
Client gets an Exception, but the exception is not the one the router throws. 
Client logs follow:
{quote}
java.io.EOFException: End of File Exception between local host is: 
"xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; 
For more details see:  http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
at org.apache.hadoop.ipc.Client.call(Client.java:1507)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.create(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy20.create(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at 
org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228)
at 
org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008)
{quote}

I think consistency may not be guaranteed if we do not resolve it on the NN side.
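To make the overwrite hazard concrete, here is a minimal sketch against the 
public {{FileSystem}} API (the path and setup are illustrative only):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileAlreadyExistsException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwriteRaceDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/overwrite-demo");

    // create(Path) defaults to overwrite=true, so a delayed "1st create"
    // replayed after the 3rd close truncates the data already written.
    fs.create(p).close();

    // With overwrite=false the late create fails loudly instead of
    // silently emptying the file.
    try {
      fs.create(p, false).close();
    } catch (FileAlreadyExistsException e) {
      System.out.println("late create rejected: " + e.getMessage());
    }
  }
}
{code}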



[jira] [Commented] (HDFS-15078) RBF: Should check connection channel before sending rpc to namenode

2019-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003154#comment-17003154
 ] 

Ayush Saxena commented on HDFS-15078:
-

Thanx [~ferhui] for the details. 
TBH I have hard feelings about this fix, and I don't consider it an Improvement 
either. It gets attention as a bug only in the scenario you described; otherwise 
there should be no such thing as the router receiving a call and not sending it 
to the Namenode. The Router is supposed to just receive the call, and if it has 
received a valid call, it should in any case forward it to the namenode. From 
the client's perspective, it is sending the request to the NN itself; a call 
vanishing in between at the Router doesn't make sense to me.

I would rather fix this problem as a whole, which is what HDFS-15079 intends to 
do, than handle just one possibility that minimizes the effect. Moreover, having 
checks at the router for every call is an added overhead for normal calls. We 
have recently faced performance issues regarding calls taking a non-trivial 
amount of time at the Router itself.

Anyway, I am not blocking this in any way. If others are OK with this, I pose no 
objections. :)


> RBF: Should check connection channel before sending rpc to namenode
> ---
>
> Key: HDFS-15078
> URL: https://issues.apache.org/jira/browse/HDFS-15078
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-15078.001.patch, HDFS-15078.002.patch
>
>
> dfsrouter logs show that
> {quote}
> 2019-12-20 04:11:26,724 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 6400 on , call org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.83.164.11:56908 Call#2 Retry#0: output error
> 2019-12-20 04:11:26,724 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 125 on  caught an exception
> java.nio.channels.ClosedChannelException
> at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
> at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2731)
> at org.apache.hadoop.ipc.Server.access$2100(Server.java:134)
> at 
> org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1089)
> at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1161)
> at 
> org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2109)
> at 
> org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1229)
> at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:631)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2245)
> {quote}
> Maybe checking the connection between the client and the router before 
> sending the rpc to the namenode is better



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003153#comment-17003153
 ] 

Fei Hui commented on HDFS-15079:


[~elgoiri]
HDFS-15078 has a test case; it's one instance of this problem.

[~hexiaoqiao]
Client gets an Exception, but the exception is not the one the router throws. 
Client logs follow:
{quote}
java.io.EOFException: End of File Exception between local host is: 
"xx.xx.xx.xx"; destination host is: "xx.xx.xx.xx":; : java.io.EOFException; 
For more details see:  http://wiki.apache.org/hadoop/EOFException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
at org.apache.hadoop.ipc.Client.call(Client.java:1507)
at org.apache.hadoop.ipc.Client.call(Client.java:1441)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy19.create(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:303)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:253)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy20.create(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:264)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1727)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1662)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:503)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:499)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:514)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:442)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:979)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:872)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at 
org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:228)
at 
org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:122)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at 
org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1008)
{quote}

I think consistency may not be guaranteed if we do not resolve it on the NN side.

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a 

[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003148#comment-17003148
 ] 

Ayush Saxena commented on HDFS-15079:
-

Thanx [~ferhui] for the report. The issue is on the Router side, but the 
proposed solution requires changes on both the NN and the client side; I don't 
think it would get much agreement, so I didn't dig much into it.

I need to check how this is actually happening. In the general case, if it were 
a namenode-client scenario, I guess the client tries 10 times 
({{ipc.client.connect.max.retries}}) and then fails over to the next. Maybe this 
isn't happening? I think I am missing some context. It would be good if you 
could put up a UT to repro this; maybe then it would be easier for us to find a 
solution.
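As a quick way to confirm which retry setting is in play, a trivial sketch that 
reads the key from the client {{Configuration}} (10 is the core-default.xml 
default):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class IpcRetryConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Number of times the IPC client retries a connection to a server
    // before giving up (and, behind a failover proxy, moving on).
    int retries = conf.getInt("ipc.client.connect.max.retries", 10);
    System.out.println("ipc.client.connect.max.retries = " + retries);
  }
}
{code}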

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about the overall resolution.
> The problem is:
> a client with RBF(r0, r1) creates an HDFS file via r0, gets an Exception, and 
> fails over to r1
> r0 has already sent the create rpc to the namenode (1st create)
> the client creates the HDFS file via r1 (2nd create)
> the client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the rpcs in the following order:
> 2nd create
> 3rd close
> 1st create
> And since overwrite is true by default, this would turn the file that had 
> been written into an empty file. This is a critical problem.
> We had encountered this problem; there are many Hive and Spark jobs running 
> on our cluster, and sometimes it occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003139#comment-17003139
 ] 

Xiaoqiao He commented on HDFS-15079:


Moving this JIRA to a subtask of HDFS-14603.

Thanks [~ferhui] for getting involved.
{quote}A client with RBF(r0, r1) creates an HDFS file via r0, gets an Exception, 
and fails over to r1.
{quote}
I am interested in which exception is thrown, and where:
a. namenode(throws exception) -> router -> client?
b. router(throws exception) -> client?
In my experience, we try to schedule the client request back to the original 
router when it meets retryable exceptions, unless that router is set to offline; 
this may avoid the case here. To be honest, this is not a very graceful solution.
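A rough sketch of that scheduling idea, with entirely hypothetical names (this 
is not an existing Hadoop API):

{code:java}
import java.util.List;

// Sticky-router selection: on a retryable failure, stay on the original
// router unless it has been marked offline, so an in-flight create
// cannot race a retry routed through a different router.
public class StickyRouterSelector {
  private final List<String> routers;
  private int current = 0;

  StickyRouterSelector(List<String> routers) {
    this.routers = routers;
  }

  String next(boolean lastFailureRetryable, boolean currentOffline) {
    if (lastFailureRetryable && !currentOffline) {
      return routers.get(current); // retry the same router
    }
    current = (current + 1) % routers.size(); // fail over only when offline
    return routers.get(current);
  }
}
{code}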

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about the overall resolution.
> The problem is:
> a client with RBF(r0, r1) creates an HDFS file via r0, gets an Exception, and 
> fails over to r1
> r0 has already sent the create rpc to the namenode (1st create)
> the client creates the HDFS file via r1 (2nd create)
> the client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the rpcs in the following order:
> 2nd create
> 3rd close
> 1st create
> And since overwrite is true by default, this would turn the file that had 
> been written into an empty file. This is a critical problem.
> We had encountered this problem; there are many Hive and Spark jobs running 
> on our cluster, and sometimes it occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15079) RBF: Client maybe get an unexpected result with network anomaly

2019-12-25 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003138#comment-17003138
 ] 

Íñigo Goiri commented on HDFS-15079:


The solution might be a little involved on the NN side.
Can we start with a test showing the issue? 
I'd like to go through the debugger. 

> RBF: Client maybe get an unexpected result with network anomaly 
> 
>
> Key: HDFS-15079
> URL: https://issues.apache.org/jira/browse/HDFS-15079
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Priority: Critical
>
>  I find there is a critical problem on RBF. HDFS-15078 can resolve it in some 
> scenarios, but I have no idea about the overall resolution.
> The problem is:
> a client with RBF(r0, r1) creates an HDFS file via r0, gets an Exception, and 
> fails over to r1
> r0 has already sent the create rpc to the namenode (1st create)
> the client creates the HDFS file via r1 (2nd create)
> the client writes the HDFS file and finally closes it (3rd close)
> The namenode may receive the rpcs in the following order:
> 2nd create
> 3rd close
> 1st create
> And since overwrite is true by default, this would turn the file that had 
> been written into an empty file. This is a critical problem.
> We had encountered this problem; there are many Hive and Spark jobs running 
> on our cluster, and sometimes it occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org