[jira] [Commented] (HADOOP-17021) Add concat fs command

2020-10-11 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212134#comment-17212134
 ] 

Jinglun commented on HADOOP-17021:
--

Thanks very much [~ste...@apache.org] for the detailed review and nice suggestions!

> Add concat fs command
> -
>
> Key: HADOOP-17021
> URL: https://issues.apache.org/jira/browse/HADOOP-17021
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: fs
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.3.1
>
> Attachments: HADOOP-17021.001.patch
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> We should add one concat fs command for ease of use. It concatenates existing 
> source files into the target file using FileSystem.concat().






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-29 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204004#comment-17204004
 ] 

Jinglun commented on HADOOP-17280:
--

Hi [~tasanuma], thanks for your explanation! Fixed it and uploaded v06.

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch, HADOOP-17280.004.patch, HADOOP-17280.005.patch, 
> HADOOP-17280.006.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Updated] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-29 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17280:
-
Attachment: HADOOP-17280.006.patch

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch, HADOOP-17280.004.patch, HADOOP-17280.005.patch, 
> HADOOP-17280.006.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-29 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203813#comment-17203813
 ] 

Jinglun commented on HADOOP-17280:
--

Hi [~tasanuma] [~hexiaoqiao], thanks for your great comments!
{quote}Please adjust the order of assertEquals at L452-L455 to other places.
{quote}
[~tasanuma] I didn't fully understand this; I beg your pardon. Could you be 
more specific? Thanks very much!

 

The other comments are addressed; uploaded v05, pending Jenkins.

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch, HADOOP-17280.004.patch, HADOOP-17280.005.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Updated] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-29 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17280:
-
Attachment: HADOOP-17280.005.patch

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch, HADOOP-17280.004.patch, HADOOP-17280.005.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-27 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17202752#comment-17202752
 ] 

Jinglun commented on HADOOP-17280:
--

Hi [~hexiaoqiao] [~tasanuma], could you help review this? Thanks very much!

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch, HADOOP-17280.004.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Updated] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-23 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17280:
-
Attachment: HADOOP-17280.004.patch

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch, HADOOP-17280.004.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-23 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201221#comment-17201221
 ] 

Jinglun commented on HADOOP-17280:
--

The compile error is unrelated. Re-submitting v03 as v04.

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch, HADOOP-17280.004.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-23 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200838#comment-17200838
 ] 

Jinglun commented on HADOOP-17280:
--

Uploaded v03, which fixes the metrics. Now we have 4 metrics: "DecayedCallVolume", 
"CallVolume", "ServiceUserDecayedCallVolume" and "ServiceUserCallVolume". The 
service-user's costs will be added to totalServiceUserDecayedCallCost and 
totalServiceUserRawCallCost instead of totalDecayedCallCost and totalRawCallCost.

Also fixed checkstyle and the test case. Pending Jenkins.

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Updated] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-23 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17280:
-
Attachment: HADOOP-17280.003.patch

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch, 
> HADOOP-17280.003.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-23 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200674#comment-17200674
 ] 

Jinglun commented on HADOOP-17280:
--

Hi [~tasanuma], thanks for your detailed comments! And your concern makes sense 
to me.
{quote}It becomes difficult to promote higher priority queues for users who are 
in lower priority queues, and they will be penalized more than expected.
{quote}
Yes, promotion would be more difficult because the total is smaller. I think this 
is acceptable, because the service-users would be easier to schedule, since the 
highest-priority queue doesn't have many calls.
{quote}The RPC costs of service-users will no longer appear in CallVolume (Raw 
Total incoming Call Volume) metrics.
{quote}
Yes, this would be a problem. Maybe we can add a new metric for service-users?

 

Since we are trying to guarantee the service quality of service-users, we have 
to penalize other users. Otherwise too many calls would get the highest 
priority and the service-users would not be guaranteed anymore. So I think we 
should let the other users compete for the remaining resources.

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-23 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200659#comment-17200659
 ] 

Jinglun commented on HADOOP-17280:
--

Hi [~hexiaoqiao], thanks for your nice comments! Uploaded v02 with a test case.

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Updated] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-23 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17280:
-
Attachment: HADOOP-17280.002.patch

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch, HADOOP-17280.002.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Commented] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-22 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200108#comment-17200108
 ] 

Jinglun commented on HADOOP-17280:
--

Hi [~hexiaoqiao] [~chaosun] [~tasanuma], would you like to review this? Thanks!

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Updated] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-22 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17280:
-
Attachment: HADOOP-17280.001.patch
  Assignee: Jinglun
Status: Patch Available  (was: Open)

Submitted v01 to illustrate the idea of the change.

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17280.001.patch
>
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Updated] (HADOOP-17280) Service-user cost shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-22 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17280:
-
Summary: Service-user cost shouldn't be accumulated to totalDecayedCallCost 
and totalRawCallCost.  (was: Service-user in DecayRPCScheduler shouldn't be 
accumulated to totalDecayedCallCost and totalRawCallCost.)

> Service-user cost shouldn't be accumulated to totalDecayedCallCost and 
> totalRawCallCost.
> 
>
> Key: HADOOP-17280
> URL: https://issues.apache.org/jira/browse/HADOOP-17280
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
>
> HADOOP-17165 has introduced a very useful feature: service-user. After this 
> feature I think we shouldn't add the service-user's cost into 
> totalDecayedCallCost and totalRawCallCost anymore. Because it may give all 
> the identities the priority 0(Supposing we have a big service-user).






[jira] [Created] (HADOOP-17280) Service-user in DecayRPCScheduler shouldn't be accumulated to totalDecayedCallCost and totalRawCallCost.

2020-09-22 Thread Jinglun (Jira)
Jinglun created HADOOP-17280:


 Summary: Service-user in DecayRPCScheduler shouldn't be 
accumulated to totalDecayedCallCost and totalRawCallCost.
 Key: HADOOP-17280
 URL: https://issues.apache.org/jira/browse/HADOOP-17280
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Jinglun


HADOOP-17165 introduced a very useful feature: service-user. With this feature 
in place, I think we shouldn't add the service-user's cost into 
totalDecayedCallCost and totalRawCallCost anymore, because it may give all the 
identities priority 0 (supposing we have a big service-user).
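
To make the concern concrete, here is a simplified sketch of how a caller's 
priority follows from its share of the total decayed cost. It only mirrors the 
general proportion-vs-threshold idea of DecayRpcScheduler; the 0.125/0.25/0.5 
thresholds are assumed 4-level defaults and all costs are invented for 
illustration.

{code:java}
public class ServiceUserPriorityExample {
  // Smaller share of the total cost => smaller (better) priority level.
  static int priority(double share, double[] thresholds) {
    for (int i = 0; i < thresholds.length; i++) {
      if (share <= thresholds[i]) {
        return i;
      }
    }
    return thresholds.length;                       // lowest-priority queue
  }

  public static void main(String[] args) {
    double[] thresholds = {0.125, 0.25, 0.5};       // assumed defaults
    double serviceUserCost = 900_000;               // one big service-user
    double heavyUserCost = 80_000;                  // a heavy ordinary user
    double otherCost = 20_000;                      // everyone else combined
    double total = serviceUserCost + heavyUserCost + otherCost;  // 1,000,000

    // Service-user cost included: 80,000 / 1,000,000 = 0.08, below every
    // threshold, so even the heavy ordinary user lands in priority 0.
    System.out.println(priority(heavyUserCost / total, thresholds));        // 0

    // Service-user cost excluded from the totals: 80,000 / 100,000 = 0.8,
    // so the heavy ordinary user drops to the lowest-priority queue again.
    System.out.println(priority(heavyUserCost / (total - serviceUserCost),
        thresholds));                                                       // 3
  }
}
{code}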






[jira] [Updated] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-21 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17268:
-
Description: 
My users recently complained to me that 'the NameNode is much slower than 
before'. The reason is that the cluster and jobs are getting bigger and bigger, 
and so is the pressure on the NameNode. I explained that the pressure was heavy 
so the rpc requests had to wait, but they were not satisfied, because they felt 
the original quality of service should be guaranteed. They were never told the 
NameNode would be so slow, and all their services were built on the assumption 
that the NameNode would always respond as fast as before.
 From the user's standpoint they are right. So my question is how to give the 
user a guarantee about RPC requests. The natural idea is an RPC Quota, just like 
the name quota and space quota. The quota can help users understand that rpc 
requests are also a limited resource. And when they apply for quota from the 
administrator, the admin has the chance to distribute the resource and make a 
plan for the cluster. E.g. we have 200 quota for addBlock and it is all 
allocated: even if the peak doesn't reach 200, I should reject other users' 
applications in order to reserve the resource. The new user should be mounted 
to other namespaces.
 It's still an initial idea now. I'll think it over carefully and make a 
detailed proposal. All advice is welcome!

  was:
My users recently complained 'The NameNode is much slower than before' to me. 
The reason is the cluster and jobs are getting bigger and bigger. So is the 
pressure of NameNode. I explained the pressure was heavy so the rpc requests 
must wait, but they were not satisfied. Because they thought the original 
quality of the service should be guaranteed. They were never told the NameNode 
would be so slow and all their services were built based on the assumption that 
the NameNode would always respond as fast as before.
From the user's standpoint they are right. So my question is how to give the 
user a guarantee about RPC requests. The natural idea is RPC Quota, just like 
name quota and space quota. The quota can help users to understand the rpc 
requests are also a limit resource. And when they apply quota to the 
administrator, the admin would have the chance to distribute the resource and 
make a plan for the cluster. e.g. We have 200 quota for addBlock and they are 
all allocated. Even the peak doesn't reach 200, I should reject other users 
from applying to reserve the resource. The new user should be mounted to 
another namespace.
It's still an initial idea now. I'll think again carefully and make a detailed 
proposal.


> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17268.001.patch
>
>
> My users recently complained 'The NameNode is much slower than before' to me. 
> The reason is the cluster and jobs are getting bigger and bigger. So is the 
> pressure of NameNode. I explained the pressure was heavy so the rpc requests 
> must wait, but they were not satisfied. Because they thought the original 
> quality of the service should be guaranteed. They were never told the 
> NameNode would be so slow and all their services were built based on the 
> assumption that the NameNode would always respond as fast as before.
>  From the user's standpoint they are right. So my question is how to give the 
> user a guarantee about RPC requests. The natural idea is RPC Quota, just like 
> name quota and space quota. The quota can help users to understand the rpc 
> requests are also a limit resource. And when they apply quota to the 
> administrator, the admin would have the chance to distribute the resource and 
> make a plan for the cluster. e.g. We have 200 quota for addBlock and they are 
> all allocated. Even the peak doesn't reach 200, I should reject other users 
> from applying to reserve the resource. The new user should be mounted to 
> other namespaces.
>  It's still an initial idea now. I'll think again carefully and make a 
> detailed proposal. All advice are welcome!






[jira] [Commented] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-21 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199279#comment-17199279
 ] 

Jinglun commented on HADOOP-17268:
--

Hi [~ayushtkn], thanks for your comments. Sorry I didn't make it clear; I added 
a description to explain the background and use case. It's still a draft, so 
your advice will be very helpful.

And yes, TOO_BUSY is not a good idea because it causes the client to retry. I 
think a quota-exceeded exception would be more appropriate.
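
To sketch the direction (a hypothetical illustration only, not the code in the 
attached patch; the class and exception names below are made up): the server 
side could track per-user, per-method usage against a configured quota and fail 
fast with a non-retriable quota-exceeded exception instead of the retriable 
'Server too busy' response.

{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch only; names and structure are not from the patch. */
public class RpcQuotaChecker {

  /** Deliberately non-retriable, unlike a 'Server too busy' response. */
  public static class RpcQuotaExceededException extends IOException {
    public RpcQuotaExceededException(String msg) {
      super(msg);
    }
  }

  // Configured quota and current usage per "user:method" key.
  private final Map<String, Long> quotas = new ConcurrentHashMap<>();
  private final Map<String, AtomicLong> usage = new ConcurrentHashMap<>();

  public void setQuota(String user, String method, long callsPerWindow) {
    quotas.put(user + ":" + method, callsPerWindow);
  }

  /** Called before handling an RPC; a real implementation would also reset
   *  the usage counters at the end of each window. */
  public void check(String user, String method) throws RpcQuotaExceededException {
    String key = user + ":" + method;
    Long quota = quotas.get(key);
    if (quota == null) {
      return;                                 // no quota configured
    }
    long used = usage.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    if (used > quota) {
      throw new RpcQuotaExceededException("RPC quota exceeded for " + user
          + " on " + method + " (" + used + " > " + quota + ")");
    }
  }
}
{code}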

> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17268.001.patch
>
>
> My users recently complained 'The NameNode is much slower than before' to me. 
> The reason is the cluster and jobs are getting bigger and bigger. So is the 
> pressure of NameNode. I explained the pressure was heavy so the rpc requests 
> must wait, but they were not satisfied. Because they thought the original 
> quality of the service should be guaranteed. They were never told the 
> NameNode would be so slow and all their services were built based on the 
> assumption that the NameNode would always respond as fast as before.
> From the user's standpoint they are right. So my question is how to give the 
> user a guarantee about RPC requests. The natural idea is RPC Quota, just like 
> name quota and space quota. The quota can help users to understand the rpc 
> requests are also a limit resource. And when they apply quota to the 
> administrator, the admin would have the chance to distribute the resource and 
> make a plan for the cluster. e.g. We have 200 quota for addBlock and they are 
> all allocated. Even the peak doesn't reach 200, I should reject other users 
> from applying to reserve the resource. The new user should be mounted to an
> other namespace.
> It's still an initial idea now. I'll think again carefully and make a 
> detailed proposal.






[jira] [Updated] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-21 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17268:
-
Description: 
My users recently complained 'The NameNode is much slower than before' to me. 
The reason is the cluster and jobs are getting bigger and bigger. So is the 
pressure of NameNode. I explained the pressure was heavy so the rpc requests 
must wait, but they were not satisfied. Because they thought the original 
quality of the service should be guaranteed. They were never told the NameNode 
would be so slow and all their services were built based on the assumption that 
the NameNode would always respond as fast as before.
From the user's standpoint they are right. So my question is how to give the 
user a guarantee about RPC requests. The natural idea is RPC Quota, just like 
name quota and space quota. The quota can help users to understand the rpc 
requests are also a limit resource. And when they apply quota to the 
administrator, the admin would have the chance to distribute the resource and 
make a plan for the cluster. e.g. We have 200 quota for addBlock and they are 
all allocated. Even the peak doesn't reach 200, I should reject other users 
from applying to reserve the resource. The new user should be mounted to 
another namespace.
It's still an initial idea now. I'll think again carefully and make a detailed 
proposal.

  was:Add the ability of rpc request quota to NameNode. All the requests 
exceeding quota would end with a 'Server too busy' exception. This can prevent 
users from overusing.


> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17268.001.patch
>
>
> My users recently complained 'The NameNode is much slower than before' to me. 
> The reason is the cluster and jobs are getting bigger and bigger. So is the 
> pressure of NameNode. I explained the pressure was heavy so the rpc requests 
> must wait, but they were not satisfied. Because they thought the original 
> quality of the service should be guaranteed. They were never told the 
> NameNode would be so slow and all their services were built based on the 
> assumption that the NameNode would always respond as fast as before.
> From the user's standpoint they are right. So my question is how to give the 
> user a guarantee about RPC requests. The natural idea is RPC Quota, just like 
> name quota and space quota. The quota can help users to understand the rpc 
> requests are also a limit resource. And when they apply quota to the 
> administrator, the admin would have the chance to distribute the resource and 
> make a plan for the cluster. e.g. We have 200 quota for addBlock and they are 
> all allocated. Even the peak doesn't reach 200, I should reject other users 
> from applying to reserve the resource. The new user should be mounted to an
> other namespace.
> It's still an initial idea now. I'll think again carefully and make a 
> detailed proposal.






[jira] [Updated] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-18 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17268:
-
Attachment: (was: HADOOP-17268.001.patch)

> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17268.001.patch
>
>
> Add the ability of rpc request quota to NameNode. All the requests exceeding 
> quota would end with a 'Server too busy' exception. This can prevent users 
> from overusing.






[jira] [Updated] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-18 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17268:
-
Status: Open  (was: Patch Available)

> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17268.001.patch
>
>
> Add the ability of rpc request quota to NameNode. All the requests exceeding 
> quota would end with a 'Server too busy' exception. This can prevent users 
> from overusing.






[jira] [Updated] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-18 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17268:
-
Attachment: HADOOP-17268.001.patch
Status: Patch Available  (was: Open)

Re-uploaded the patch to trigger Jenkins.

> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17268.001.patch
>
>
> Add the ability of rpc request quota to NameNode. All the requests exceeding 
> quota would end with a 'Server too busy' exception. This can prevent users 
> from overusing.






[jira] [Updated] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-17 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17268:
-
Attachment: HADOOP-17268.001.patch
Status: Patch Available  (was: Open)

> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-17268.001.patch
>
>
> Add the ability of rpc request quota to NameNode. All the requests exceeding 
> quota would end with a 'Server too busy' exception. This can prevent users 
> from overusing.






[jira] [Created] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-17 Thread Jinglun (Jira)
Jinglun created HADOOP-17268:


 Summary: Add RPC Quota to NameNode.
 Key: HADOOP-17268
 URL: https://issues.apache.org/jira/browse/HADOOP-17268
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Jinglun


Add the ability to set an RPC request quota on the NameNode. All requests 
exceeding the quota would fail with a 'Server too busy' exception. This can 
prevent users from overusing the service.






[jira] [Assigned] (HADOOP-17268) Add RPC Quota to NameNode.

2020-09-17 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HADOOP-17268:


Assignee: Jinglun

> Add RPC Quota to NameNode.
> --
>
> Key: HADOOP-17268
> URL: https://issues.apache.org/jira/browse/HADOOP-17268
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
>
> Add the ability of rpc request quota to NameNode. All the requests exceeding 
> quota would end with a 'Server too busy' exception. This can prevent users 
> from overusing.






[jira] [Commented] (HADOOP-17021) Add concat fs command

2020-05-11 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104989#comment-17104989
 ] 

Jinglun commented on HADOOP-17021:
--

Hi [~ste...@apache.org] [~weichiu], could you help review this? Thanks!

> Add concat fs command
> -
>
> Key: HADOOP-17021
> URL: https://issues.apache.org/jira/browse/HADOOP-17021
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HADOOP-17021.001.patch
>
>
> We should add one concat fs command for ease of use. It concatenates existing 
> source files into the target file using FileSystem.concat().






[jira] [Commented] (HADOOP-17021) Add concat fs command

2020-05-01 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097452#comment-17097452
 ] 

Jinglun commented on HADOOP-17021:
--

Hi [~ste...@apache.org], thanks for your comments. I created a PR and linked it.

> Add concat fs command
> -
>
> Key: HADOOP-17021
> URL: https://issues.apache.org/jira/browse/HADOOP-17021
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HADOOP-17021.001.patch
>
>
> We should add one concat fs command for ease of use. It concatenates existing 
> source files into the target file using FileSystem.concat().






[jira] [Updated] (HADOOP-17021) Add concat fs command

2020-04-30 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-17021:
-
Attachment: HADOOP-17021.001.patch
Status: Patch Available  (was: Open)

> Add concat fs command
> -
>
> Key: HADOOP-17021
> URL: https://issues.apache.org/jira/browse/HADOOP-17021
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HADOOP-17021.001.patch
>
>
> We should add one concat fs command for ease of use. It concatenates existing 
> source files into the target file using FileSystem.concat().






[jira] [Created] (HADOOP-17021) Add concat fs command

2020-04-30 Thread Jinglun (Jira)
Jinglun created HADOOP-17021:


 Summary: Add concat fs command
 Key: HADOOP-17021
 URL: https://issues.apache.org/jira/browse/HADOOP-17021
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Jinglun


We should add a concat fs command for ease of use. It concatenates existing 
source files into the target file using FileSystem.concat().
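
For context, the command would be a thin shell wrapper over the existing 
FileSystem.concat() API. A minimal sketch of the underlying call (the paths and 
NameNode address are made up, and the exact shell syntax is whatever the patch 
defines, presumably something like {{hadoop fs -concat <target> <src ...>}}):

{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of what the command does under the hood; all paths are made up. */
public class ConcatExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://nn1:8020"), conf);

    Path target = new Path("/data/part-merged");
    Path[] sources = {
        new Path("/data/part-00001"),
        new Path("/data/part-00002")
    };

    // Appends the source files' blocks to the target and removes the sources.
    // Only filesystems that implement concat (e.g. HDFS) support this.
    fs.concat(target, sources);
  }
}
{code}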






[jira] [Assigned] (HADOOP-17021) Add concat fs command

2020-04-30 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun reassigned HADOOP-17021:


Assignee: Jinglun

> Add concat fs command
> -
>
> Key: HADOOP-17021
> URL: https://issues.apache.org/jira/browse/HADOOP-17021
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
>
> We should add one concat fs command for ease of use. It concatenates existing 
> source files into the target file using FileSystem.concat().






[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-10-07 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945671#comment-16945671
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~weichiu], could you help review v08? Thank you very much!

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, HADOOP-16403.008.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
> 16/09/2019~20/09/2019
> Recently many users told me they got ConnectTimeoutException when trying to 
> access HDFS. Thanks to this MetricLinkedBlockingQueue I was able to trouble 
> shoot quickly. In my NameNode I configured one Reader with queue size 100, 
> and 256 handlers with callQueue size 25600. I refreshed both reader queue and 
> callQueue with MetricLinkedBlockingQueue, from the LOG I found the reader 
> queue was always full while the callQueue was not. So the bottle neck was the 
> Reader.
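
As a rough sketch of the queue idea described above (this is not the 
MetricLinkedBlockingQueue from the attached patches; it only shows counting 
put/take so an external metrics source, e.g. one registered on jmx, can derive 
a qps figure and log when the queue stays full):

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

/** Illustration only: counts the blocking put/take paths of the queue. */
public class CountingBlockingQueue<E> extends LinkedBlockingQueue<E> {
  private final LongAdder puts = new LongAdder();
  private final LongAdder takes = new LongAdder();

  public CountingBlockingQueue(int capacity) {
    super(capacity);
  }

  @Override
  public void put(E e) throws InterruptedException {
    super.put(e);
    puts.increment();
  }

  @Override
  public E take() throws InterruptedException {
    E e = super.take();
    takes.increment();
    return e;
  }

  // A sampler can read these periodically and turn the deltas into qps.
  public long putCount()  { return puts.sum(); }
  public long takeCount() { return takes.sum(); }
}
{code}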






[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-09-23 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Description: 
I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
after the active one dies it takes the standby more than 40s to become active. 
Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
and zkfc time out and start retrying. The sudden request flood lasts for the 
next 2 minutes, and finally all requests are either handled or run out of 
retries.
 Adjusting the rpc-related settings might strengthen the NameNode and solve this 
problem, and the key point is finding the bottleneck. The rpc server can be 
described as below:
{noformat}
Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
By sampling some failed clients, I found that many of them got 
ConnectTimeoutException, caused by a tcp connect request left unanswered for 
20s. I think maybe the reader queue is full and blocks the listener from 
handling new connections. Both slow handlers and slow readers can block the 
whole processing pipeline, and I need to know which one it is. I think *a queue 
that computes the qps, writes a log when the queue is full and can be replaced 
easily* will help.
 I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
Using it for the Reader's queue makes the reader queue runtime-swappable 
automatically. The qps computing job could be done by implementing a subclass 
of LinkedBlockingQueue that does the computing while put/take/... happens. The 
qps data will show up on jmx.

 

16/09/2019~20/09/2019

Recently many users told me they got ConnectTimeoutException when trying to 
access HDFS. Thanks to this MetricLinkedBlockingQueue I was able to 
troubleshoot quickly. In my NameNode I configured one Reader with a queue size 
of 100, and 256 handlers with a callQueue size of 25600. I refreshed both the 
reader queue and the callQueue to MetricLinkedBlockingQueue; from the LOG I 
found the reader queue was always full while the callQueue was not. So the 
bottleneck was the Reader.

  was:
I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
after the active dead, it takes the standby more than 40s to become active. 
Many requests(tcp connect request and rpc request) from Datanodes, clients and 
zkfc timed out and start retrying. The suddenly request flood lasts for the 
next 2 minutes and finally all requests are either handled or run out of retry 
times. 
 Adjusting the rpc related settings might power the NameNode and solve this 
problem and the key point is finding the bottle neck. The rpc server can be 
described as below:
{noformat}
Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
By sampling some failed clients, I find many of them got 
ConnectTimeoutException. It's caused by a 20s un-responded tcp connect request. 
I think may be the reader queue is full and block the listener from handling 
new connections. Both slow handlers and slow readers can block the whole 
processing progress, and I need to know who it is. I think *a queue that 
computes the qps, write log when the queue is full and could be replaced 
easily* will help. 
 I find the nice work HADOOP-10302 implementing a runtime-swapped queue. Using 
it at Reader's queue makes the reader queue runtime-swapped automatically. The 
qps computing job could be done by implementing a subclass of LinkedBlockQueue 
that does the computing job while put/take/... happens. The qps data will show 
on jmx.

 

 


> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, HADOOP-16403.008.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as 

[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-09-05 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0008.patch

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.bak, HADOOP-15565.0006.patch, HADOOP-15565.0007.patch, 
> HADOOP-15565.0008.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> Its children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We couldn't simply close all the children 
> filesystems because it would break the semantics of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem and let it cache all the children 
> filesystems. The children filesystems are then no longer shared. When the 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSystem instance and the other 
> instances (the children filesystems) are cached in the inner cache.
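
For reference, a rough sketch of the inner-cache idea (not the code from the 
attached patches): children are created with FileSystem.newInstance() so they 
bypass FileSystem.CACHE, are keyed by scheme and authority, and are all closed 
together when the owning ViewFileSystem is closed.

{code:java}
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

/** Sketch only; the real InnerCache in the patches may differ. */
class InnerCache {
  private final Map<String, FileSystem> map = new HashMap<>();

  synchronized FileSystem get(URI uri, Configuration conf) throws IOException {
    String key = uri.getScheme() + "://" + uri.getAuthority();
    FileSystem fs = map.get(key);
    if (fs == null) {
      // newInstance() bypasses FileSystem.CACHE, so this child is private
      // to the owning ViewFileSystem and safe to close with it.
      fs = FileSystem.newInstance(uri, conf);
      map.put(key, fs);
    }
    return fs;
  }

  synchronized void closeAll() throws IOException {
    for (FileSystem fs : map.values()) {
      fs.close();
    }
    map.clear();
  }
}
{code}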






[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-09-05 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923876#comment-16923876
 ] 

Jinglun commented on HADOOP-15565:
--

Hi [~xkrogen], I did a test on the diff 0 issue. When I wget the v006 link, the 
patch downloaded is actually v007. There must be a bug in Jira. I'll re-upload 
v006 as v006.bak.

Thanks for your advice, very helpful! Uploaded patch 008, changing to 
Objects.equals.

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.bak, HADOOP-15565.0006.patch, HADOOP-15565.0007.patch, 
> HADOOP-15565.0008.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.






[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-09-05 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0006.bak

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.bak, HADOOP-15565.0006.patch, HADOOP-15565.0007.patch, 
> HADOOP-15565.0008.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.






[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-09-05 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0007.patch

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.patch, HADOOP-15565.0007.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-09-05 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923308#comment-16923308
 ] 

Jinglun commented on HADOOP-15565:
--

Hi [~xkrogen], thanks very much for your nice review! I followed all the
suggestions and uploaded patch-007.
{quote}Can you explain why the changes in 
{{TestViewFileSystemDelegationTokenSupport}} are necessary? Same for 
{{TestViewFileSystemDelegation}} – it seems like the old way of returning the 
created {{fs}} was cleaner?
{quote}
Good question! I changed the unit tests because, before we added the cache, all
the child filesystems were cached in FileSystem.CACHE, so the filesystem instance
returned by setupMockFileSystem() was exactly the child filesystem of viewFs.
After adding ViewFileSystem.InnerCache, viewFs's child filesystem instances are
no longer cached in FileSystem.CACHE, so we can't set fs1 and fs2 to the
instances returned by setupMockFileSystem().
{quote}I also don't understand the need for changes in {{testSanity()}} – does 
the string comparison no longer work?
{quote}
About testSanity(), after changing to 
{code:java}
fs1 = (FakeFileSystem) getChildFileSystem((ViewFileSystem) viewFs, new 
URI("fs1:/"));{code}
fs1.getUri() will have a path which is set by ViewFileSystem.InnerCache.get(URI
uri, Configuration config). So comparing URI.toString() no longer works, and I
changed the test to compare the scheme and authority instead.
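For illustration, a minimal helper showing that looser check (the helper name is
hypothetical, not the actual test code):
{code:java}
import static org.junit.Assert.assertEquals;

import java.net.URI;

// Hypothetical helper, not the actual test code: compare only scheme and
// authority, ignoring any path the inner cache may have attached to the
// child filesystem's URI.
final class UriChecks {
  static void assertSameSchemeAndAuthority(URI expected, URI actual) {
    assertEquals(expected.getScheme(), actual.getScheme());
    assertEquals(expected.getAuthority(), actual.getAuthority());
  }
}
{code}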
{quote}Can you describe why the changes in {{TestViewFsDefaultValue}} are 
necessary?
{quote}
Good question! I made the changes because in
TestViewFsDefaultValue.clusterSetupAtBegining() there are two Configuration
instances, *CONF* and *conf*. For the key _DFS_REPLICATION_KEY_, *CONF* is set to
_DFS_REPLICATION_DEFAULT + 1_ while *conf* keeps the default value. Before the
InnerCache, the child filesystem instance of vfs was obtained from
FileSystem.CACHE, which was constructed with *CONF*. After the InnerCache, the
child filesystem instance is created with *conf*. In the test case
testGetDefaultReplication(), the default replication is read from the child
FileSystem instance: with *CONF* it is _DFS_REPLICATION_DEFAULT + 1_, with *conf*
it is _DFS_REPLICATION_DEFAULT_. Because testGetDefaultReplication() tests the
default replication of the mount point path, I changed it to use *conf* to make
it work.

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.patch, HADOOP-15565.0007.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-29 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918433#comment-16918433
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~jojochuang], [~xkrogen], patch-008 is ready, could you help review it? 
Thanks !

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, HADOOP-16403.008.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-29 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918429#comment-16918429
 ] 

Jinglun edited comment on HADOOP-15565 at 8/29/19 9:11 AM:
---

Hi [~xkrogen] [~jojochuang], could you help to review this? Thanks!


was (Author: lijinglun):
Hi [~xkrogen] [~xkrogen], could you help to review this, thanks !

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-29 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918429#comment-16918429
 ] 

Jinglun commented on HADOOP-15565:
--

Hi [~xkrogen] [~xkrogen], could you help to review this, thanks !

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-22 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913904#comment-16913904
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~jojochuang] [~xkrogen] [Íñigo
Goiri|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=elgoiri],
could you help to review patch-008? Thanks very much!

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, HADOOP-16403.008.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-20 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911019#comment-16911019
 ] 

Jinglun commented on HADOOP-15565:
--

Hi [~xkrogen], do you have time to review this?  Thanks~

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-14 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0006.patch

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch, 
> HADOOP-15565.0006.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-14 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906972#comment-16906972
 ] 

Jinglun commented on HADOOP-15565:
--

Thanks [~xkrogen] for your nice comments.
{quote}I don't think publicly exposing cacheSize() on FileSystem is a great 
idea. Can we make it package-private, and if it is needed in non-package-local 
tests, use a test utility to export it publicly?
{quote}
Reasonable! I don't want to create a new file, so I'll add the test utility to
TestFileUtil.java.
{quote}Is there a chance the cache will be accessed in a multi-threaded way? If 
so we need to harden it for concurrent access. Looks like it will only work in 
a single-threaded fashion currently. If the FS instances are actually all 
created on startup, then I think we should explicitly populate the cache on 
startup.
{quote}
The fs instances are all created at startup. I'll make the cache unmodifiable so
we know it is only populated at startup and is never modified afterwards.
{quote}I agree that swallowing exceptions on child FS close is the right move, 
but probably we should at least put them at INFO level?
{quote}
Right! I'll change it.
{quote}This seems less strict than FileSystem.CACHE when checking for equality; 
it doesn't use the UserGroupInformation at all. I think this is safe because 
the cache is local to a single ViewFileSystem, so all of the inner cached 
instances must share the same UGI, but please help me to confirm.
{quote}
Yes, it's safe. As all the instances share the same UGI we can make the Key 
simple.
{quote}We can use Objects.hash() for the hashCode() method of Key.
{quote}
Right! That's a good practice! I'll update it.
{quote}On ViewFileSystem L257, you shouldn't initialize fs – you can just 
declare it: FileSystem fs; (this allows the compiler to help ensure that you 
remember to initialize it later)
{quote}
Right! I'll update it.

 

Upload patch-006.
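
Putting the points above together, a rough sketch of the inner-cache idea:
populated once at initialization, frozen as an unmodifiable map, and drained on
close with failures logged at INFO. All names here are illustrative, not the
committed patch code:
{code:java}
import java.io.IOException;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch only: the child filesystems are created once at startup,
// the map is made unmodifiable, and close drains them all.
class InnerChildFsCache {
  private static final Logger LOG =
      LoggerFactory.getLogger(InnerChildFsCache.class);
  private final Map<String, FileSystem> children;

  InnerChildFsCache(URI[] childUris, Configuration conf) throws IOException {
    Map<String, FileSystem> map = new HashMap<>();
    for (URI uri : childUris) {
      // One FileSystem.newInstance per child mount point, keyed by scheme+authority.
      map.put(key(uri), FileSystem.newInstance(uri, conf));
    }
    // Created on startup and never modified afterwards.
    this.children = Collections.unmodifiableMap(map);
  }

  FileSystem get(URI uri) {
    return children.get(key(uri));
  }

  void closeAll() {
    for (FileSystem fs : children.values()) {
      try {
        fs.close();
      } catch (IOException e) {
        // Swallow so one bad child doesn't stop the rest, but keep it visible.
        LOG.info("Failed to close child filesystem " + fs.getUri(), e);
      }
    }
  }

  private static String key(URI uri) {
    return uri.getScheme() + "://" + uri.getAuthority();
  }
}
{code}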

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-13 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905934#comment-16905934
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~jojochuang] [~xkrogen], patch-008 is ready now. Do you have time to review
it? Thanks! :)

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, HADOOP-16403.008.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-13 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905932#comment-16905932
 ] 

Jinglun commented on HADOOP-15565:
--

Hi [~jojochuang] [~xkrogen] [~gabor.bota] [~hexiaoqiao], patch-005 is ready 
now. Could you help review it, thanks!

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-12 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.008.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, HADOOP-16403.008.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-12 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905071#comment-16905071
 ] 

Jinglun commented on HADOOP-16403:
--

Wrote the documentation of the MetricLinkedBlockingQueue in HADOOP-16506. While
writing the doc I found that the configuration keys of MetricLinkedBlockingQueue
were in a style inconsistent with FairCallQueue, so I updated the keys and now
they look consistent. The keys are described in HADOOP-16506.

Upload patch-007 and wait for jenkins.
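
For readers of the thread, a minimal sketch of the qps-counting idea described in
this issue: a LinkedBlockingQueue subclass that counts put/take calls so a metrics
layer can derive rates. The class name and the metrics/JMX wiring are assumptions,
not the actual MetricLinkedBlockingQueue patch:
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch only: counts put/take calls; a sampler or JMX bean
// (omitted here) can turn the counters into per-second rates.
public class CountingBlockingQueue<E> extends LinkedBlockingQueue<E> {
  private final LongAdder puts = new LongAdder();
  private final LongAdder takes = new LongAdder();

  public CountingBlockingQueue(int capacity) {
    super(capacity);
  }

  @Override
  public void put(E e) throws InterruptedException {
    super.put(e);
    puts.increment();   // count enqueues for the put-rate metric
  }

  @Override
  public E take() throws InterruptedException {
    E e = super.take();
    takes.increment();  // count dequeues for the take-rate metric
    return e;
  }

  public long putCount() {
    return puts.sum();
  }

  public long takeCount() {
    return takes.sum();
  }
}
{code}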

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-12 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.007.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> HADOOP-16403.007.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16506) Create proper documentation for MetricLinkedBlockingQueue

2019-08-12 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16506:
-
Attachment: HADOOP-16506.001.patch

> Create proper documentation for MetricLinkedBlockingQueue
> -
>
> Key: HADOOP-16506
> URL: https://issues.apache.org/jira/browse/HADOOP-16506
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16506.001.patch
>
>
> Add documentation for the MetricLinkedBlockingQueue. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-16506) Create proper documentation for MetricLinkedBlockingQueue

2019-08-12 Thread Jinglun (JIRA)
Jinglun created HADOOP-16506:


 Summary: Create proper documentation for MetricLinkedBlockingQueue
 Key: HADOOP-16506
 URL: https://issues.apache.org/jira/browse/HADOOP-16506
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Jinglun
Assignee: Jinglun


Add documentation for the MetricLinkedBlockingQueue. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-11 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904842#comment-16904842
 ] 

Jinglun commented on HADOOP-15565:
--

Thanks a lot [~jojochuang] for pointing it out. Yes, it's related and can be
solved by the inner cache too. I added a new test case for the deleteOnExit
situation.

The javac failures are caused by the old unit tests calling deprecated methods in
TestChrootedFileSystem.java. I think we can fix the javac problem in a new jira
and just ignore it here.

The unit tests all passed on my local pc except 
[org.apache.hadoop.hdfs.server.datanode.TestLargeBlockReport.testBlockReportSucceedsWithLargerLengthLimit|https://builds.apache.org/job/PreCommit-HADOOP-Build/16464/testReport/org.apache.hadoop.hdfs.server.datanode/TestLargeBlockReport/testBlockReportSucceedsWithLargerLengthLimit/].
 It failed with and without the patch. 

 

Upload patch-005 and wait jenkins.

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-11 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0005.patch

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch, HADOOP-15565.0005.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-09 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903780#comment-16903780
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~jojochuang], patch-006 is ready now, do you have time to review it?
Thanks a lot!

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big so 
> after the active dead, it takes the standby more than 40s to become active. 
> Many requests(tcp connect request and rpc request) from Datanodes, clients 
> and zkfc timed out and start retrying. The suddenly request flood lasts for 
> the next 2 minutes and finally all requests are either handled or run out of 
> retry times. 
>  Adjusting the rpc related settings might power the NameNode and solve this 
> problem and the key point is finding the bottle neck. The rpc server can be 
> described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException. It's caused by a 20s un-responded tcp connect 
> request. I think may be the reader queue is full and block the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing progress, and I need to know who it is. I think *a queue 
> that computes the qps, write log when the queue is full and could be replaced 
> easily* will help. 
>  I find the nice work HADOOP-10302 implementing a runtime-swapped queue. 
> Using it at Reader's queue makes the reader queue runtime-swapped 
> automatically. The qps computing job could be done by implementing a subclass 
> of LinkedBlockQueue that does the computing job while put/take/... happens. 
> The qps data will show on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-09 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0004.patch

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch, HADOOP-15565.0004.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-09 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903600#comment-16903600
 ] 

Jinglun commented on HADOOP-15565:
--

These 6 failed tests are related. They all rely on FileSystem.CACHE to get the
child filesystems. Since the child filesystems are no longer cached in
FileSystem.CACHE, these unit tests fail. ViewFileSystem already has the method
getChildFileSystems() and we can use it to fix the unit tests (see the sketch
after the list below).


org.apache.hadoop.fs.viewfs.TestChRootedFileSystem.testListLocatedFileStatus
org.apache.hadoop.fs.viewfs.TestViewFileSystemDelegation.testAclMethods
org.apache.hadoop.fs.viewfs.TestViewFileSystemDelegation.testVerifyChecksum
org.apache.hadoop.fs.viewfs.TestViewFileSystemDelegationTokenSupport.testGetChildFileSystems
org.apache.hadoop.fs.viewfs.TestViewFileSystemDelegationTokenSupport.testAddDelegationTokens
org.apache.hadoop.fs.viewfs.TestViewFsDefaultValue.testGetDefaultReplication
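
A hypothetical test helper along these lines (the names are mine, not the
patch's): look a child filesystem up through ViewFileSystem.getChildFileSystems()
instead of relying on FileSystem.CACHE.
{code:java}
import java.net.URI;
import java.util.Objects;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.viewfs.ViewFileSystem;

// Hypothetical helper, not the patch code: find the child filesystem whose URI
// matches the requested scheme and authority.
final class ChildFsLookup {
  static FileSystem getChildFileSystem(ViewFileSystem viewFs, URI uri) {
    for (FileSystem fs : viewFs.getChildFileSystems()) {
      if (Objects.equals(uri.getScheme(), fs.getUri().getScheme())
          && Objects.equals(uri.getAuthority(), fs.getUri().getAuthority())) {
        return fs;
      }
    }
    return null;  // no child mounted with this scheme/authority
  }
}
{code}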

 

Upload patch-004.

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-08 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0003.patch

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-08 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902986#comment-16902986
 ] 

Jinglun commented on HADOOP-15565:
--

Thanks [~gabor.bota] for your nice comments. I added more unit tests and changed
some variable names to make the code more readable. The description is also
updated.

Upload patch-003 and wait for jenkins.

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch, 
> HADOOP-15565.0003.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2019-08-08 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Description: 
ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE.
Its child filesystems are cached in FileSystem.CACHE and shared by all the
ViewFileSystem instances. We couldn't simply close all the child filesystems
because that would break the semantics of FileSystem.newInstance().
We might add an inner cache to ViewFileSystem and let it cache all the child
filesystems. The child filesystems are then no longer shared. When a
ViewFileSystem is closed we close all the child filesystems in its inner cache.
The ViewFileSystem itself is still cached by FileSystem.CACHE, so there won't be
too many FileSystem instances.

FileSystem.CACHE caches the ViewFileSystem instance and the other
instances (the child filesystems) are cached in the inner cache.

  was:
When we create a ViewFileSystem, all it's child filesystems will be cached by 
FileSystem.CACHE. Unless we close these child filesystems, they will stay in 
FileSystem.CACHE forever.
I think we should let FileSystem.CACHE cache ViewFileSystem only, and let 
ViewFileSystem cache all it's child filesystems. So we can close ViewFileSystem 
without leak and won't affect other ViewFileSystems.
I find this problem because i need to re-login my kerberos and renew 
ViewFileSystem periodically. Because FileSystem.CACHE.Key is based on 
UserGroupInformation, which changes everytime i re-login, I can't use the 
cached child filesystems when i new a ViewFileSystem. And because 
ViewFileSystem.close does nothing but remove itself from cache, i leak all it's 
child filesystems in cache.


> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch
>
>
> ViewFileSystem.close() does nothing but remove itself from FileSystem.CACHE. 
> It's children filesystems are cached in FileSystem.CACHE and shared by all 
> the ViewFileSystem instances. We could't simply close all the children 
> filesystems because it will break the semantic of FileSystem.newInstance().
> We might add an inner cache to ViewFileSystem, let it cache all the children 
> filesystems. The children filesystems are not shared any more. When 
> ViewFileSystem is closed we close all the children filesystems in the inner 
> cache. The ViewFileSystem is still cached by FileSystem.CACHE so there won't 
> be too many FileSystem instances.
> The FileSystem.CACHE caches the ViewFileSysem instance and the other 
> instances(the children filesystems) are cached in the inner cache.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-02 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.006.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, HADOOP-16403.006.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-01 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: (was: HADOOP-16403.005.patch)

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-01 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.005.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-01 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.005.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-01 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: (was: HADOOP-16403.005.patch)

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-01 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.005.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, HADOOP-16403.005.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-08-01 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897945#comment-16897945
 ] 

Jinglun commented on HADOOP-16403:
--

About the shadedclient error, I searched 
[patch-shadedclient.txt|https://builds.apache.org/job/PreCommit-HADOOP-Build/16437/artifact/out/patch-shadedclient.txt]
 and found this:
{quote}[ERROR] Found artifact with unexpected contents: 
'/testptch/hadoop/hadoop-client-modules/hadoop-client-api/target/hadoop-client-api-3.3.0-SNAPSHOT.jar'
 Please check the following and either correct the build or update
 the allowed list with reasoning.

core-default.xml.orig
{quote}
There is a jar check in 
*_./hadoop-client-modules/hadoop-client-check-invariants/src/test/resources/ensure-jars-have-correct-contents.sh_*,
 and it seems core-default.xml.orig got packaged into 
hadoop-client-api-3.3.0-SNAPSHOT.jar. 
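For a quick local check, something like the following (a hypothetical helper, 
not part of the Hadoop build or of ensure-jars-have-correct-contents.sh) can 
list the unexpected entries in the client jar:
{code:java}
import java.io.IOException;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Prints entries of a client jar that look like leftover *.orig files. */
public class FindUnexpectedJarEntry {
  public static void main(String[] args) throws IOException {
    // Assumed local path; pass the real artifact path as the first argument.
    String jarPath = args.length > 0
        ? args[0]
        : "hadoop-client-api-3.3.0-SNAPSHOT.jar";
    try (JarFile jar = new JarFile(jarPath)) {
      jar.stream()
         .map(JarEntry::getName)
         .filter(name -> name.endsWith(".orig"))   // e.g. core-default.xml.orig
         .forEach(name -> System.out.println("unexpected entry: " + name));
    }
  }
}
{code}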

I'm not sure how this happened. I made a new patch from the latest trunk and 
fixed the checkstyle issues. Uploading patch-005 to see if the shadedclient 
error still occurs.

 

 

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-31 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897124#comment-16897124
 ] 

Jinglun commented on HADOOP-16403:
--

Uploaded patch-004, pending Jenkins. I'll start a new Jira to add more 
documentation on how to use MetricLinkedBlockingQueue and how to switch the 
reader queue with the new DFSAdmin command and configurations.

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-31 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.004.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-31 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: (was: HADOOP-16403.004.patch)

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-31 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.004.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-31 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: (was: HADOOP-16403.004.patch)

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-31 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.004.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> HADOOP-16403.004.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-27 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16894325#comment-16894325
 ] 

Jinglun commented on HADOOP-16403:
--

Thanks [~jojochuang] for your valuable suggestions. I'll update the patch as 
soon as possible. 

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-23 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.003.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-23 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891572#comment-16891572
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~jojochuang], thanks for your review and comments. We backported HDFS-8865 
and will update a cluster next week. 
This work is designed for troubleshooting: I can swap in 
MetricLinkedBlockingQueue to pinpoint the bottleneck when the rpc server is slow.
The test "MetricLinkedBlockingQueue (without log)" means I removed all the LOG 
code from MetricLinkedBlockingQueue and ran the same test. In that test 
MetricLinkedBlockingQueue is 3.45 times slower than LinkedBlockingQueue, so I 
wanted to know how much the LOG calls cost.
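To give a quick picture of what the statistical queue does, here is a rough 
sketch of the idea only (this is not the MetricLinkedBlockingQueue in the patch; 
the class and counter names below are made up). A LinkedBlockingQueue subclass 
counts operations as they happen, so the rates can be published to jmx and a 
log can be written when the queue is full:
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative counting queue; not the real MetricLinkedBlockingQueue. */
public class CountingBlockingQueue<E> extends LinkedBlockingQueue<E> {
  private final LongAdder puts = new LongAdder();        // enqueued elements
  private final LongAdder takes = new LongAdder();       // dequeued elements
  private final LongAdder fullRejects = new LongAdder(); // offer() failures

  public CountingBlockingQueue(int capacity) {
    super(capacity);
  }

  @Override
  public boolean offer(E e) {
    boolean ok = super.offer(e);
    if (ok) {
      puts.increment();
    } else {
      fullRejects.increment(); // queue is full: a candidate bottleneck
    }
    return ok;
  }

  @Override
  public void put(E e) throws InterruptedException {
    super.put(e);
    puts.increment();
  }

  @Override
  public E take() throws InterruptedException {
    E e = super.take();
    takes.increment();
    return e;
  }

  // Counters a JMX bean could expose; qps is the delta over a sampling period.
  public long getPuts()        { return puts.sum(); }
  public long getTakes()       { return takes.sum(); }
  public long getFullRejects() { return fullRejects.sum(); }
}
{code}
The real queue would also need to cover the timed offer/poll variants and 
publish the counters through the Hadoop metrics system rather than plain 
getters.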
I'm not familiar with the docs; I find HDFSCommands.md is related. Is it enough 
to update the doc in HDFSCommands.md?

Uploaded patch-003 with the doc update.

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, HADOOP-16403.003.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-18 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887980#comment-16887980
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~xkrogen], [~jojochuang], [~ayushtkn], [~hexiaoqiao], do you have time to 
take a look at this? Please let me know your thoughts. :)

I plan to use this to collect statistics on the RPC Readers and Handlers and 
get a better understanding of the NameNode. I'll update a production cluster 
with the statistical queue in 2 weeks. I hope to find something interesting 
through this new queue.

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-13 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884351#comment-16884351
 ] 

Jinglun commented on HADOOP-16403:
--

Uploaded How_MetricLinkedBlockingQueue_Works.pdf to explain how 
MetricLinkedBlockingQueue works and the meaning of the statistics.

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-13 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.002.patch

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, HADOOP-16403.002.patch, 
> MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-13 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403-How_MetricLinkedBlockingQueue_Works.pdf, 
> HADOOP-16403.001.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-07 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879812#comment-16879812
 ] 

Jinglun commented on HADOOP-16403:
--

Yes, that's exactly where the problem is! Thanks [~hexiaoqiao] for the 
reference, I'll backport it and run a test next week.

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403.001.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-06 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879636#comment-16879636
 ] 

Jinglun commented on HADOOP-16403:
--

Hi [~hexiaoqiao], sorry for the late reply. It's a Xiaomi internal version based 
on hadoop 2.6. The Namenode consists of 140,984,043 inodes, 351,238,718 blocks 
and 3000+ datanodes, and runs with a 100G heap. Version 2.6 is very old, but we 
have backported many updates from later versions, so it's not really that old.

I'm considering profiling the failover transition and the long-tail effect 
caused by the long transition. The queue is a tool for that. Maybe replaying 
the edit log and handling the postponed block reports cost too much time. What 
do you think?

Any advice on how to reproduce the situation and how to pin down the problem? 
Looking forward to your suggestions. :)

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403.001.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's metadata is quite big, 
> so after the active NameNode dies it takes the standby more than 40s to become 
> active. Many requests (tcp connect requests and rpc requests) from Datanodes, 
> clients and zkfc time out and start retrying. The sudden request flood lasts 
> for the next 2 minutes, until finally all requests are either handled or run 
> out of retries.
>  Adjusting the rpc-related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I found that many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. Maybe the reader queue is full and blocks the listener from handling 
> new connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue that 
> computes the qps, writes a log when the queue is full and can be replaced 
> easily* will help.
>  I found the nice work in HADOOP-10302 implementing a runtime-swappable queue. 
> Using it as the Reader's queue makes the reader queue runtime-swappable 
> automatically. The qps computation can be done by implementing a subclass of 
> LinkedBlockingQueue that does the counting while put/take/... happens. The qps 
> data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-06 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: MetricLinkedBlockingQueueTest.pdf

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403.001.patch, MetricLinkedBlockingQueueTest.pdf
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
> after the active dies it takes the standby more than 40s to become active. 
> Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
> and zkfc time out and start retrying. The sudden request flood lasts for the 
> next 2 minutes until finally all requests are either handled or run out of 
> retries. 
>  Adjusting the rpc related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. I think maybe the reader queue is full and blocks the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing pipeline, and I need to know which one it is. I think *a 
> queue that computes the qps, writes a log when the queue is full and can be 
> replaced easily* will help. 
>  I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. 
> Using it for the Reader's queue makes the reader queue runtime-swappable. The 
> qps computing job could be done by implementing a subclass of 
> LinkedBlockingQueue that does the computing while put/take/... happens. The 
> qps data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-06 Thread Jinglun (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16879627#comment-16879627
 ] 

Jinglun commented on HADOOP-16403:
--

 
I did a test comparing MetricLinkedBlockingQueue and LinkedBlockingQueue. I 
start 256 producers putting entries into the queue and 256 consumers consuming 
from it (each producer / consumer has its own thread). The total number of 
entries is 100,000,000; it takes the LinkedBlockingQueue 18180ms to finish and 
the MetricLinkedBlockingQueue 62777ms. If I disable the LOG in 
MetricLinkedBlockingQueue, it takes 28270ms to finish. See the table below.
||Queue||Time cost (ms)||Consume rate (puts/second)||
|LinkedBlockingQueue|18180|5500550|
|MetricLinkedBlockingQueue|62777|1592940|
|MetricLinkedBlockingQueue (without log)|28270|3537318|

Though there is significant overhead in using MetricLinkedBlockingQueue, it 
still won't be a problem, because the rpc handling rate of the NameNode hardly 
reaches 30,000 per second, which is much lower than the 
MetricLinkedBlockingQueue's limit of 1,592,940. 
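
For reference, a minimal sketch of the kind of producer/consumer benchmark 
described above. The thread count and total entry count follow the numbers in 
this comment; everything else (class name, queue capacity, timing code) is 
illustrative, and the MetricLinkedBlockingQueue from the patch can be swapped 
in for the plain LinkedBlockingQueue:
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueBenchmark {
  public static void main(String[] args) throws Exception {
    final int threads = 256;                  // 256 producers and 256 consumers
    final long total = 100_000_000L;          // total entries pushed through
    final long perThread = total / threads;   // 390,625 entries per thread
    // Swap in MetricLinkedBlockingQueue here to benchmark the new queue.
    final BlockingQueue<Long> queue = new LinkedBlockingQueue<>(10_000);

    Runnable producer = () -> {
      try {
        for (long i = 0; i < perThread; i++) {
          queue.put(i);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    Runnable consumer = () -> {
      try {
        for (long i = 0; i < perThread; i++) {
          queue.take();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };

    Thread[] workers = new Thread[threads * 2];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(producer, "producer-" + i);
      workers[threads + i] = new Thread(consumer, "consumer-" + i);
    }
    long start = System.currentTimeMillis();
    for (Thread t : workers) {
      t.start();
    }
    for (Thread t : workers) {
      t.join();
    }
    long cost = System.currentTimeMillis() - start;
    System.out.println("Time cost(ms): " + cost
        + ", rate(puts/second): " + total * 1000 / cost);
  }
}
{code}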

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403.001.patch
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
> after the active dies it takes the standby more than 40s to become active. 
> Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
> and zkfc time out and start retrying. The sudden request flood lasts for the 
> next 2 minutes until finally all requests are either handled or run out of 
> retries. 
>  Adjusting the rpc related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. I think maybe the reader queue is full and blocks the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing pipeline, and I need to know which one it is. I think *a 
> queue that computes the qps, writes a log when the queue is full and can be 
> replaced easily* will help. 
>  I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. 
> Using it for the Reader's queue makes the reader queue runtime-swappable. The 
> qps computing job could be done by implementing a subclass of 
> LinkedBlockingQueue that does the computing while put/take/... happens. The 
> qps data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-02 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Description: 
I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
after the active dies it takes the standby more than 40s to become active. 
Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
and zkfc time out and start retrying. The sudden request flood lasts for the 
next 2 minutes until finally all requests are either handled or run out of 
retries. 
 Adjusting the rpc related settings might strengthen the NameNode and solve 
this problem, and the key point is finding the bottleneck. The rpc server can 
be described as below:
{noformat}
Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
By sampling some failed clients, I find many of them got 
ConnectTimeoutException, caused by a tcp connect request that went unanswered 
for 20s. I think maybe the reader queue is full and blocks the listener from 
handling new connections. Both slow handlers and slow readers can block the 
whole processing pipeline, and I need to know which one it is. I think *a 
queue that computes the qps, writes a log when the queue is full and can be 
replaced easily* will help. 
 I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. 
Using it for the Reader's queue makes the reader queue runtime-swappable. The 
qps computing job could be done by implementing a subclass of 
LinkedBlockingQueue that does the computing while put/take/... happens. The 
qps data will show up on jmx.

 

 

  was:
I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
after the active dies it takes the standby more than 40s to become active. 
Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
and zkfc time out and start retrying. The sudden request flood lasts for the 
next 2 minutes until finally all requests are either handled or run out of 
retries. 
Adjusting the rpc related settings might strengthen the NameNode and solve 
this problem, and the key point is finding the bottleneck. The rpc server can 
be described as below:
{noformat}
Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
By sampling some failed clients, I find many of them got ConnectException, 
caused by a tcp connect request that went unanswered for 20s. I think maybe 
the reader queue is full and blocks the listener from handling new 
connections. Both slow handlers and slow readers can block the whole 
processing pipeline, and I need to know which one it is. I think *a queue 
that computes the qps, writes a log when the queue is full and can be 
replaced easily* will help. 
I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. 
Using it for the Reader's queue makes the reader queue runtime-swappable. The 
qps computing job could be done by implementing a subclass of 
LinkedBlockingQueue that does the computing while put/take/... happens. The 
qps data will show up on jmx.

 

 


> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403.001.patch
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
> after the active dies it takes the standby more than 40s to become active. 
> Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
> and zkfc time out and start retrying. The sudden request flood lasts for the 
> next 2 minutes until finally all requests are either handled or run out of 
> retries. 
>  Adjusting the rpc related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got 
> ConnectTimeoutException, caused by a tcp connect request that went unanswered 
> for 20s. I think maybe the reader queue is full and blocks the listener from 
> handling new connections. Both slow handlers and slow readers can block the 
> whole processing pipeline, and I need to know which one it is. I think *a 
> queue that computes the qps, writes a log when the queue is full and can be 
> replaced easily* will help. 
>  I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. 
> Using it for the Reader's queue makes the reader queue runtime-swappable. The 
> qps computing job could be done by implementing a subclass 
> of 

[jira] [Updated] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-01 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-16403:
-
Attachment: HADOOP-16403.001.patch
Status: Patch Available  (was: Open)

Patch 001 shows my thoughts about the statistical queue and making the reader 
queue run-time swappable. I move the swap ability from CallQueueManager to a 
new class SwapQueueManager and make CallQueueManager a subclass of it, so the 
reader queue can be swapped at run time. I also add a new class 
MetricLinkedBlockingQueue to compute the qps and write a queue-full log.
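
For reference, a minimal sketch of the idea behind MetricLinkedBlockingQueue: 
count operations in the overridden methods and log when the queue is full. The 
field names, the logging and the metrics accessors below are illustrative 
placeholders, not the code in the attached patch:
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAdder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative sketch: count put/take traffic and log when the queue is full. */
public class MetricLinkedBlockingQueue<E> extends LinkedBlockingQueue<E> {
  private static final Logger LOG =
      LoggerFactory.getLogger(MetricLinkedBlockingQueue.class);

  private final LongAdder puts = new LongAdder();
  private final LongAdder takes = new LongAdder();

  public MetricLinkedBlockingQueue(int capacity) {
    super(capacity);
  }

  @Override
  public void put(E e) throws InterruptedException {
    if (remainingCapacity() == 0) {
      // The queue is full: the caller (e.g. the Listener) is about to block.
      LOG.warn("Queue is full, size={}", size());
    }
    super.put(e);
    puts.increment();
  }

  @Override
  public E take() throws InterruptedException {
    E e = super.take();
    takes.increment();
    return e;
  }

  /** Counters a metrics/JMX layer can read periodically to derive the qps. */
  public long getPutCount() {
    return puts.sum();
  }

  public long getTakeCount() {
    return takes.sum();
  }
}
{code}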

> Start a new statistical rpc queue and make the Reader's pendingConnection 
> queue runtime-replaceable
> ---
>
> Key: HADOOP-16403
> URL: https://issues.apache.org/jira/browse/HADOOP-16403
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-16403.001.patch
>
>
> I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
> after the active dies it takes the standby more than 40s to become active. 
> Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
> and zkfc time out and start retrying. The sudden request flood lasts for the 
> next 2 minutes until finally all requests are either handled or run out of 
> retries. 
> Adjusting the rpc related settings might strengthen the NameNode and solve 
> this problem, and the key point is finding the bottleneck. The rpc server can 
> be described as below:
> {noformat}
> Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
> By sampling some failed clients, I find many of them got ConnectException, 
> caused by a tcp connect request that went unanswered for 20s. I think maybe 
> the reader queue is full and blocks the listener from handling new 
> connections. Both slow handlers and slow readers can block the whole 
> processing pipeline, and I need to know which one it is. I think *a queue 
> that computes the qps, writes a log when the queue is full and can be 
> replaced easily* will help. 
> I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. 
> Using it for the Reader's queue makes the reader queue runtime-swappable. The 
> qps computing job could be done by implementing a subclass of 
> LinkedBlockingQueue that does the computing while put/take/... happens. The 
> qps data will show up on jmx.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-16403) Start a new statistical rpc queue and make the Reader's pendingConnection queue runtime-replaceable

2019-07-01 Thread Jinglun (JIRA)
Jinglun created HADOOP-16403:


 Summary: Start a new statistical rpc queue and make the Reader's 
pendingConnection queue runtime-replaceable
 Key: HADOOP-16403
 URL: https://issues.apache.org/jira/browse/HADOOP-16403
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Jinglun


I have an HA cluster with 2 NameNodes. The NameNode's meta is quite big, so 
after the active dies it takes the standby more than 40s to become active. 
Many requests (tcp connect requests and rpc requests) from Datanodes, clients 
and zkfc time out and start retrying. The sudden request flood lasts for the 
next 2 minutes until finally all requests are either handled or run out of 
retries. 
Adjusting the rpc related settings might strengthen the NameNode and solve 
this problem, and the key point is finding the bottleneck. The rpc server can 
be described as below:
{noformat}
Listener -> Readers' queues -> Readers -> callQueue -> Handlers{noformat}
By sampling some failed clients, I find many of them got ConnectException, 
caused by a tcp connect request that went unanswered for 20s. I think maybe 
the reader queue is full and blocks the listener from handling new 
connections. Both slow handlers and slow readers can block the whole 
processing pipeline, and I need to know which one it is. I think *a queue 
that computes the qps, writes a log when the queue is full and can be 
replaced easily* will help. 
I found the nice work in HADOOP-10302 implementing a runtime-swapped queue. 
Using it for the Reader's queue makes the reader queue runtime-swappable. The 
qps computing job could be done by implementing a subclass of 
LinkedBlockingQueue that does the computing while put/take/... happens. The 
qps data will show up on jmx.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Resolved] (HADOOP-16348) Remove redundant code when verify quota.

2019-06-04 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-16348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun resolved HADOOP-16348.
--
  Resolution: Abandoned
Release Note: Should be filed under HDFS.

> Remove redundant code when verify quota.
> 
>
> Key: HADOOP-16348
> URL: https://issues.apache.org/jira/browse/HADOOP-16348
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.1.1
>Reporter: Jinglun
>Priority: Minor
>
> DirectoryWithQuotaFeature.verifyQuotaByStorageType() does the job of 
> verifying quota. The leading call to isQuotaByStorageTypeSet() is redundant 
> because the per-StorageType check inside the for-each loop already does the 
> same job.
> {code:java}
> if (!isQuotaByStorageTypeSet()) { // REDUNDANT.
>   return;
> }
> for (StorageType t: StorageType.getTypesSupportingQuota()) {
>   if (!isQuotaByStorageTypeSet(t)) { // CHECK FOR EACH STORAGETYPE.
>     continue;
>   }
>   if (Quota.isViolated(quota.getTypeSpace(t), usage.getTypeSpace(t),
>       typeDelta.get(t))) {
>     throw new QuotaByStorageTypeExceededException(
>         quota.getTypeSpace(t), usage.getTypeSpace(t) + typeDelta.get(t), t);
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-16348) Remove redundant code when verify quota.

2019-06-04 Thread Jinglun (JIRA)
Jinglun created HADOOP-16348:


 Summary: Remove redundant code when verify quota.
 Key: HADOOP-16348
 URL: https://issues.apache.org/jira/browse/HADOOP-16348
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.1.1
Reporter: Jinglun


DirectoryWithQuotaFeature.verifyQuotaByStorageType() does the job of verifying 
quota. The leading call to isQuotaByStorageTypeSet() is redundant because the 
per-StorageType check inside the for-each loop already does the same job.
{code:java}
if (!isQuotaByStorageTypeSet()) { // REDUNDANT.
  return;
}
for (StorageType t: StorageType.getTypesSupportingQuota()) {
  if (!isQuotaByStorageTypeSet(t)) { // CHECK FOR EACH STORAGETYPE.
    continue;
  }
  if (Quota.isViolated(quota.getTypeSpace(t), usage.getTypeSpace(t),
      typeDelta.get(t))) {
    throw new QuotaByStorageTypeExceededException(
        quota.getTypeSpace(t), usage.getTypeSpace(t) + typeDelta.get(t), t);
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15946) the Connection thread should notify all calls in finally clause before quit.

2018-11-22 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15946:
-
Attachment: HADOOP-15946.patch

> the Connection thread should notify all calls in finally clause before quit.
> 
>
> Key: HADOOP-15946
> URL: https://issues.apache.org/jira/browse/HADOOP-15946
> Project: Hadoop Common
>  Issue Type: Improvement
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-15946.patch, issue-replay.patch
>
>
> Threads that call Client.call() would wait forever unless the connection 
> thread notifies them, so the connection thread should try its best to notify 
> them before it quits.
> In Connection.close(), if any Throwable occurs before cleanupCalls(), the 
> connection thread will quit directly and leave all the waiting threads 
> waiting forever. So I think doing cleanupCalls() in a finally clause might be 
> a good idea.
> I met this problem when I started a hadoop 2.6 DataNode with 8 block pools. 
> The DN successfully reported to 7 Namespaces and failed at the last Namespace 
> because the connection thread of the heartbeat rpc got an "OOME: Direct 
> buffer memory" and quit without calling cleanupCalls().
> I think we can move cleanupCalls() to a finally clause as a protection. I 
> notice that in HADOOP-10940 the closing of the stream was changed to 
> IOUtils.closeStream(ipcStreams), which catches all Throwables, so the problem 
> I met was fixed. 
> issue-replay.patch simulates the case I described above.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-15946) the Connection thread should notify all calls in finally clause before quit.

2018-11-22 Thread Jinglun (JIRA)
Jinglun created HADOOP-15946:


 Summary: the Connection thread should notify all calls in finally 
clause before quit.
 Key: HADOOP-15946
 URL: https://issues.apache.org/jira/browse/HADOOP-15946
 Project: Hadoop Common
  Issue Type: Improvement
Reporter: Jinglun
 Attachments: issue-replay.patch

Threads that call Client.call() would wait forever unless the connection thread 
notifies them, so the connection thread should try its best to notify them 
before it quits.

In Connection.close(), if any Throwable occurs before cleanupCalls(), the 
connection thread will quit directly and leave all the waiting threads waiting 
forever. So I think doing cleanupCalls() in a finally clause might be a good 
idea.

I met this problem when I started a hadoop 2.6 DataNode with 8 block pools. The 
DN successfully reported to 7 Namespaces and failed at the last Namespace 
because the connection thread of the heartbeat rpc got an "OOME: Direct buffer 
memory" and quit without calling cleanupCalls().

I think we can move cleanupCalls() to a finally clause as a protection. I 
notice that in HADOOP-10940 the closing of the stream was changed to 
IOUtils.closeStream(ipcStreams), which catches all Throwables, so the problem I 
met was fixed. 

issue-replay.patch simulates the case I described above.
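
A minimal sketch of the proposed protection, assuming only the general shape of 
Connection.close(); closeStreamsAndSocket() is a hypothetical placeholder for 
the real cleanup work, and the point is only the placement of cleanupCalls() in 
the finally clause:
{code:java}
// Illustrative sketch only; the real Client.Connection does much more.
public class ConnectionCloseSketch {
  void close() {
    try {
      // Closing streams / releasing direct buffers may throw a Throwable
      // (for example "OOME: Direct buffer memory").
      closeStreamsAndSocket();
    } finally {
      // Always wake up the threads blocked in Client.call() before the
      // connection thread quits, otherwise they wait forever.
      cleanupCalls();
    }
  }

  // Hypothetical placeholders for the real cleanup and notification work.
  void closeStreamsAndSocket() { }
  void cleanupCalls() { }
}
{code}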



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2018-06-27 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0002.patch
Status: Patch Available  (was: Open)

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch, HADOOP-15565.0002.patch
>
>
> When we create a ViewFileSystem, all its child filesystems will be cached by 
> FileSystem.CACHE. Unless we close these child filesystems, they will stay in 
> FileSystem.CACHE forever.
> I think we should let FileSystem.CACHE cache the ViewFileSystem only, and let 
> the ViewFileSystem cache all its child filesystems. Then we can close a 
> ViewFileSystem without leaks and without affecting other ViewFileSystems.
> I found this problem because I need to re-login to kerberos and renew the 
> ViewFileSystem periodically. Because FileSystem.CACHE.Key is based on 
> UserGroupInformation, which changes every time I re-login, I can't reuse the 
> cached child filesystems when I create a new ViewFileSystem. And because 
> ViewFileSystem.close does nothing but remove itself from the cache, I leak 
> all its child filesystems in the cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2018-06-27 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Status: Open  (was: Patch Available)

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch
>
>
> When we create a ViewFileSystem, all its child filesystems will be cached by 
> FileSystem.CACHE. Unless we close these child filesystems, they will stay in 
> FileSystem.CACHE forever.
> I think we should let FileSystem.CACHE cache the ViewFileSystem only, and let 
> the ViewFileSystem cache all its child filesystems. Then we can close a 
> ViewFileSystem without leaks and without affecting other ViewFileSystems.
> I found this problem because I need to re-login to kerberos and renew the 
> ViewFileSystem periodically. Because FileSystem.CACHE.Key is based on 
> UserGroupInformation, which changes every time I re-login, I can't reuse the 
> cached child filesystems when I create a new ViewFileSystem. And because 
> ViewFileSystem.close does nothing but remove itself from the cache, I leak 
> all its child filesystems in the cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2018-06-26 Thread Jinglun (JIRA)


 [ 
https://issues.apache.org/jira/browse/HADOOP-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HADOOP-15565:
-
Attachment: HADOOP-15565.0001.patch
Status: Patch Available  (was: Open)

> ViewFileSystem.close doesn't close child filesystems and causes FileSystem 
> objects leak.
> 
>
> Key: HADOOP-15565
> URL: https://issues.apache.org/jira/browse/HADOOP-15565
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Jinglun
>Priority: Major
> Attachments: HADOOP-15565.0001.patch
>
>
> When we create a ViewFileSystem, all its child filesystems will be cached by 
> FileSystem.CACHE. Unless we close these child filesystems, they will stay in 
> FileSystem.CACHE forever.
> I think we should let FileSystem.CACHE cache the ViewFileSystem only, and let 
> the ViewFileSystem cache all its child filesystems. Then we can close a 
> ViewFileSystem without leaks and without affecting other ViewFileSystems.
> I found this problem because I need to re-login to kerberos and renew the 
> ViewFileSystem periodically. Because FileSystem.CACHE.Key is based on 
> UserGroupInformation, which changes every time I re-login, I can't reuse the 
> cached child filesystems when I create a new ViewFileSystem. And because 
> ViewFileSystem.close does nothing but remove itself from the cache, I leak 
> all its child filesystems in the cache.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-15565) ViewFileSystem.close doesn't close child filesystems and causes FileSystem objects leak.

2018-06-26 Thread Jinglun (JIRA)
Jinglun created HADOOP-15565:


 Summary: ViewFileSystem.close doesn't close child filesystems and 
causes FileSystem objects leak.
 Key: HADOOP-15565
 URL: https://issues.apache.org/jira/browse/HADOOP-15565
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Jinglun


When we create a ViewFileSystem, all its child filesystems will be cached by 
FileSystem.CACHE. Unless we close these child filesystems, they will stay in 
FileSystem.CACHE forever.
I think we should let FileSystem.CACHE cache the ViewFileSystem only, and let 
the ViewFileSystem cache all its child filesystems. Then we can close a 
ViewFileSystem without leaks and without affecting other ViewFileSystems.
I found this problem because I need to re-login to kerberos and renew the 
ViewFileSystem periodically. Because FileSystem.CACHE.Key is based on 
UserGroupInformation, which changes every time I re-login, I can't reuse the 
cached child filesystems when I create a new ViewFileSystem. And because 
ViewFileSystem.close does nothing but remove itself from the cache, I leak all 
its child filesystems in the cache.
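
A rough sketch of the direction described above, not the attached patch: let 
the view own its child filesystems (created with FileSystem.newInstance() so 
they bypass FileSystem.CACHE) and close them when the view itself is closed. 
The class and method names below are illustrative:
{code:java}
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

/**
 * Illustrative sketch only: own the child filesystems instead of relying on
 * FileSystem.CACHE, and close them together with the view.
 */
public class ChildFsHolder implements java.io.Closeable {
  private final List<FileSystem> children = new ArrayList<>();

  /** Create a child fs that bypasses FileSystem.CACHE and is owned here. */
  public FileSystem openChild(URI uri, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.newInstance(uri, conf);
    children.add(fs);
    return fs;
  }

  /** Closing the view closes its children, so nothing leaks in the cache. */
  @Override
  public void close() throws IOException {
    IOException first = null;
    for (FileSystem fs : children) {
      try {
        fs.close();
      } catch (IOException e) {
        if (first == null) {
          first = e;
        }
      }
    }
    if (first != null) {
      throw first;
    }
  }
}
{code}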



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org