[jira] [Commented] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2019-07-09 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881469#comment-16881469
 ] 

Yongjun Zhang commented on HDFS-9178:
-

Hi [~kihwal], many thanks for the work here!

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.8.0, 2.7.2, 2.6.4, 3.0.0-alpha1
>
> Attachments: 002-HDFS-9178.branch-2.6.patch, 
> HDFS-9178.branch-2.6.patch, HDFS-9178.patch
>
>
> When a non-leaf datanode in a pipeline is slow at or stuck on disk I/O, the 
> downstream node can time out on reading packets, since even the heartbeat 
> packets will not be relayed down.
> The packet read timeout is set in {{DataXceiver#run()}}:
> {code}
>   peer.setReadTimeout(dnConf.socketTimeout);
> {code}
> When the downstream node times out and closes the connection to the upstream 
> node, the upstream node's {{PacketResponder}} gets an {{EOFException}} and 
> sends an ack upstream with the downstream node's status set to {{ERROR}}. 
> This causes the client to exclude the downstream node, even though the 
> upstream node was the one that got stuck.
> The connection to the downstream node has a longer timeout, so the downstream 
> node will always time out first. The downstream timeout is set in 
> {{writeBlock()}}:
> {code}
>   int timeoutValue = dnConf.socketTimeout +
>       (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
>   int writeTimeout = dnConf.socketWriteTimeout +
>       (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
>   NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
>   OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
>       writeTimeout);
> {code}
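
To make the ordering concrete, here is a small worked sketch (my own, not from 
the patch; it assumes the stock defaults of dfs.client.socket-timeout = 60s and 
HdfsConstants.READ_TIMEOUT_EXTENSION = 5s) for a three-node pipeline A -> B -> C:

{code:java}
public class PipelineTimeouts {
  // Assumed defaults: dfs.client.socket-timeout and READ_TIMEOUT_EXTENSION.
  static final int SOCKET_TIMEOUT_MS = 60_000;
  static final int READ_TIMEOUT_EXTENSION_MS = 5_000;

  public static void main(String[] args) {
    // targets.length is the number of remaining downstream nodes at each hop.
    int aWaitsOnB = SOCKET_TIMEOUT_MS + READ_TIMEOUT_EXTENSION_MS * 2; // 70s
    int bWaitsOnC = SOCKET_TIMEOUT_MS + READ_TIMEOUT_EXTENSION_MS * 1; // 65s
    int cPacketRead = SOCKET_TIMEOUT_MS;                               // 60s
    System.out.printf("A->B %d ms, B->C %d ms, C packet read %d ms%n",
        aWaitsOnB, bWaitsOnC, cPacketRead);
    // If B stalls on disk I/O, C's 60s packet-read timeout fires first,
    // C drops the connection, and B's PacketResponder reports the healthy
    // node C as ERROR.
  }
}
{code}

So the leaf's plain {{socketTimeout}} is always shorter than any upstream node's 
extended timeout, which is exactly why the healthy leaf gets blamed.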



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2019-07-09 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-9178:

Description: 
When a non-leaf datanode in a pipeline is slow at or stuck on disk I/O, the 
downstream node can time out on reading packets, since even the heartbeat 
packets will not be relayed down.

The packet read timeout is set in {{DataXceiver#run()}}:

{code}
  peer.setReadTimeout(dnConf.socketTimeout);
{code}

When the downstream node times out and closes the connection to the upstream 
node, the upstream node's {{PacketResponder}} gets an {{EOFException}} and 
sends an ack upstream with the downstream node's status set to {{ERROR}}. 
This causes the client to exclude the downstream node, even though the 
upstream node was the one that got stuck.

The connection to the downstream node has a longer timeout, so the downstream 
node will always time out first. The downstream timeout is set in 
{{writeBlock()}}:
{code}
  int timeoutValue = dnConf.socketTimeout +
      (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
  int writeTimeout = dnConf.socketWriteTimeout +
      (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
  NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
  OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
      writeTimeout);
{code}

  was:
When non-leaf datanode in a pipeline is slow on or stuck at disk I/O, the 
downstream node can timeout on reading packet since even the heartbeat packets 
will not be relayed down.  

The packet read timeout is set in {{DataXceiver#run()}}:

{code}
  peer.setReadTimeout(dnConf.socketTimeout);
{code}

When the downstream node times out and closes the connection to the upstream, 
the upstream node's {{PacketResponder}} gets {{EOFException}} and it sends an 
ack upstream with the downstream node status set to {{ERROR}}.  This caused the 
client to exclude the downstream node, even thought the upstream node was the 
one got stuck.

The connection to downstream has longer timeout, so the downstream will always 
timeout  first. The downstream timeout is set in {{writeBlock()}}
{code}
  int timeoutValue = dnConf.socketTimeout +
  (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
  int writeTimeout = dnConf.socketWriteTimeout +
  (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
  NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
  OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
  writeTimeout);
{code}


> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.8.0, 2.7.2, 2.6.4, 3.0.0-alpha1
>
> Attachments: 002-HDFS-9178.branch-2.6.patch, 
> HDFS-9178.branch-2.6.patch, HDFS-9178.patch
>
>
> When a non-leaf datanode in a pipeline is slow at or stuck on disk I/O, the 
> downstream node can time out on reading packets, since even the heartbeat 
> packets will not be relayed down.
> The packet read timeout is set in {{DataXceiver#run()}}:
> {code}
>   peer.setReadTimeout(dnConf.socketTimeout);
> {code}
> When the downstream node times out and closes the connection to the upstream 
> node, the upstream node's {{PacketResponder}} gets an {{EOFException}} and 
> sends an ack upstream with the downstream node's status set to {{ERROR}}. 
> This causes the client to exclude the downstream node, even though the 
> upstream node was the one that got stuck.
> The connection to the downstream node has a longer timeout, so the downstream 
> node will always time out first. The downstream timeout is set in 
> {{writeBlock()}}:
> {code}
>   int timeoutValue = dnConf.socketTimeout +
>       (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
>   int writeTimeout = dnConf.socketWriteTimeout +
>       (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
>   NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
>   OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
>       writeTimeout);
> {code}






[jira] [Commented] (HDFS-14083) libhdfs logs errors when opened FS doesn't support ByteBufferReadable

2019-02-26 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778529#comment-16778529
 ] 

Yongjun Zhang commented on HDFS-14083:
--

Hi guys,

I took a look, and I agree with [~tlipcon]'s comments, with some additional 
thoughts:

1. Given that errno is thread-safe (per 
http://www.unix.org/whitepapers/reentrant.html), we should have {{readDirect()}} 
initialize errno to 0, and set it to other values upon failure.

2. Besides the static variables not being thread-safe (Todd pointed out that 
might be OK), their naming is also too generic, since they are intended for 
{{readDirect()}}:
{code}
static time_t last_reported_err_time = 0;
static long last_reported_err_cnt = 0;
{code}
Maybe we can change the variable names to include "_read_direct" to be 
specific?

3. If HADOOP-14603 is fixed, then we will not have this excessive logging 
issue, but it doesn't seem to hurt to have HDFS-14083 as an interim fix, and it 
can stay as is even after the HADOOP-14603 fix.

Wonder if you agree, [~tlipcon] and other folks?

Thanks.

> libhdfs logs errors when opened FS doesn't support ByteBufferReadable
> -
>
> Key: HDFS-14083
> URL: https://issues.apache.org/jira/browse/HDFS-14083
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: libhdfs, native
>Affects Versions: 3.0.3
>Reporter: Pranay Singh
>Assignee: Pranay Singh
>Priority: Minor
> Attachments: HADOOP-15928.001.patch, HADOOP-15928.002.patch, 
> HDFS-14083.003.patch, HDFS-14083.004.patch, HDFS-14083.005.patch, 
> HDFS-14083.006.patch, HDFS-14083.007.patch, HDFS-14083.008.patch, 
> HDFS-14083.009.patch
>
>
> Problem:
> 
> There is excessive error logging when a file is opened by libhdfs 
> (DFSClient/HDFS) in an S3 environment. This happens because buffered read is 
> not yet supported for S3; see HADOOP-14603, "S3A input stream to support 
> ByteBufferReadable".
> The following message is printed repeatedly to the error log/STDERR:
> {code}
> UnsupportedOperationException: Byte-buffer read unsupported by input stream
> java.lang.UnsupportedOperationException: Byte-buffer read unsupported by 
> input stream
> at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:150)
> {code}
> h3. Root cause
> After investigating the issue, it appears that the above exception is printed 
> because, when a file is opened via {{hdfsOpenFileImpl()}}, it calls 
> {{readDirect()}}, which hits this exception.
> h3. Fix:
> Since the HDFS client is not initiating the byte-buffered read, and it 
> happens in an implicit manner, we should not generate the error log when 
> opening a file.
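
For context, the failing call bottoms out in 
{{FSDataInputStream.read(ByteBuffer)}}, which throws unless the wrapped stream 
implements {{ByteBufferReadable}}. Below is a minimal sketch of my own (the 
helper class and its name are hypothetical, not libhdfs code) of the 
check-then-fallback read that avoids triggering the exception:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

public final class ByteBufferReadUtil {
  // Hypothetical helper: use the zero-copy path only when the underlying
  // stream supports it; otherwise fall back to a plain byte[] read, so no
  // UnsupportedOperationException is thrown and nothing is logged.
  static int read(FSDataInputStream in, ByteBuffer buf) throws IOException {
    if (in.getWrappedStream() instanceof ByteBufferReadable) {
      return in.read(buf); // dispatches to ByteBufferReadable.read
    }
    byte[] tmp = new byte[buf.remaining()];
    int n = in.read(tmp, 0, tmp.length);
    if (n > 0) {
      buf.put(tmp, 0, n);
    }
    return n;
  }
}
{code}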






[jira] [Updated] (HDFS-14118) Support using DNS to resolve nameservices to IP addresses

2019-02-23 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-14118:
-
   Resolution: Fixed
Fix Version/s: 3.3.0
   Status: Resolved  (was: Patch Available)

> Support using DNS to resolve nameservices to IP addresses
> -
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.024.patch, HDFS-14118.patch
>
>
> In router-based federation (RBF), clients need to know about routers to 
> talk to the HDFS cluster (obviously), and updating routers (adding/removing) 
> requires a config change in every client, which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.
> See the attached design document for details about the proposed solution.






[jira] [Commented] (HDFS-14118) Support using DNS to resolve nameservices to IP addresses

2019-02-23 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775976#comment-16775976
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Hi [~fengnanli], would you please add content to the "Release Note" section of 
the jira? Thanks.

> Support using DNS to resolve nameservices to IP addresses
> -
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.024.patch, HDFS-14118.patch
>
>
> In router-based federation (RBF), clients need to know about routers to 
> talk to the HDFS cluster (obviously), and updating routers (adding/removing) 
> requires a config change in every client, which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.
> See the attached design document for details about the proposed solution.






[jira] [Updated] (HDFS-14118) Support using DNS to resolve nameservices to IP addresses

2019-02-23 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-14118:
-
Hadoop Flags: Reviewed

> Support using DNS to resolve nameservices to IP addresses
> -
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.024.patch, HDFS-14118.patch
>
>
> In router-based federation (RBF), clients need to know about routers to 
> talk to the HDFS cluster (obviously), and updating routers (adding/removing) 
> requires a config change in every client, which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.
> See the attached design document for details about the proposed solution.






[jira] [Commented] (HDFS-14118) Support using DNS to resolve nameservices to IP addresses

2019-02-23 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775974#comment-16775974
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Committed to trunk.

Thanks [~fengnanli] for the contribution and [~elgoiri] for the review!

> Support using DNS to resolve nameservices to IP addresses
> -
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.024.patch, HDFS-14118.patch
>
>
> In router-based federation (RBF), clients need to know about routers to 
> talk to the HDFS cluster (obviously), and updating routers (adding/removing) 
> requires a config change in every client, which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.
> See the attached design document for details about the proposed solution.






[jira] [Updated] (HDFS-14118) Support using DNS to resolve nameservices to IP addresses

2019-02-23 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-14118:
-
Description: 
In router based federation (RBF), clients will need to know about routers to 
talk to the HDFS cluster (obviously), and having routers updating 
(adding/removing) will have to make config change in every client, which is a 
painful process.

DNS can be used here to resolve the single domain name clients knows to a list 
of routers in the current config. However, DNS won't be able to consider only 
resolving to the working router based on certain health thresholds.

There are some ways about how this can be solved. One way is to have a separate 
script to regularly check the status of the router and update the DNS records 
if a router fails the health thresholds. In this way, security might be 
carefully considered for this way. Another way is to have the client do the 
normal connecting/failover after they get the list of routers, which requires 
the change of current failover proxy provider.

See the attached design document for details about the proposed solution.
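
To illustrate the resolution step, here is a minimal standalone sketch of my 
own (it uses a plain JDK lookup rather than the actual resolver class, and the 
domain ns1.example.com is hypothetical) that turns one nameservice domain name 
into a list of router addresses:

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveNameservice {
  public static void main(String[] args) throws UnknownHostException {
    // Hypothetical domain whose DNS A records point at all the routers.
    InetAddress[] routers = InetAddress.getAllByName("ns1.example.com");
    for (InetAddress r : routers) {
      // Each resolved address becomes one failover proxy target; enabling
      // dfs.client.failover.random.order then spreads load across them.
      System.out.println(r.getHostAddress());
    }
  }
}
{code}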

  was:
Clients need to know about routers to talk to the HDFS cluster (obviously), 
and updating routers (adding/removing) requires a change in every client, 
which is a painful process.

DNS can be used here to resolve the single domain name clients know to a list 
of routers in the current config. However, DNS won't be able to resolve only 
to working routers based on certain health thresholds.

There are some ways this can be solved. One way is to have a separate script 
regularly check the status of the routers and update the DNS records if a 
router fails the health thresholds; security would need to be carefully 
considered for this approach. Another way is to have the client do the normal 
connecting/failover after it gets the list of routers, which requires changing 
the current failover proxy provider.


> Support using DNS to resolve nameservices to IP addresses
> -
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.024.patch, HDFS-14118.patch
>
>
> In router-based federation (RBF), clients need to know about routers to 
> talk to the HDFS cluster (obviously), and updating routers (adding/removing) 
> requires a config change in every client, which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.
> See the attached design document for details about the proposed solution.






[jira] [Updated] (HDFS-14118) Support using DNS to resolve nameservices to IP addresses

2019-02-23 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-14118:
-
Summary: Support using DNS to resolve nameservices to IP addresses  (was: 
Use DNS to resolve Namenodes and Routers)

> Support using DNS to resolve nameservices to IP addresses
> -
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.024.patch, HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-22 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775817#comment-16775817
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Hi [~fengnanli],

Seems we have quite a few flaky tests; I manually ran all the failed tests and 
they passed. I saw another place to fix in hdfs-default.xml in rev 23, so I'm 
uploading rev 24 instead of asking you to iterate again. The change in rev 24 
is very minor; it's just to make sure the version we are committing is the 
same as the last version uploaded here.

Hi [~elgoiri], I saw your comment 
[here|https://issues.apache.org/jira/browse/HDFS-14118?focusedCommentId=16761094&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16761094],
 so I am taking it as a +1 from you, and I will go ahead and commit it soon.

Thanks.

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Updated] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-22 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-14118:
-
Attachment: HDFS-14118.024.patch

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.024.patch, HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-22 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775724#comment-16775724
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Thanks [~fengnanli]! +1 on rev 23, pending the Jenkins tests.

Hi [~elgoiri], wonder if you have further comments? Thanks.

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.023.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Comment Edited] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-22 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775474#comment-16775474
 ] 

Yongjun Zhang edited comment on HDFS-14118 at 2/22/19 6:33 PM:
---

Hi [~fengnanli],

Thanks for following up. I would like to suggest some small changes to the 
config descriptions. Hope the suggested changes make sense to you:
  
{code:java}
<property>
  <name>dfs.client.failover.random.order</name>
  <value>false</value>
  <description>
    Determines if the failover proxies are picked in random order instead of the
    configured order. Random order may be enabled for better load balancing
    or to avoid always hitting failed ones first if the failed ones appear in the
    beginning of the configured or resolved list.
    For example, in the case of multiple RBF routers or ObserverNameNodes,
    it is recommended to turn it on for load balancing.
    The config name can be extended with an optional nameservice ID
    (of form dfs.client.failover.random.order[.nameservice]) in case multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolve-needed</name>
  <value>false</value>
  <description>
    Determines if the given nameservice address is a domain name which needs to
    be resolved (using the resolver configured by
    dfs.client.failover.resolver.impl).
    This adds a transparency layer in the client so the physical server address
    can change without changing the client. The config name can be extended with
    an optional nameservice ID (of form
    dfs.client.failover.resolve-needed[.nameservice])
    to configure specific nameservices when multiple nameservices exist.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolver.impl</name>
  <value>org.apache.hadoop.net.DNSDomainNameResolver</value>
  <description>
    Determines what class to use to resolve the nameservice domain name to a
    specific machine address. The config name can be extended with an optional
    nameservice ID (of form dfs.client.failover.resolver.impl[.nameservice]) to
    configure specific nameservices when multiple nameservices exist.
  </description>
</property>
{code}
 


was (Author: yzhangal):
Hi [~fengnanli],

Thanks for following up. I would like to suggest some small changes to the 
config descriptions. Hope the suggested changes make sense to you:
  
{code:java}
<property>
  <name>dfs.client.failover.random.order</name>
  <value>false</value>
  <description>
    Determines if the failover proxies are picked in random order instead of the
    configured order. Random order may be enabled for better load balancing
    or to avoid always hitting failed ones first if the failed ones appear in the
    beginning of the configured or resolved list.
    For example, in the case of multiple RBF routers or ObserverNameNodes,
    it is recommended to turn it on for load balancing.
    The config name can be extended with an optional nameservice ID
    (of form dfs.client.failover.random.order[.nameservice]) in case multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolve-needed</name>
  <value>false</value>
  <description>
    Determines if the given namenode address is a domain name which needs to
    be resolved (using the resolver configured by
    dfs.client.failover.resolver.impl).
    This adds a transparency layer in the client so the physical namenode address
    can change without changing the client. The config name can be extended with
    an optional nameservice ID (of form
    dfs.client.failover.resolve-needed[.nameservice])
    to configure specific nameservices when multiple nameservices exist.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolver.impl</name>
  <value>org.apache.hadoop.net.DNSDomainNameResolver</value>
  <description>
    Determines what class to use to resolve the nameservice domain name to a
    specific machine address. The config name can be extended with an optional
    nameservice ID (of form dfs.client.failover.resolver.impl[.nameservice]) to
    configure specific nameservices when multiple nameservices exist.
  </description>
</property>
{code}
 

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.patch
>
>
> Clients will need to know about routers to talk to the HDFS cluster 
> (obviously), and having routers 

[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-22 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775474#comment-16775474
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Hi [~fengnanli],

Thanks for following up. I would like to suggest some small changes to the 
config descriptions. Hope the suggested changes make sense to you:
  
{code:java}
<property>
  <name>dfs.client.failover.random.order</name>
  <value>false</value>
  <description>
    Determines if the failover proxies are picked in random order instead of the
    configured order. Random order may be enabled for better load balancing
    or to avoid always hitting failed ones first if the failed ones appear in the
    beginning of the configured or resolved list.
    For example, in the case of multiple RBF routers or ObserverNameNodes,
    it is recommended to turn it on for load balancing.
    The config name can be extended with an optional nameservice ID
    (of form dfs.client.failover.random.order[.nameservice]) in case multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolve-needed</name>
  <value>false</value>
  <description>
    Determines if the given namenode address is a domain name which needs to
    be resolved (using the resolver configured by
    dfs.client.failover.resolver.impl).
    This adds a transparency layer in the client so the physical namenode address
    can change without changing the client. The config name can be extended with
    an optional nameservice ID (of form
    dfs.client.failover.resolve-needed[.nameservice])
    to configure specific nameservices when multiple nameservices exist.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolver.impl</name>
  <value>org.apache.hadoop.net.DNSDomainNameResolver</value>
  <description>
    Determines what class to use to resolve the nameservice domain name to a
    specific machine address. The config name can be extended with an optional
    nameservice ID (of form dfs.client.failover.resolver.impl[.nameservice]) to
    configure specific nameservices when multiple nameservices exist.
  </description>
</property>
{code}
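
As a quick usage illustration (my own sketch; the nameservice ID "mycluster" is 
hypothetical), a client could opt a single nameservice into DNS resolution 
using the per-nameservice key form described above:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ResolverConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Per-nameservice overrides: only "mycluster" resolves via DNS and
    // shuffles the resolved proxies; other nameservices keep the defaults.
    conf.setBoolean("dfs.client.failover.resolve-needed.mycluster", true);
    conf.setBoolean("dfs.client.failover.random.order.mycluster", true);
    conf.set("dfs.client.failover.resolver.impl.mycluster",
        "org.apache.hadoop.net.DNSDomainNameResolver");
    System.out.println(
        conf.getBoolean("dfs.client.failover.resolve-needed.mycluster", false));
  }
}
{code}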
 

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.021.patch, HDFS-14118.022.patch, HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Comment Edited] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-21 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774848#comment-16774848
 ] 

Yongjun Zhang edited comment on HDFS-14118 at 2/22/19 7:01 AM:
---

Thanks for the explanation and new rev [~fengnanli].

A few more comments:
 # In hdfs-default.xml, remove the statement about "Random order should be ..." 
for both dfs.client.failover.resolve-needed and 
dfs.client.failover.resolver.impl, since this is already mentioned in the 
description of dfs.client.failover.random.order.
 # The description for dfs.client.failover.random.order can be improved to 
provide some more details about when to enable random and when not to. For 
example, for HA Namenodes, we may not need to enable; for ObserverNodes and RBF 
routers, it may be helpful to enable because we want to spread the load. This 
can be deferred to a new jira if you'd like.
 # Create a new Jira to fix the relevant doc places as [~elgoiri] pointed out 
[here | 
https://issues.apache.org/jira/browse/HDFS-14118?focusedCommentId=16773396&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16773396]
 .

My +1 after that.

Thanks.


was (Author: yzhangal):
Thanks for the explanation and new rev [~fengnanli].

A few more comments:
 # In hdfs-default.xml, remove the statement about "Random order should be ..." 
for both dfs.client.failover.resolve-needed and 
dfs.client.failover.resolver.impl, since this is already mentioned in the 
description of dfs.client.failover.random.order.
 # The description for dfs.client.failover.random.order can be improved to 
provide some more details about when to enable random and when not to. For 
example, for HA Namenodes, we may not need to enable; for ObserverNodes and RBF 
routers, it may be helpful to enable because we want to spread the load. This 
can be deferred to a new jira if you'd like.
 # Create a new Jira to fix the relevant doc places as [~elgoiri] pointed out 
here .

My +1 after that.

Thanks.

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Comment Edited] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-21 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774848#comment-16774848
 ] 

Yongjun Zhang edited comment on HDFS-14118 at 2/22/19 7:03 AM:
---

Thanks for the explanation and new rev [~fengnanli].

A few more comments:
 # In hdfs-default.xml, remove the statement about "Random order should be ..." 
for both dfs.client.failover.resolve-needed and 
dfs.client.failover.resolver.impl, since this is already mentioned in the 
description of dfs.client.failover.random.order.
 # The description for dfs.client.failover.random.order can be improved to 
provide some more details about when to enable random and when not to. For 
example, for HA Namenodes, we may not need to enable; for ObserverNodes and RBF 
routers, it may be helpful to enable because we want to spread the load. This 
can be deferred to a new jira if you'd like.
 # Create a new Jira to fix the relevant doc places as [~elgoiri] pointed out 
here , plus relevant doc about read on standbyNN.

My +1 after that.

Thanks.


was (Author: yzhangal):
Thanks for the explanation and new rev [~fengnanli].

A few more comments:
 # In hdfs-default.xml, remove the statement about "Random order should be ..." 
for both dfs.client.failover.resolve-needed and 
dfs.client.failover.resolver.impl, since this is already mentioned in the 
description of dfs.client.failover.random.order.
 # The description for dfs.client.failover.random.order can be improved to 
provide some more details about when to enable random and when not to. For 
example, for HA Namenodes, we may not need to enable; for ObserverNodes and RBF 
routers, it may be helpful to enable because we want to spread the load. This 
can be deferred to a new jira if you'd like.
 # Create a new Jira to fix the relevant doc places as [~elgoiri] pointed out 
[here | 
https://issues.apache.org/jira/browse/HDFS-14118?focusedCommentId=16773396&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16773396]
 .

My +1 after that.

Thanks.

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-21 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774848#comment-16774848
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Thanks for the explanation and new rev [~fengnanli].

A few more comments:

 # In hdfs-default.xml, remove the statement about "Random order should be ..." 
for both dfs.client.failover.resolve-needed and 
dfs.client.failover.resolver.impl, since this is already mentioned in the 
description of dfs.client.failover.random.order. 
 # The description for dfs.client.failover.random.order can be improved to 
provide some more details about when to enable random and when not to. For 
example, for HA Namenodes, we may not need to enable; for ObserverNodes and RBF 
routers, it may be helpful to enable because we want to spread the load. This 
can be deferred to a new jira if you'd like.
 # Create a new Jira to fix the relevant doc places as [~elgoiri] pointed out 
[here | 
https://issues.apache.org/jira/browse/HDFS-14118?focusedCommentId=16773396&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16773396].
 


My +1 after that.

Thanks.


 

 

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and updating routers (adding/removing) requires a change in every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name clients know to a 
> list of routers in the current config. However, DNS won't be able to resolve 
> only to working routers based on certain health thresholds.
> There are some ways this can be solved. One way is to have a separate 
> script regularly check the status of the routers and update the DNS records 
> if a router fails the health thresholds; security would need to be carefully 
> considered for this approach. Another way is to have the client do the normal 
> connecting/failover after it gets the list of routers, which requires 
> changing the current failover proxy provider.






[jira] [Comment Edited] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-21 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774848#comment-16774848
 ] 

Yongjun Zhang edited comment on HDFS-14118 at 2/22/19 6:58 AM:
---

Thanks for the explanation and new rev [~fengnanli].

A few more comments:
 # In hdfs-default.xml, remove the statement about "Random order should be ..." 
for both dfs.client.failover.resolve-needed and 
dfs.client.failover.resolver.impl, since this is already mentioned in the 
description of dfs.client.failover.random.order.
 # The description for dfs.client.failover.random.order can be improved to 
provide some more details about when to enable random and when not to. For 
example, for HA Namenodes, we may not need to enable; for ObserverNodes and RBF 
routers, it may be helpful to enable because we want to spread the load. This 
can be deferred to a new jira if you'd like.
 # Create a new Jira to fix the relevant doc places as [~elgoiri] pointed out 
here .

My +1 after that.

Thanks.


was (Author: yzhangal):
Thanks for the explanation and new rev [~fengnanli].

A few more comments:

 # In hdfs-default.xml, remove the statement about "Random order should be ..." 
for both dfs.client.failover.resolve-needed and 
dfs.client.failover.resolver.impl, since this is already mentioned in the 
description of dfs.client.failover.random.order. 
 # The description for dfs.client.failover.random.order can be improved to 
provide some more detail about when to enable random order and when not to. For 
example, for HA NameNodes we may not need to enable it; for ObserverNodes and 
RBF routers it may be helpful to enable it because we want to spread the load. 
This can be deferred to a new jira if you'd like.
 # Create a new Jira to fix the relevant doc places as [~elgoiri] pointed out 
[here | 
https://issues.apache.org/jira/browse/HDFS-14118?focusedCommentId=16773396=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16773396].
 


My +1 after that.

Thanks.


 

 

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and any router update (adding/removing) forces a change on every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name that clients know into 
> the list of routers in the current config. However, DNS cannot restrict 
> resolution to only the healthy routers based on certain health thresholds.
> There are several ways this can be solved. One way is to have a separate 
> script regularly check the status of each router and update the DNS records 
> when a router fails the health thresholds; security would need to be 
> carefully considered for this approach. Another way is to have the client do 
> the normal connecting/failover after it gets the list of routers, which 
> requires changing the current failover proxy provider.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-21 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774569#comment-16774569
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Thanks [~fengnanli], would you please address the questions I asked in #4? 
Thanks.

 

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs-1.pdf, HDFS design doc_ Single domain name for clients 
> - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.020.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and any router update (adding/removing) forces a change on every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name that clients know into 
> the list of routers in the current config. However, DNS cannot restrict 
> resolution to only the healthy routers based on certain health thresholds.
> There are several ways this can be solved. One way is to have a separate 
> script regularly check the status of each router and update the DNS records 
> when a router fails the health thresholds; security would need to be 
> carefully considered for this approach. Another way is to have the client do 
> the normal connecting/failover after it gets the list of routers, which 
> requires changing the current failover proxy provider.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-21 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773661#comment-16773661
 ] 

Yongjun Zhang edited comment on HDFS-14118 at 2/21/19 3:37 PM:
---

Hi Guys,

I did one round of review, and it largely looks good to me, except for some 
cosmetic things. Good work [~fengnanli]!

To save iterations, please consider addressing the question in #4 first before 
having a new revision.

1. DomainNameResolver.java

The class name here is generic; however, the comment states that this class
 is for the namenode. The jira also talks about routers (for RBF). Suggest changing
 the comment "for the failover proxy to get IP addresses for the namenode" to
 "for failover proxies to get IP addresses of the associated servers 
(NameNodes, RBF routers etc)"
{code:java}
/**
 * This interface provides methods for the failover proxy to get IP addresses
 * for the namenode. Implementations will use their own service discovery
 * mechanism, DNS, Zookeeper etc.
 */
public interface DomainNameResolver {
{code}
2. core-default.xml
 The description can be changed to 
 "The implementation of DomainNameResolver used for service (NameNodes,
 RBF Routers etc) discovery. The default implementation 
 org.apache.hadoop.net.DNSDomainNameResolver returns all IP addresses associated
 with the input domain name of the services by querying the underlying DNS."

3. AbstractNNFailoverProxyProvider.java:
{code:java}
String host = nameNodeUri.getHost();
String configKeyWithHost =
    HdfsClientConfigKeys.Failover.RESOLVE_ADDRESS_NEEDED_KEY + "." + host;
boolean resolveNeeded = conf.getBoolean(configKeyWithHost,
    HdfsClientConfigKeys.Failover.RESOLVE_ADDRESS_NEEDED_DEFAULT);
{code}
Most of the time the 'host' here is the NN/router nameservice ID instead of a 
host name.
Suggest adding a comment here like:
 // 'host' here is usually the ID of the nameservice when address resolving is 
needed.
 to make it easier to read.

4. hdfs-default.xml

Suggest making the following change (some typo fixes, and some re-arrangement). 
Also,
 would you please explain when and why "random order should be enabled" and 
when
 it's not needed? It seems unclear here.
{code:xml}
<property>
  <name>dfs.client.failover.resolve-needed</name>
  <value>false</value>
  <description>
    Determines if the given namenode address is a domain name which needs to
    be resolved (using the resolver configured by
    dfs.client.failover.resolver.impl).
    This adds a transparency layer in the client so the physical namenode
    address can change without changing the client. The config key can be
    appended with an optional nameservice ID (of the form
    dfs.client.failover.resolve-needed[.nameservice]) when multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolver.impl</name>
  <value>org.apache.hadoop.net.DNSDomainNameResolver</value>
  <description>
    Determines what service the resolving will use to map a given namenode
    domain name to a specific namenode machine address. The config key can be
    appended with an optional nameservice ID (of the form
    dfs.client.failover.resolver.impl[.nameservice]) when multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>
{code}
BTW, the added doc can be further improved by adding a section on how to use 
the feature.

Thanks.
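
For illustration, a minimal sketch of wiring these settings up programmatically 
on the client side (the nameservice name ns1 is an assumption, not from the 
patch):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ResolverConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed nameservice "ns1"; the keys follow the per-nameservice
    // pattern discussed above.
    conf.setBoolean("dfs.client.failover.resolve-needed.ns1", true);
    conf.set("dfs.client.failover.resolver.impl.ns1",
        "org.apache.hadoop.net.DNSDomainNameResolver");
    // Spreading load across the resolved routers usually wants random
    // order enabled as well.
    conf.setBoolean("dfs.client.failover.random.order.ns1", true);
    System.out.println(conf.get("dfs.client.failover.resolver.impl.ns1"));
  }
}
{code}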


was (Author: yzhangal):
Hi Guys,

I did one round of review, and it largely looks good to me, except for some 
cosmetic things. Good work [~fengnanli]!

To save iterations, please consider addressing the question in #4 first before 
having a new revision.

1. DomainNameResolver.java

The class name here is generic; however, the comment states that this class
 is for the namenode. The jira also talks about routers (for RBF). Suggest changing
 the comment "for the failover proxy to get IP addresses for the namenode" to
 "for failover proxies (for HA NameNodes, RBF routers etc) to get IP addresses
 of the associated servers"
{code:java}
/**
 * This interface provides methods for the failover proxy to get IP addresses
 * for the namenode. Implementations will use their own service discovery
 * mechanism, DNS, Zookeeper etc.
 */
public interface DomainNameResolver {
{code}
2. core-default.xml
 The description can be changed to 
 "The implementation of DomainNameResolver used for service (HA NameNodes,
 RBF Routers etc) discovery. The default 

[jira] [Comment Edited] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-21 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773661#comment-16773661
 ] 

Yongjun Zhang edited comment on HDFS-14118 at 2/21/19 3:18 PM:
---

Hi Guys,

I did one round of review, and it largely looks good to me, except for some 
cosmetic things. Good work [~fengnanli]!

To save iterations, please consider addressing the question in #4 first before 
having a new revision.

1. DomainNameResolver.java

The class name here is generic; however, the comment states that this class
 is for the namenode. The jira also talks about routers (for RBF). Suggest changing
 the comment "for the failover proxy to get IP addresses for the namenode" to
 "for failover proxies (for HA NameNodes, RBF routers etc) to get IP addresses
 of the associated servers"
{code:java}
/**
 * This interface provides methods for the failover proxy to get IP addresses
 * for the namenode. Implementations will use their own service discovery
 * mechanism, DNS, Zookeeper etc.
 */
public interface DomainNameResolver {
{code}
2. core-default.xml
 The description can be changed to 
 "The implementation of DomainNameResolver used for service (HA NameNodes,
 RBF Routers etc) discovery. The default implementation 
 org.apache.hadoop.net.DNSDomainNameResolver returns all IP addresses associated
 with the input domain name of the services by querying the underlying DNS."

3. AbstractNNFailoverProxyProvider.java:
{code:java}
String host = nameNodeUri.getHost();
String configKeyWithHost =
    HdfsClientConfigKeys.Failover.RESOLVE_ADDRESS_NEEDED_KEY + "." + host;
boolean resolveNeeded = conf.getBoolean(configKeyWithHost,
    HdfsClientConfigKeys.Failover.RESOLVE_ADDRESS_NEEDED_DEFAULT);
{code}
Most of the time the 'host' here is the NN/router nameservice ID instead of a 
host name.
 We can change the variable 'host' to 'nameservice' (and 'configKeyWithHost' to
 'configKeyWithNameservice'), OR just add a comment here like:
 // 'host' here would be the name of a nameservice when address resolving
 // is needed.
 to make it easier to read.

4. hdfs-default.xml

Suggest making the following change (some typo fixes, and some re-arrangement). 
Also,
 would you please explain when and why "random order should be enabled" and 
when
 it's not needed? It seems unclear here.
{code:xml}
<property>
  <name>dfs.client.failover.resolve-needed</name>
  <value>false</value>
  <description>
    Determines if the given namenode address is a domain name which needs to
    be resolved (using the resolver configured by
    dfs.client.failover.resolver.impl).
    This adds a transparency layer in the client so the physical namenode
    address can change without changing the client. The config key can be
    appended with an optional nameservice ID (of the form
    dfs.client.failover.resolve-needed[.nameservice]) when multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolver.impl</name>
  <value>org.apache.hadoop.net.DNSDomainNameResolver</value>
  <description>
    Determines what service the resolving will use to map a given namenode
    domain name to a specific namenode machine address. The config key can be
    appended with an optional nameservice ID (of the form
    dfs.client.failover.resolver.impl[.nameservice]) when multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>
{code}
BTW, the added doc can be further improved by adding a section on how to use 
the feature.

Thanks.


was (Author: yzhangal):
Hi Guys,

I did one round of review, and it largely looks good to me, except for some 
cosmetic things. Good work [~fengnanli]!

1. DomainNameResolver.java

The class name here is generic; however, the comment states that this class
 is for the namenode. The jira also talks about routers (for RBF). Suggest changing
 the comment "for the failover proxy to get IP addresses for the namenode" to
 "for failover proxies (for HA NameNodes, RBF routers etc) to get IP addresses
 of the associated servers"
{code:java}
/**
 * This interface provides methods for the failover proxy to get IP addresses
 * for the namenode. Implementations will use their own service discovery
 * mechanism, DNS, Zookeeper etc.
 */
public interface DomainNameResolver {
{code}
2. core-default.xml
 The description can be changed to 
 "The implementation of DomainNameResolver used for service (HA NameNodes,
 RBF Routers etc) discovery. The 

[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-20 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773661#comment-16773661
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Hi Guys,

I did one round of review, and it largely looks good to me, except for some 
cosmetic things. Good work [~fengnanli]!

1. DomainNameResolver.java

The class name here is generic; however, the comment states that this class
 is for the namenode. The jira also talks about routers (for RBF). Suggest changing
 the comment "for the failover proxy to get IP addresses for the namenode" to
 "for failover proxies (for HA NameNodes, RBF routers etc) to get IP addresses
 of the associated servers"
{code:java}
/**
 * This interface provides methods for the failover proxy to get IP addresses
 * for the namenode. Implementations will use their own service discovery
 * mechanism, DNS, Zookeeper etc.
 */
public interface DomainNameResolver {
{code}
2. core-default.xml
 The description can be changed to 
 "The implementation of DomainNameResolver used for service (HA NameNodes,
 RBF Routers etc) discovery. The default implementation 
 org.apache.hadoop.net.DNSDomainNameResolver returns all IP addresses associated
 with the input domain name of the services by querying the underlying DNS."

3. AbstractNNFailoverProxyProvider.java:
{code:java}
String host = nameNodeUri.getHost();
String configKeyWithHost =
    HdfsClientConfigKeys.Failover.RESOLVE_ADDRESS_NEEDED_KEY + "." + host;
boolean resolveNeeded = conf.getBoolean(configKeyWithHost,
    HdfsClientConfigKeys.Failover.RESOLVE_ADDRESS_NEEDED_DEFAULT);
{code}
Most of the time the 'host' here is the NN/router nameservice ID instead of a 
host name.
 We can change the variable 'host' to 'nameservice' (and 'configKeyWithHost' to
 'configKeyWithNameservice'), OR just add a comment here like:
 // 'host' here would be the name of a nameservice when address resolving
 // is needed.
 to make it easier to read.

4. hdfs-default.xml

Suggest making the following change (some typo fixes, and some re-arrangement). 
Also,
 would you please explain when and why "random order should be enabled" and 
when
 it's not needed? It seems unclear here.
{code:xml}
<property>
  <name>dfs.client.failover.resolve-needed</name>
  <value>false</value>
  <description>
    Determines if the given namenode address is a domain name which needs to
    be resolved (using the resolver configured by
    dfs.client.failover.resolver.impl).
    This adds a transparency layer in the client so the physical namenode
    address can change without changing the client. The config key can be
    appended with an optional nameservice ID (of the form
    dfs.client.failover.resolve-needed[.nameservice]) when multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>

<property>
  <name>dfs.client.failover.resolver.impl</name>
  <value>org.apache.hadoop.net.DNSDomainNameResolver</value>
  <description>
    Determines what service the resolving will use to map a given namenode
    domain name to a specific namenode machine address. The config key can be
    appended with an optional nameservice ID (of the form
    dfs.client.failover.resolver.impl[.nameservice]) when multiple
    nameservices exist and random order should be enabled for specific
    nameservices.
  </description>
</property>
{code}
BTW, the added doc can be further improved by adding a section on how to use 
the feature.

Thanks.

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and any router update (adding/removing) forces a change on every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name that clients know into 
> the list of routers in the current config. However, DNS cannot restrict 
> resolution to only the healthy routers based on certain health 

[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-20 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773629#comment-16773629
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Thanks [~goirix] for pointing out the doc list, and thanks [~fengnanli] for 
providing the doc. I am ok with creating a new task to update the documents. 
But let me finish one round of review and update here soon. Thanks.




> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS design doc_ Single domain name for 
> clients - Google Docs.pdf, HDFS-14118.001.patch, HDFS-14118.002.patch, 
> HDFS-14118.003.patch, HDFS-14118.004.patch, HDFS-14118.005.patch, 
> HDFS-14118.006.patch, HDFS-14118.007.patch, HDFS-14118.008.patch, 
> HDFS-14118.009.patch, HDFS-14118.010.patch, HDFS-14118.011.patch, 
> HDFS-14118.012.patch, HDFS-14118.013.patch, HDFS-14118.014.patch, 
> HDFS-14118.015.patch, HDFS-14118.016.patch, HDFS-14118.017.patch, 
> HDFS-14118.018.patch, HDFS-14118.019.patch, HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and any router update (adding/removing) forces a change on every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name that clients know into 
> the list of routers in the current config. However, DNS cannot restrict 
> resolution to only the healthy routers based on certain health thresholds.
> There are several ways this can be solved. One way is to have a separate 
> script regularly check the status of each router and update the DNS records 
> when a router fails the health thresholds; security would need to be 
> carefully considered for this approach. Another way is to have the client do 
> the normal connecting/failover after it gets the list of routers, which 
> requires changing the current failover proxy provider.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-20 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773337#comment-16773337
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Hi [~fengnanli], 

For the benefit of the community, would you please provide a one or two page 
description about this feature and attach to the jira:

1. Issue to address, case scenarios, preferably with an example or two
2. You mentioned multiple approaches in the jira description; which approach 
did you implement? Please try to provide a bit of detail about each approach, 
pros and cons. It seems you implemented "Another way is to have the client do 
the normal connecting/failover after they get the list of routers, which 
requires the change of current failover proxy provider.", but the jira 
description is not clear about that.
3. Some concrete use scenario examples to explain how your solution works?

Thanks.

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS-14118.001.patch, 
> HDFS-14118.002.patch, HDFS-14118.003.patch, HDFS-14118.004.patch, 
> HDFS-14118.005.patch, HDFS-14118.006.patch, HDFS-14118.007.patch, 
> HDFS-14118.008.patch, HDFS-14118.009.patch, HDFS-14118.010.patch, 
> HDFS-14118.011.patch, HDFS-14118.012.patch, HDFS-14118.013.patch, 
> HDFS-14118.014.patch, HDFS-14118.015.patch, HDFS-14118.016.patch, 
> HDFS-14118.017.patch, HDFS-14118.018.patch, HDFS-14118.019.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and any router update (adding/removing) forces a change on every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name that clients know into 
> the list of routers in the current config. However, DNS cannot restrict 
> resolution to only the healthy routers based on certain health thresholds.
> There are several ways this can be solved. One way is to have a separate 
> script regularly check the status of each router and update the DNS records 
> when a router fails the health thresholds; security would need to be 
> carefully considered for this approach. Another way is to have the client do 
> the normal connecting/failover after it gets the list of routers, which 
> requires changing the current failover proxy provider.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-20 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773337#comment-16773337
 ] 

Yongjun Zhang edited comment on HDFS-14118 at 2/20/19 7:58 PM:
---

Hi [~fengnanli], 

For the benefit of the community, would you please provide a one or two page 
description about this feature and attach to the jira:

1. Issue to address, case scenarios, preferably with an example or two
2. You mentioned multiple approaches in the jira description; which approach 
did you implement? Please try to provide a bit of detail about each approach, 
pros and cons. It seems you implemented "Another way is to have the client do 
the normal connecting/failover after they get the list of routers, which 
requires the change of current failover proxy provider.", but the jira 
description is not clear about that.
3. How to use this feature (class to provide, configurations to set etc)
4. Some concrete use scenario examples to explain how your solution works, and 
how people may extend it to address new needs (possibly new/different class 
implementations)

This will help reviewers, users, and future documentation.

Thanks.


was (Author: yzhangal):
Hi [~fengnanli], 

For the benefit of the community, would you please provide a one or two page 
description about this feature and attach to the jira:

1. Issue to address, case scenarios, preferably with an example or two
2. You mentioned multiple approaches in the jira description; which approach 
did you implement? Please try to provide a bit of detail about each approach, 
pros and cons. It seems you implemented "Another way is to have the client do 
the normal connecting/failover after they get the list of routers, which 
requires the change of current failover proxy provider.", but the jira 
description is not clear about that.
3. Some concrete use scenario examples to explain how your solution works?

Thanks.

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS-14118.001.patch, 
> HDFS-14118.002.patch, HDFS-14118.003.patch, HDFS-14118.004.patch, 
> HDFS-14118.005.patch, HDFS-14118.006.patch, HDFS-14118.007.patch, 
> HDFS-14118.008.patch, HDFS-14118.009.patch, HDFS-14118.010.patch, 
> HDFS-14118.011.patch, HDFS-14118.012.patch, HDFS-14118.013.patch, 
> HDFS-14118.014.patch, HDFS-14118.015.patch, HDFS-14118.016.patch, 
> HDFS-14118.017.patch, HDFS-14118.018.patch, HDFS-14118.019.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and any router update (adding/removing) forces a change on every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name that clients know into 
> the list of routers in the current config. However, DNS cannot restrict 
> resolution to only the healthy routers based on certain health thresholds.
> There are several ways this can be solved. One way is to have a separate 
> script regularly check the status of each router and update the DNS records 
> when a router fails the health thresholds; security would need to be 
> carefully considered for this approach. Another way is to have the client do 
> the normal connecting/failover after it gets the list of routers, which 
> requires changing the current failover proxy provider.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14118) Use DNS to resolve Namenodes and Routers

2019-02-19 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772070#comment-16772070
 ] 

Yongjun Zhang commented on HDFS-14118:
--

Hi [~fengnanli], Thanks for the work here. Sorry for my late response, I'm 
taking a look and will get back asap. 

> Use DNS to resolve Namenodes and Routers
> 
>
> Key: HDFS-14118
> URL: https://issues.apache.org/jira/browse/HDFS-14118
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
> Attachments: DNS testing log, HDFS-14118.001.patch, 
> HDFS-14118.002.patch, HDFS-14118.003.patch, HDFS-14118.004.patch, 
> HDFS-14118.005.patch, HDFS-14118.006.patch, HDFS-14118.007.patch, 
> HDFS-14118.008.patch, HDFS-14118.009.patch, HDFS-14118.010.patch, 
> HDFS-14118.011.patch, HDFS-14118.012.patch, HDFS-14118.013.patch, 
> HDFS-14118.014.patch, HDFS-14118.015.patch, HDFS-14118.016.patch, 
> HDFS-14118.017.patch, HDFS-14118.018.patch, HDFS-14118.019.patch, 
> HDFS-14118.patch
>
>
> Clients need to know about routers to talk to the HDFS cluster (obviously), 
> and any router update (adding/removing) forces a change on every client, 
> which is a painful process.
> DNS can be used here to resolve the single domain name that clients know into 
> the list of routers in the current config. However, DNS cannot restrict 
> resolution to only the healthy routers based on certain health thresholds.
> There are several ways this can be solved. One way is to have a separate 
> script regularly check the status of each router and update the DNS records 
> when a router fails the health thresholds; security would need to be 
> carefully considered for this approach. Another way is to have the client do 
> the normal connecting/failover after it gets the list of routers, which 
> requires changing the current failover proxy provider.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13101) Yet another fsimage corruption related to snapshot

2019-01-09 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738485#comment-16738485
 ] 

Yongjun Zhang commented on HDFS-13101:
--

Hi All, Thanks for continuing to drive the issue. Sorry I missed 
[~jojochuang]'s request to reassign to [~smeng] earlier. Just did it. Since 
[~adam.antal] is contributing, you guys can co-contribute to this jira. Thanks.

> Yet another fsimage corruption related to snapshot
> --
>
> Key: HDFS-13101
> URL: https://issues.apache.org/jira/browse/HDFS-13101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Siyao Meng
>Priority: Major
> Attachments: HDFS-13101.001.patch
>
>
> Lately we saw a case similar to HDFS-9406; even though the HDFS-9406 fix is 
> present, it's likely another case not covered by the fix. We are currently 
> trying to collect a good fsimage + editlogs to replay to reproduce it and 
> investigate. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-13101) Yet another fsimage corruption related to snapshot

2019-01-09 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang reassigned HDFS-13101:


Assignee: Siyao Meng  (was: Yongjun Zhang)

> Yet another fsimage corruption related to snapshot
> --
>
> Key: HDFS-13101
> URL: https://issues.apache.org/jira/browse/HDFS-13101
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Siyao Meng
>Priority: Major
> Attachments: HDFS-13101.001.patch
>
>
> Lately we saw a case similar to HDFS-9406; even though the HDFS-9406 fix is 
> present, it's likely another case not covered by the fix. We are currently 
> trying to collect a good fsimage + editlogs to replay to reproduce it and 
> investigate. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14015) Improve error handling in hdfsThreadDestructor in native thread local storage

2018-11-02 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16673722#comment-16673722
 ] 

Yongjun Zhang commented on HDFS-14015:
--

Thanks [~templedf], good work! 

It seems good to me, but a couple of things:

1. In addition to getName(), I hope to get the thread id. See code from class 
Thread:
{code}
public String toString() {
    ThreadGroup group = getThreadGroup();
    if (group != null) {
        return "Thread[" + getName() + "," + getPriority() + "," +
               group.getName() + "]";
    } else {
        return "Thread[" + getName() + "," + getPriority() + "," +
               "" + "]";
    }
}
{code}
and
{code}
public long getId() {
    return tid;
}
{code}

2. Suggest defining 256 as a constant.

3. I wish we could test the output, but I know it's not easy to reproduce, thus 
not easy to test. I wonder if it can be tested with an OOM error within the 
Java world.

Thanks.
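
For reference, a tiny plain-Java sketch of the extra detail being asked for 
(the native code would fetch the same fields through JNI; the message text is 
illustrative):

{code:java}
public class ThreadInfoSketch {
  public static void main(String[] args) {
    Thread t = Thread.currentThread();
    // Report both the name and the id so a stuck thread can be identified,
    // e.g. "Unable to detach thread main (id=1)".
    System.err.println("Unable to detach thread " + t.getName()
        + " (id=" + t.getId() + ")");
  }
}
{code}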



> Improve error handling in hdfsThreadDestructor in native thread local storage
> -
>
> Key: HDFS-14015
> URL: https://issues.apache.org/jira/browse/HDFS-14015
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: native
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: HDFS-14015.001.patch, HDFS-14015.002.patch, 
> HDFS-14015.003.patch, HDFS-14015.004.patch, HDFS-14015.005.patch, 
> HDFS-14015.006.patch, HDFS-14015.007.patch
>
>
> In the hdfsThreadDestructor() function, we ignore the return value from the 
> DetachCurrentThread() call.  We are seeing cases where a native thread dies 
> while holding a JVM monitor, and it doesn't release the monitor.  We're 
> hoping that logging this error instead of ignoring it will shed some light on 
> the issue.  In any case, it's good programming practice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14015) Improve error handling in hdfsThreadDestructor in native thread local storage

2018-11-01 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672275#comment-16672275
 ] 

Yongjun Zhang commented on HDFS-14015:
--

FYI [~templedf],

Thinking that we could even try to print out the jstack when 
DetachCurrentThread is called, I found the following:

[http://barbie.uta.edu/~jli/Resources/Resource%20Provisoning%20Evaluation/85.pdf]
{code:c}
detach:
  if ((*env)->ExceptionOccurred(env)) {
    (*env)->ExceptionDescribe(env);
  }
  (*jvm)->DetachCurrentThread(jvm);
}
{code}
This example code has the same problem of not checking the return status of 
DetachCurrentThread, but it tries to print the jstack if an exception occurred 
before DetachCurrentThread is called.

[https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html]

{quote}

ExceptionDescribe

{{void ExceptionDescribe(JNIEnv *env);}}

Prints an exception and a backtrace of the stack to a system error-reporting 
channel, such as {{stderr}}. This is a convenience routine provided for 
debugging.

{quote}

So I think we can do something similar in the patch, in addition to checking 
the return value of DetachCurrentThread. And I think we can do something like 
this:
{code:c}
..
if ((*env)->ExceptionOccurred(env)) {
  fprintf(stderr, "hdfsThreadDestructor: an exception occurred"
      " before DetachCurrentThread was called\n");
  (*env)->ExceptionDescribe(env);
}
jint ret = (*jvm)->DetachCurrentThread(jvm);
if (ret) {
  fprintf(stderr, "hdfsThreadDestructor: DetachCurrentThread failed with"
      " error %d\n", ret);
  if ((*env)->ExceptionOccurred(env)) {
    fprintf(stderr, "hdfsThreadDestructor: DetachCurrentThread failed with"
        " error %d, and an exception occurred.\n", ret);
    (*env)->ExceptionDescribe(env);
  }
}
{code}
Thanks.

 

> Improve error handling in hdfsThreadDestructor in native thread local storage
> -
>
> Key: HDFS-14015
> URL: https://issues.apache.org/jira/browse/HDFS-14015
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: native
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: HDFS-14015.001.patch, HDFS-14015.002.patch, 
> HDFS-14015.003.patch, HDFS-14015.004.patch, HDFS-14015.005.patch, 
> HDFS-14015.006.patch
>
>
> In the hdfsThreadDestructor() function, we ignore the return value from the 
> DetachCurrentThread() call.  We are seeing cases where a native thread dies 
> while holding a JVM monitor, and it doesn't release the monitor.  We're 
> hoping that logging this error instead of ignoring it will shed some light on 
> the issue.  In any case, it's good programming practice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-14015) Improve error handling in hdfsThreadDestructor in native thread local storage

2018-10-29 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-14015:
-
Comment: was deleted

(was: I don't see an API provided by JNI to get the current thread id, but I 
saw one here: 

[https://stackoverflow.com/questions/11224394/obtaining-the-thread-id-for-java-threads-in-linux]

If it's too much hassle to include in this jira, please feel free to postpone 
that to a new jira.

Thanks.

 

 )

> Improve error handling in hdfsThreadDestructor in native thread local storage
> -
>
> Key: HDFS-14015
> URL: https://issues.apache.org/jira/browse/HDFS-14015
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: native
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: HDFS-14015.001.patch, HDFS-14015.002.patch, 
> HDFS-14015.003.patch, HDFS-14015.004.patch, HDFS-14015.005.patch, 
> HDFS-14015.006.patch
>
>
> In the hdfsThreadDestructor() function, we ignore the return value from the 
> DetachCurrentThread() call.  We are seeing cases where a native thread dies 
> while holding a JVM monitor, and it doesn't release the monitor.  We're 
> hoping that logging this error instead of ignoring it will shed some light on 
> the issue.  In any case, it's good programming practice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14015) Improve error handling in hdfsThreadDestructor in native thread local storage

2018-10-29 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667873#comment-16667873
 ] 

Yongjun Zhang commented on HDFS-14015:
--

I don't see an API provided by JNI to get the current thread id, but I saw one 
here: 

[https://stackoverflow.com/questions/11224394/obtaining-the-thread-id-for-java-threads-in-linux]

If it's too much hassle to include in this jira, please feel free to postpone 
that to a new jira.

Thanks.

 

 

> Improve error handling in hdfsThreadDestructor in native thread local storage
> -
>
> Key: HDFS-14015
> URL: https://issues.apache.org/jira/browse/HDFS-14015
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: native
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: HDFS-14015.001.patch, HDFS-14015.002.patch, 
> HDFS-14015.003.patch, HDFS-14015.004.patch, HDFS-14015.005.patch, 
> HDFS-14015.006.patch
>
>
> In the hdfsThreadDestructor() function, we ignore the return value from the 
> DetachCurrentThread() call.  We are seeing cases where a native thread dies 
> while holding a JVM monitor, and it doesn't release the monitor.  We're 
> hoping that logging this error instead of ignoring it will shed some light on 
> the issue.  In any case, it's good programming practice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14012) Add diag info in RetryInvocationHandler

2018-10-29 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667870#comment-16667870
 ] 

Yongjun Zhang commented on HDFS-14012:
--

Thanks [~dineshchitlangia], good to see that it's fixed in trunk. I must have 
been on an old branch.

Sorry for getting back late.

 

 

> Add diag info in RetryInvocationHandler
> ---
>
> Key: HDFS-14012
> URL: https://issues.apache.org/jira/browse/HDFS-14012
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Yongjun Zhang
>Assignee: Dinesh Chitlangia
>Priority: Major
>
> RetryInvocationHandler does the following logging:
> {code:java}
> } else { 
>   LOG.warn("A failover has occurred since the start of this method" + " 
> invocation attempt."); 
> }{code}
> Would be helpful to report the method name, and call stack in this message.
> Thanks.
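
One possible shape of that improvement (a sketch only, not the committed fix): 
pass the invoked method's name plus a throwable so the appender also prints the 
call stack:

{code:java}
import java.lang.reflect.Method;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FailoverLogSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(FailoverLogSketch.class);

  // Sketch: assumes the caller has the java.lang.reflect.Method in hand,
  // as RetryInvocationHandler#invoke does.
  static void warnFailover(Method method) {
    LOG.warn("A failover has occurred since the start of this method"
        + " invocation attempt: " + method.getDeclaringClass().getName()
        + "#" + method.getName(),
        new Exception("call stack of the retried invocation"));
  }
}
{code}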



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14015) Improve error handling in hdfsThreadDestructor in native thread local storage

2018-10-29 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667791#comment-16667791
 ] 

Yongjun Zhang commented on HDFS-14015:
--

Hi [~templedf],

Thanks for reporting and working on this issue. I took a quick look at the 
patch, and I think we should consider including some thread information when 
reporting the failure. For example, instead of saying "Unable to detach thread 
...", it's better to say "Unable to detach thread <tid> ...", where <tid> can 
be the thread id of the current thread. This would help us diagnose which 
thread has the problem.

Thanks.

> Improve error handling in hdfsThreadDestructor in native thread local storage
> -
>
> Key: HDFS-14015
> URL: https://issues.apache.org/jira/browse/HDFS-14015
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: native
>Affects Versions: 3.0.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>Priority: Major
> Attachments: HDFS-14015.001.patch, HDFS-14015.002.patch, 
> HDFS-14015.003.patch, HDFS-14015.004.patch, HDFS-14015.005.patch, 
> HDFS-14015.006.patch
>
>
> In the hdfsThreadDestructor() function, we ignore the return value from the 
> DetachCurrentThread() call.  We are seeing cases where a native thread dies 
> while holding a JVM monitor, and it doesn't release the monitor.  We're 
> hoping that logging this error instead of ignoring it will shed some light on 
> the issue.  In any case, it's good programming practice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14012) Add diag info in RetryInvocationHandler

2018-10-20 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-14012:
-
Environment: (was: RetryInvocationHandler does the following logging:
{code:java}
} else {
LOG.warn("A failover has occurred since the start of this method"
+ " invocation attempt.");
}{code}
Would be helpful to report the method name, and call stack in this message.

Thanks.)
Description: 
RetryInvocationHandler does the following logging:
{code:java}
} else { 
  LOG.warn("A failover has occurred since the start of this method" + " 
invocation attempt."); 
}{code}
Would be helpful to report the method name, and call stack in this message.

Thanks.

> Add diag info in RetryInvocationHandler
> ---
>
> Key: HDFS-14012
> URL: https://issues.apache.org/jira/browse/HDFS-14012
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Yongjun Zhang
>Priority: Major
>
> RetryInvocationHandler does the following logging:
> {code:java}
> } else { 
>   LOG.warn("A failover has occurred since the start of this method" + " 
> invocation attempt."); 
> }{code}
> Would be helpful to report the method name, and call stack in this message.
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14012) Add diag info in RetryInvocationHandler

2018-10-20 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HDFS-14012:


 Summary: Add diag info in RetryInvocationHandler
 Key: HDFS-14012
 URL: https://issues.apache.org/jira/browse/HDFS-14012
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client
 Environment: RetryInvocationHandler does the following logging:
{code:java}
} else {
LOG.warn("A failover has occurred since the start of this method"
+ " invocation attempt.");
}{code}
Would be helpful to report the method name, and call stack in this message.

Thanks.
Reporter: Yongjun Zhang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13031) To detect fsimage corruption on the spot

2018-08-10 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575833#comment-16575833
 ] 

Yongjun Zhang commented on HDFS-13031:
--

Thanks [~adam.antal] and [~smeng].

Good summary!

The OIV tool may do things differently than the NN itself, and using the NN to 
load the fsimage to verify it is the real full check of the fsimage (what I 
proposed in this jira). But I agree that, if feasible, adding --verify to OIV 
could detect the problems we have seen so far. Or we could even call it 
--detectcorruption.

That said, actions (quitting SNN etc) need to be taken after detecting fsimage 
corruption. I think HDFS-13314 and HDFS-13813 are good complementary solutions.

 

 

 

 

> To detect fsimage corruption on the spot
> 
>
> Key: HDFS-13031
> URL: https://issues.apache.org/jira/browse/HDFS-13031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
> Environment:  
>Reporter: Yongjun Zhang
>Assignee: Adam Antal
>Priority: Major
>
> Since we fixed HDFS-9406, new cases have been reported from the field where 
> similar fsimage corruption happens. We need a good fsimage + editlogs to 
> replay to reproduce the corruption. However, usually when the corruption is 
> detected (at a later NN restart), the good fsimage has already been deleted.
> We need a way to detect fsimage corruption on the spot. Currently what I 
> think we could do is:
>  # after SNN creates a new fsimage, it spawns a new modified NN process (NN 
> with some new command line args) to just load the fsimage and do nothing 
> else. 
>  # If the process fails, the currently running SNN will either a) back up 
> the fsimage + editlogs or b) no longer do checkpointing, and it needs to 
> somehow raise a flag to the user that the fsimage is corrupt.
> In step 2, if we do a, we need to introduce a new NN->JN API to back up 
> editlogs; if we do b, it changes SNN's behavior, which is kind of 
> incompatible. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13813) Exit NameNode when dangling child inode is detected when saving FsImage

2018-08-09 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575408#comment-16575408
 ] 

Yongjun Zhang commented on HDFS-13813:
--

Thanks [~smeng], good work here!

I did not review the patch, but have a general comment, the checking done in 
HDFS-13314, plus the checking done here, are for debugging. I hope we can have 
a follow-up Jira to use a configuration parameter to control the 
enabling/disabling of the checking, as an optimization. 
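
For example, the new check could be guarded roughly like this (the config key 
name below is made up for illustration, not an existing Hadoop property):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SaveImageCheckGuard {
  // Hypothetical key and default; a follow-up jira would pick real names.
  static final String CHECK_KEY = "dfs.image.save.dangling-inode-check";
  static final boolean CHECK_DEFAULT = true;

  static boolean shouldCheck(Configuration conf) {
    // Lets operators turn the dangling-child check off if the extra
    // lookups during checkpointing ever become a performance concern.
    return conf.getBoolean(CHECK_KEY, CHECK_DEFAULT);
  }
}
{code}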

 

 

> Exit NameNode when dangling child inode is detected when saving FsImage
> ---
>
> Key: HDFS-13813
> URL: https://issues.apache.org/jira/browse/HDFS-13813
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.0.3
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Attachments: HDFS-13813.001.patch
>
>
> Recently, the same stack trace as in -HDFS-9406- appeared again in the field. 
> The symptom of the problem is that *loadINodeDirectorySection()* can't find a 
> child inode in inodeMap by the node id in the children list of the directory. 
> The child inode could be missing or deleted.
> As of now we don't have a clear trace to reproduce the problem. Therefore, 
> I'm proposing this improvement to detect such corruption (data structure 
> inconsistency) when saving the FsImage, so that we can have the FsImage and 
> Edit Log to potentially reproduce the problem stably.
>  
> In a previous patch, HDFS-13314, [~arpitagarwal] did a great job catching 
> potential FsImage corruption in two cases. Further, this patch would detect 
> whether a child inode exists in the global FSDirectory dir when saving 
> (serializing) the INodeDirectorySection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-8131) Implement a space balanced block placement policy

2018-07-26 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558894#comment-16558894
 ] 

Yongjun Zhang commented on HDFS-8131:
-

I just read HDFS-4946, and I found it doesn't exactly do what I meant by 
comment #3 above.

HDFS-4946 introduced a config to disable/enable preferLocalDN; if disabled, the 
local DN will be skipped for all applications.  

Whereas when I wrote comment #3 above, I was thinking that when choosing the 
first DN, we could apply the same fix done here in HDFS-8131, such that we can 
choose either a local or a remote node for the first DN, instead of always 
picking the local DN.

Welcome to comment on this thought.

 

 

> Implement a space balanced block placement policy
> -
>
> Key: HDFS-8131
> URL: https://issues.apache.org/jira/browse/HDFS-8131
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Liu Shaohui
>Assignee: Liu Shaohui
>Priority: Minor
>  Labels: BlockPlacementPolicy
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-8131-branch-2.7.patch, HDFS-8131-v1.diff, 
> HDFS-8131-v2.diff, HDFS-8131-v3.diff, HDFS-8131.004.patch, 
> HDFS-8131.005.patch, HDFS-8131.006.patch, balanced.png
>
>
> The default block placement policy chooses datanodes for new blocks randomly, 
> which results in unbalanced space usage among datanodes after a cluster 
> expansion. The old datanodes are always at a high used percentage of space 
> while the newly added ones are at a low percentage.
> Though we can use the external balancer tool to balance the space usage, it 
> costs extra network IO and it's not easy to control the balancing speed.
> An easy solution is to implement a balanced block placement policy which 
> chooses datanodes at a low used percentage for new blocks with a slightly 
> higher probability. Before long, the used percentage across datanodes will 
> trend toward balance.
> Suggestions and discussions are welcome. Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-8131) Implement a space balanced block placement policy

2018-07-26 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558746#comment-16558746
 ] 

Yongjun Zhang commented on HDFS-8131:
-

Hm, just noticed HDFS-4946 for my comment #3 above. 

Thanks.

> Implement a space balanced block placement policy
> -
>
> Key: HDFS-8131
> URL: https://issues.apache.org/jira/browse/HDFS-8131
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Liu Shaohui
>Assignee: Liu Shaohui
>Priority: Minor
>  Labels: BlockPlacementPolicy
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-8131-branch-2.7.patch, HDFS-8131-v1.diff, 
> HDFS-8131-v2.diff, HDFS-8131-v3.diff, HDFS-8131.004.patch, 
> HDFS-8131.005.patch, HDFS-8131.006.patch, balanced.png
>
>
> The default block placement policy chooses datanodes for new blocks randomly, 
> which results in unbalanced space usage among datanodes after a cluster 
> expansion. The old datanodes are always at a high used percentage of space 
> while the newly added ones are at a low percentage.
> Though we can use the external balancer tool to balance the space usage, it 
> costs extra network IO and it's not easy to control the balancing speed.
> An easy solution is to implement a balanced block placement policy which 
> chooses datanodes at a low used percentage for new blocks with a slightly 
> higher probability. Before long, the used percentage across datanodes will 
> trend toward balance.
> Suggestions and discussions are welcome. Thanks
> Suggestions and discussions are welcomed. Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-8131) Implement a space balanced block placement policy

2018-07-26 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558739#comment-16558739
 ] 

Yongjun Zhang commented on HDFS-8131:
-

HI [~liushaohui],

Thanks much for the nice work here.

I have some comments.

1. This jira is described as an "improvement" rather than a new feature; it 
should be a new feature and be documented. 

2. A question related to the question [~Tagar] asked above:

https://issues.apache.org/jira/browse/HDFS-8131?focusedCommentId=15981732=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15981732

Class AvailableSpaceBlockPlacementPolicy extends BlockPlacementPolicyDefault. 
But it doesn't change the behavior of choosing the first node in 
BlockPlacementPolicyDefault, so even with this new feature, the local DN is 
always chosen as the first DN (of course, when it is not excluded), and the new 
feature only changes the selection of the remaining two DNs. 

3. I wonder if we could have another placement policy that can choose a DN 
other than the local DN for the first node, so we don't always pick the local 
DN first.

Would you please share your thoughts?

Thanks.
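
For context, a minimal sketch of switching the NameNode to this policy (key 
names as commonly documented for the feature; treat this as a sketch to verify 
against your release rather than a tested configuration):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class AvailableSpacePolicySketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Plug in the space-balanced policy instead of the default one.
    conf.set("dfs.block.replicator.classname",
        "org.apache.hadoop.hdfs.server.blockmanagement."
            + "AvailableSpaceBlockPlacementPolicy");
    // Example value: > 0.5 biases placement toward low-usage datanodes.
    conf.set("dfs.namenode.available-space-block-placement-policy."
        + "balanced-space-preference-fraction", "0.6");
    System.out.println(conf.get("dfs.block.replicator.classname"));
  }
}
{code}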


> Implement a space balanced block placement policy
> -
>
> Key: HDFS-8131
> URL: https://issues.apache.org/jira/browse/HDFS-8131
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Liu Shaohui
>Assignee: Liu Shaohui
>Priority: Minor
>  Labels: BlockPlacementPolicy
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-8131-branch-2.7.patch, HDFS-8131-v1.diff, 
> HDFS-8131-v2.diff, HDFS-8131-v3.diff, HDFS-8131.004.patch, 
> HDFS-8131.005.patch, HDFS-8131.006.patch, balanced.png
>
>
> The default block placement policy chooses datanodes for new blocks randomly, 
> which results in unbalanced space usage among datanodes after a cluster 
> expansion. The old datanodes are always at a high used percentage of space 
> while the newly added ones are at a low percentage.
> Though we can use the external balancer tool to balance the space usage, it 
> costs extra network IO and it's not easy to control the balancing speed.
> An easy solution is to implement a balanced block placement policy which 
> chooses datanodes at a low used percentage for new blocks with a slightly 
> higher probability. Before long, the used percentage across datanodes will 
> trend toward balance.
> Suggestions and discussions are welcome. Thanks
> Suggestions and discussions are welcomed. Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-13663) Should throw exception when incorrect block size is set

2018-06-07 Thread Yongjun Zhang (JIRA)
Yongjun Zhang created HDFS-13663:


 Summary: Should throw exception when incorrect block size is set
 Key: HDFS-13663
 URL: https://issues.apache.org/jira/browse/HDFS-13663
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Yongjun Zhang


See

./hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java

{code}
void syncBlock(List<BlockRecord> syncList) throws IOException {
  ...
      newBlock.setNumBytes(finalizedLength);
      break;
    case RBW:
    case RWR:
      long minLength = Long.MAX_VALUE;
      for (BlockRecord r : syncList) {
        ReplicaState rState = r.rInfo.getOriginalReplicaState();
        if (rState == bestState) {
          minLength = Math.min(minLength, r.rInfo.getNumBytes());
          participatingList.add(r);
        }
        if (LOG.isDebugEnabled()) {
          LOG.debug("syncBlock replicaInfo: block=" + block +
              ", from datanode " + r.id + ", receivedState=" + rState.name() +
              ", receivedLength=" + r.rInfo.getNumBytes() + ", bestState=" +
              bestState.name());
        }
      }
      // recover() guarantees syncList will have at least one replica with RWR
      // or better state.
      assert minLength != Long.MAX_VALUE : "wrong minLength"; // <== should throw exception
      newBlock.setNumBytes(minLength);
      break;
    case RUR:
    case TEMPORARY:
      assert false : "bad replica state: " + bestState;
    default:
      break; // we have 'case' all enum values
  }
{code}

When minLength is Long.MAX_VALUE, it should throw an exception.

There might be other places like this.

Otherwise, we would see the following WARN in the datanode log:
{code}
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Can't replicate block xyz 
because on-disk length 11852203 is shorter than NameNode recorded length 
9223372036854775807
{code}
where 9223372036854775807 is Long.MAX_VALUE.
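
A minimal sketch of the suggested change (an illustration using the names in the 
snippet above, not a committed patch): since the assert is a no-op unless 
assertions are enabled, replace it with an explicit exception so the bad state 
also surfaces in production:

{code}
// Instead of: assert minLength != Long.MAX_VALUE : "wrong minLength";
if (minLength == Long.MAX_VALUE) {
  // No participating replica was found in the best state; fail loudly rather
  // than recording Long.MAX_VALUE as the recovered block length.
  throw new IOException("wrong minLength: no replica in state " + bestState
      + " found for block " + block + " during syncBlock");
}
newBlock.setNumBytes(minLength);
{code}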





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13632) Randomize baseDir for MiniJournalCluster in MiniQJMHACluster for TestDFSAdminWithHA

2018-05-31 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13632:
-
Fix Version/s: (was: 3.0.3)
   3.0.4

> Randomize baseDir for MiniJournalCluster in MiniQJMHACluster for 
> TestDFSAdminWithHA 
> 
>
> Key: HDFS-13632
> URL: https://issues.apache.org/jira/browse/HDFS-13632
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Anbang Hu
>Assignee: Anbang Hu
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.4
>
> Attachments: HDFS-13632-branch-2.000.patch, 
> HDFS-13632-branch-2.001.patch, HDFS-13632-branch-2.004.patch, 
> HDFS-13632.000.patch, HDFS-13632.001.patch, HDFS-13632.002.patch, 
> HDFS-13632.003.patch, HDFS-13632.004.patch
>
>
> As [HDFS-13630|https://issues.apache.org/jira/browse/HDFS-13630] indicates, 
> testUpgradeCommand keeps the journalnode directory from being released, which 
> fails all subsequent tests that try to use the same path.
> Randomizing the baseDir for MiniJournalCluster in MiniQJMHACluster for 
> TestDFSAdminWithHA can isolate effects of tests from each other.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13591) TestDFSShell#testSetrepLow fails on Windows

2018-05-30 Thread Yongjun Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13591:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> TestDFSShell#testSetrepLow fails on Windows
> ---
>
> Key: HDFS-13591
> URL: https://issues.apache.org/jira/browse/HDFS-13591
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Anbang Hu
>Assignee: Anbang Hu
>Priority: Minor
>  Labels: Windows
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13591.000.patch, HDFS-13591.001.patch, 
> HDFS-13591.002.patch
>
>
> https://builds.apache.org/job/hadoop-trunk-win/469/testReport/org.apache.hadoop.hdfs/TestDFSShell/testSetrepLow/
>  shows
> {code:java}
> Error message is not the expected error message 
> expected:<...testFileForSetrepLow[]
> > but was:<...testFileForSetrepLow[
> ]
> >
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13068) RBF: Add router admin option to manage safe mode

2018-05-26 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13068:
-
Fix Version/s: (was: 3.0.1)
   3.0.3

> RBF: Add router admin option to manage safe mode
> 
>
> Key: HDFS-13068
> URL: https://issues.apache.org/jira/browse/HDFS-13068
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: Yiqun Lin
>Priority: Major
>  Labels: RBF
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.3
>
> Attachments: HDFS-13068-branch-2.001.patch, 
> HDFS-13068-branch-3.0.001.patch, HDFS-13068.001.patch, HDFS-13068.002.patch, 
> HDFS-13068.003.patch
>
>
> HDFS-13044 adds a safe mode to reject requests. We should have an option to 
> manually set the Router into safe mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13115) In getNumUnderConstructionBlocks(), ignore the inodeIds for which the inodes have been deleted

2018-05-26 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13115:
-
Fix Version/s: (was: 3.0.1)
   3.0.3

> In getNumUnderConstructionBlocks(), ignore the inodeIds for which the inodes 
> have been deleted 
> ---
>
> Key: HDFS-13115
> URL: https://issues.apache.org/jira/browse/HDFS-13115
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 3.0.3
>
> Attachments: HDFS-13115.001.patch, HDFS-13115.002.patch
>
>
> In LeaseManager, 
> {code}
>   private synchronized INode[] getINodesWithLease() {
>     List<INode> inodes = new ArrayList<>(leasesById.size());
>     INode currentINode;
>     for (long inodeId : leasesById.keySet()) {
>       currentINode = fsnamesystem.getFSDirectory().getInode(inodeId);
>       // A file with an active lease could get deleted, or its
>       // parent directories could get recursively deleted.
>       if (currentINode != null &&
>           currentINode.isFile() &&
>           !fsnamesystem.isFileDeleted(currentINode.asFile())) {
>         inodes.add(currentINode);
>       }
>     }
>     return inodes.toArray(new INode[0]);
>   }
> {code}
> we can see that, given an {{inodeId}}, 
> {{fsnamesystem.getFSDirectory().getInode(inodeId)}} could return NULL. The 
> reason is explained in the comment.
> HDFS-12985 root-caused and solved one case; we saw that it fixes some 
> cases, but we are still seeing a NullPointerException from FSNamesystem:
> {code}
>   public long getCompleteBlocksTotal() {
>     // Calculate number of blocks under construction
>     long numUCBlocks = 0;
>     readLock();
>     try {
>       numUCBlocks = leaseManager.getNumUnderConstructionBlocks(); // <=== here
>       return getBlocksTotal() - numUCBlocks;
>     } finally {
>       readUnlock();
>     }
>   }
> {code}
> The exception happens when the inode has been removed for the given inodeId; 
> see the LeaseManager code below:
> {code}
>   synchronized long getNumUnderConstructionBlocks() {
>     assert this.fsnamesystem.hasReadLock() : "The FSNamesystem read lock wasn't"
>         + "acquired before counting under construction blocks";
>     long numUCBlocks = 0;
>     for (Long id : getINodeIdWithLeases()) {
>       final INodeFile cons =
>           fsnamesystem.getFSDirectory().getInode(id).asFile(); // <=== here
>       Preconditions.checkState(cons.isUnderConstruction());
>       BlockInfo[] blocks = cons.getBlocks();
>       if (blocks == null)
>         continue;
>       for (BlockInfo b : blocks) {
>         if (!b.isComplete())
>           numUCBlocks++;
>       }
>     }
>     LOG.info("Number of blocks under construction: " + numUCBlocks);
>     return numUCBlocks;
>   }
> {code}
> Creating this jira to add a check for whether the inode has been removed, as a 
> safeguard, to avoid the NullPointerException.
> It looks like after the inodeId is returned by {{getINodeIdWithLeases()}}, it 
> gets deleted from the FSDirectory map.
> Ideally we should find out who deleted it, like in HDFS-12985. 
> But it seems reasonable to me to have a safeguard here, like other code that 
> calls {{fsnamesystem.getFSDirectory().getInode(id)}} in the code base.
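
A hedged sketch of such a safeguard in {{getNumUnderConstructionBlocks()}} 
(illustration only, using the names from the snippets above; not the committed 
patch):

{code}
for (Long id : getINodeIdWithLeases()) {
  final INode inode = fsnamesystem.getFSDirectory().getInode(id);
  if (inode == null) {
    // The inode was deleted after its id was collected; skip it instead of
    // triggering a NullPointerException on inode.asFile().
    continue;
  }
  final INodeFile cons = inode.asFile();
  // ... count incomplete blocks as before ...
}
{code}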



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13164) File not closed if streamer fail with DSQuotaExceededException

2018-05-26 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13164:
-
Fix Version/s: (was: 3.0.1)
   3.0.3

> File not closed if streamer fail with DSQuotaExceededException
> --
>
> Key: HDFS-13164
> URL: https://issues.apache.org/jira/browse/HDFS-13164
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 3.0.3
>
> Attachments: HDFS-13164.01.patch, HDFS-13164.02.patch, 
> HDFS-13164.branch-2.8.01.patch
>
>
>  This is found during yarn log aggregation but theoretically could happen to 
> any client.
> If the dir's space quota is exceeded, the following would happen when a file 
> is created:
>  - client {{startFile}} rpc to NN, gets a {{DFSOutputStream}}.
>  - writing to the stream would trigger the streamer to {{getAdditionalBlock}} 
> rpc to NN, which would get the DSQuotaExceededException
>  - client closes the stream
>   
>  The fact that this would leave a 0-sized (or whatever size is left in the 
> quota) file in HDFS is beyond the scope of this jira. However, the file would 
> be left in openforwrite status (shown in {{fsck -openforwrite}}) at least, 
> and could potentially leak the leaseRenewer too.
> This is because in the close implementation,
>  # {{isClosed}} is first checked, and the close call will be a no-op if 
> {{isClosed == true}}.
>  # {{flushInternal}} checks {{isClosed}}, and throws the exception right away 
> if true
> {{isClosed}} does this: {{return closed || getStreamer().streamerClosed;}}
> When the disk quota is reached, {{getAdditionalBlock}} will throw when the 
> streamer calls it. Because the streamer runs in a separate thread, at the time 
> the client calls close on the stream, the streamer may or may not have 
> hit the quota exception. If it has, then due to #1 the close call on the 
> stream will be a no-op. If it hasn't, then due to #2 the {{completeFile}} logic 
> will be skipped.
> {code:java}
> protected synchronized void closeImpl() throws IOException {
>   if (isClosed()) {
>     IOException e = lastException.getAndSet(null);
>     if (e == null)
>       return;
>     else
>       throw e;
>   }
>   try {
>     flushBuffer(); // flush from all upper layers
>     ...
>     flushInternal(); // flush all data to Datanodes
>     // get last block before destroying the streamer
>     ExtendedBlock lastBlock = getStreamer().getBlock();
>     try (TraceScope ignored =
>         dfsClient.getTracer().newScope("completeFile")) {
>       completeFile(lastBlock);
>     }
>   } catch (ClosedChannelException ignored) {
>   } finally {
>     closeThreads(true);
>   }
> }
>  {code}
> Log snippets:
> {noformat}
> 2018-02-16 15:59:32,916 DEBUG org.apache.hadoop.hdfs.DFSClient: DataStreamer 
> Quota Exception
> org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: The DiskSpace quota 
> of /DIR is exceeded: quota = 200 B = 1.91 MB but diskspace consumed = 
> 404139552 B = 385.42 MB
> at 
> org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyDiskspaceQuota(DirectoryWithQuotaFeature.java:149)
> at 
> org.apache.hadoop.hdfs.server.namenode.DirectoryWithQuotaFeature.verifyQuota(DirectoryWithQuotaFeature.java:159)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyQuota(FSDirectory.java:2124)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.updateCount(FSDirectory.java:1991)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.updateCount(FSDirectory.java:1966)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addBlock(FSDirectory.java:463)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveAllocatedBlock(FSNamesystem.java:3896)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3484)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:686)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:217)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:506)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at 
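
A hedged sketch of a close() shape that avoids the trap described above (field 
and method names follow the snippet in the description; this is an illustration, 
not the committed HDFS-13164 patch): cleanup runs exactly once, and a streamer 
failure is rethrown instead of turning close() into a silent no-op.

{code:java}
protected synchronized void closeImplSketch() throws IOException {
  if (closed) {
    return;                       // only a genuine second close() is a no-op
  }
  closed = true;
  try {
    IOException streamerFailure = lastException.getAndSet(null);
    if (streamerFailure != null) {
      throw streamerFailure;      // surface the quota error to the caller
    }
    flushInternal();
    completeFile(getStreamer().getBlock());
  } finally {
    closeThreads(true);           // always release streamer resources
  }
}
{code}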

[jira] [Updated] (HDFS-13244) Add stack, conf, metrics links to utilities dropdown in NN webUI

2018-05-26 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13244:
-
Fix Version/s: (was: 3.0.1)
   3.0.3

> Add stack, conf, metrics links to utilities dropdown in NN webUI
> 
>
> Key: HDFS-13244
> URL: https://issues.apache.org/jira/browse/HDFS-13244
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
> Fix For: 3.1.0, 3.0.3
>
> Attachments: HDFS-13244.00.patch, Screen Shot 2018-03-07 at 11.28.27 
> AM.png
>
>
> Add stack, conf, metrics links to utilities dropdown in NN webUI 
> cc [~arpitagarwal] for suggesting this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13551) TestMiniDFSCluster#testClusterSetStorageCapacity does not shut down cluster

2018-05-26 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13551:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> TestMiniDFSCluster#testClusterSetStorageCapacity does not shut down cluster
> ---
>
> Key: HDFS-13551
> URL: https://issues.apache.org/jira/browse/HDFS-13551
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Anbang Hu
>Assignee: Anbang Hu
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13551.000.patch
>
>
> TestMiniDFSCluster#testClusterSetStorageCapacity not shutting down the 
> cluster properly leads to:
> {noformat}
> [INFO] Running org.apache.hadoop.hdfs.TestMiniDFSCluster
> [ERROR] Tests run: 7, Failures: 0, Errors: 3, Skipped: 1, Time elapsed: 136.409 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestMiniDFSCluster
> [ERROR] testClusterNoStorageTypeSetForDatanodes(org.apache.hadoop.hdfs.TestMiniDFSCluster)  Time elapsed: 0.034 s <<< ERROR!
> java.io.IOException: Could not fully delete E:\OSS\hadoop-branch-2\hadoop-hdfs-project\hadoop-hdfs\target\test\data\dfs\name1
>   at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:1047)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:883)
>   at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:514)
>   at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:473)
>   at org.apache.hadoop.hdfs.TestMiniDFSCluster.testClusterNoStorageTypeSetForDatanodes(TestMiniDFSCluster.java:255)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:379)
>   at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:340)
>   at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
>   at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:413)
> [ERROR] testClusterSetDatanodeDifferentStorageType(org.apache.hadoop.hdfs.TestMiniDFSCluster)  Time elapsed: 0.023 s <<< ERROR!
> {noformat}
> 

[jira] [Updated] (HDFS-13611) Unsafe use of Text as a ConcurrentHashMap key in PBHelperClient

2018-05-26 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13611:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> Unsafe use of Text as a ConcurrentHashMap key in PBHelperClient
> ---
>
> Key: HDFS-13611
> URL: https://issues.apache.org/jira/browse/HDFS-13611
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.0, 3.2.0, 3.1.1, 3.0.4
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3
>
> Attachments: HDFS-13611.001.patch, HDFS-13611.002.patch
>
>
> Follow on to HDFS-13601, a bug spotted by [~tlipcon]: since Text is mutable, 
> it's not safe to use as a hash map key.
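
A self-contained illustration (plain Java, not Hadoop code) of why a mutable key 
breaks a hash map: mutating the key after insertion changes its hashCode, so the 
entry sits in a bucket that no longer matches any lookup.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MutableKeyDemo {
  // A tiny stand-in for a mutable, hashable type like Text.
  static final class MutableKey {
    String value;
    MutableKey(String v) { value = v; }
    @Override public int hashCode() { return value.hashCode(); }
    @Override public boolean equals(Object o) {
      return o instanceof MutableKey && ((MutableKey) o).value.equals(value);
    }
  }

  public static void main(String[] args) {
    Map<MutableKey, String> map = new ConcurrentHashMap<>();
    MutableKey k = new MutableKey("token-kind");
    map.put(k, "cached");
    k.value = "renamed";  // mutate the key while it is in the map
    // Both lookups miss: the stored entry lives in the bucket for the old hash.
    System.out.println(map.get(new MutableKey("token-kind")));  // null
    System.out.println(map.get(k));                             // null: entry is effectively lost
  }
}
{code}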



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12813) RequestHedgingProxyProvider can hide Exception thrown from the Namenode for proxy size of 1

2018-05-19 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481730#comment-16481730
 ] 

Yongjun Zhang commented on HDFS-12813:
--

Thank you guys for the work here. I cherry-picked it to branch-3.0, which 
currently targets 3.0.3.


> RequestHedgingProxyProvider can hide Exception thrown from the Namenode for 
> proxy size of 1
> ---
>
> Key: HDFS-12813
> URL: https://issues.apache.org/jira/browse/HDFS-12813
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 3.0.3
>
> Attachments: HDFS-12813.001.patch, HDFS-12813.002.patch, 
> HDFS-12813.003.patch, HDFS-12813.004.patch
>
>
> HDFS-11395 fixed the problem where the MultiException thrown by 
> RequestHedgingProxyProvider was hidden. However, when the target proxy size is 
> 1, unwrapping is not done for the InvocationTargetException. For a target 
> proxy size of 1, the unwrapping should be done to the first level, whereas for 
> multiple proxies it should be done at 2 levels.
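
A hedged sketch of the unwrapping rule described above (an illustrative helper, 
not the committed patch): unwrap one level of cause for a single target proxy, 
and two levels when multiple proxies were hedged.

{code:java}
// Illustrative only: the real fix lives in RequestHedgingProxyProvider.
static Throwable unwrap(Throwable thrown, int targetProxyCount) {
  // Level 1: peel the InvocationTargetException wrapper.
  Throwable cause = (thrown.getCause() != null) ? thrown.getCause() : thrown;
  // Level 2: with multiple proxies the real cause is nested one level deeper,
  // inside the MultiException-style wrapper.
  if (targetProxyCount > 1 && cause.getCause() != null) {
    cause = cause.getCause();
  }
  return cause;
}
{code}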



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12813) RequestHedgingProxyProvider can hide Exception thrown from the Namenode for proxy size of 1

2018-05-19 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-12813:
-
Fix Version/s: 3.0.3

> RequestHedgingProxyProvider can hide Exception thrown from the Namenode for 
> proxy size of 1
> ---
>
> Key: HDFS-12813
> URL: https://issues.apache.org/jira/browse/HDFS-12813
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 3.0.3
>
> Attachments: HDFS-12813.001.patch, HDFS-12813.002.patch, 
> HDFS-12813.003.patch, HDFS-12813.004.patch
>
>
> HDFS-11395 fixed the problem where the MultiException thrown by 
> RequestHedgingProxyProvider was hidden. However, when the target proxy size is 
> 1, unwrapping is not done for the InvocationTargetException. For a target 
> proxy size of 1, the unwrapping should be done to the first level, whereas for 
> multiple proxies it should be done at 2 levels.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13388) RequestHedgingProxyProvider calls multiple configured NNs all the time

2018-05-19 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481728#comment-16481728
 ] 

Yongjun Zhang commented on HDFS-13388:
--

Hi [~elgoiri],

Thanks for the reply. I just put both HDFS-12813 and HDFS-13388 into branch-3.0, 
which is the branch for the 3.0.3 release.


> RequestHedgingProxyProvider calls multiple configured NNs all the time
> --
>
> Key: HDFS-13388
> URL: https://issues.apache.org/jira/browse/HDFS-13388
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.3
>
> Attachments: HADOOP-13388.0001.patch, HADOOP-13388.0002.patch, 
> HADOOP-13388.0003.patch, HADOOP-13388.0004.patch, HADOOP-13388.0005.patch, 
> HADOOP-13388.0006.patch, HADOOP-13388.0007.patch, HADOOP-13388.0008.patch, 
> HADOOP-13388.0009.patch, HADOOP-13388.0010.patch, HADOOP-13388.0011.patch, 
> HADOOP-13388.0012.patch, HADOOP-13388.0013.patch, HADOOP-13388.0014.patch
>
>
> In HDFS-7858 RequestHedgingProxyProvider was designed to "first 
> simultaneously call multiple configured NNs to decide which is the active 
> Namenode and then for subsequent calls it will invoke the previously 
> successful NN." But the current code calls multiple configured NNs every 
> time, even when we have already found the successful NN. 
>  That's because in RetryInvocationHandler.java, ProxyDescriptor's member 
> proxyInfo is assigned only when it is constructed or when failover occurs. 
> RequestHedgingProxyProvider.currentUsedProxy is null in both cases, so the 
> only proxy we can get is always a dynamic proxy handled by 
> RequestHedgingInvocationHandler.class, which handles every invoked method 
> by calling multiple configured NNs.
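
A minimal sketch of the intended behavior (hypothetical names, simplified to 
sequential calls; not the Hadoop implementation): hedge across the configured 
NNs only until one succeeds, then pin subsequent calls to that proxy until a 
failure clears it.

{code:java}
import java.util.List;

// Illustrative only; the real logic lives in RequestHedgingProxyProvider /
// RequestHedgingInvocationHandler.
class HedgeUntilSuccess<P> {
  interface Call<P, R> { R apply(P proxy) throws Exception; }

  private final List<P> proxies;        // one proxy per configured NN
  private volatile P currentUsedProxy;  // the previously successful NN, if any

  HedgeUntilSuccess(List<P> proxies) { this.proxies = proxies; }

  <R> R invoke(Call<P, R> call) throws Exception {
    P pinned = currentUsedProxy;
    if (pinned != null) {
      try {
        return call.apply(pinned);      // fast path: no hedging
      } catch (Exception e) {
        currentUsedProxy = null;        // failover: fall back to hedging
      }
    }
    Exception last = null;
    for (P p : proxies) {               // the real code hedges these in parallel
      try {
        R result = call.apply(p);
        currentUsedProxy = p;           // pin the winner for subsequent calls
        return result;
      } catch (Exception e) {
        last = e;
      }
    }
    throw last != null ? last : new Exception("no configured NN answered");
  }
}
{code}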



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13388) RequestHedgingProxyProvider calls multiple configured NNs all the time

2018-05-19 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13388:
-
Fix Version/s: 3.0.3

> RequestHedgingProxyProvider calls multiple configured NNs all the time
> --
>
> Key: HDFS-13388
> URL: https://issues.apache.org/jira/browse/HDFS-13388
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.3
>
> Attachments: HADOOP-13388.0001.patch, HADOOP-13388.0002.patch, 
> HADOOP-13388.0003.patch, HADOOP-13388.0004.patch, HADOOP-13388.0005.patch, 
> HADOOP-13388.0006.patch, HADOOP-13388.0007.patch, HADOOP-13388.0008.patch, 
> HADOOP-13388.0009.patch, HADOOP-13388.0010.patch, HADOOP-13388.0011.patch, 
> HADOOP-13388.0012.patch, HADOOP-13388.0013.patch, HADOOP-13388.0014.patch
>
>
> In HDFS-7858 RequestHedgingProxyProvider was designed to "first 
> simultaneously call multiple configured NNs to decide which is the active 
> Namenode and then for subsequent calls it will invoke the previously 
> successful NN." But the current code calls multiple configured NNs every 
> time, even when we have already found the successful NN. 
>  That's because in RetryInvocationHandler.java, ProxyDescriptor's member 
> proxyInfo is assigned only when it is constructed or when failover occurs. 
> RequestHedgingProxyProvider.currentUsedProxy is null in both cases, so the 
> only proxy we can get is always a dynamic proxy handled by 
> RequestHedgingInvocationHandler.class, which handles every invoked method 
> by calling multiple configured NNs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-6489) DFS Used space is not correct computed on frequent append operations

2018-05-18 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480265#comment-16480265
 ] 

Yongjun Zhang commented on HDFS-6489:
-

Hi guys,

Thanks for working on this issue. I have gone through the discussions and would 
like to share my thoughts:

- The append operation tends to happen many times on the same block, which makes 
it easy to deny writes due to the incorrect DU estimation for appends.
- [~cheersyang]'s approach tries to remedy the situation by "interrupting the 
DURefreshThread and then evaluating the space again"; however, it would be too 
slow to help ongoing writes.
- [~raviprak]'s approach avoids counting the full block size in the DU increment 
when converting a complete block to RBW (append). This might underestimate the 
disk usage, cause oversubscription of a DN's capacity, and cause writes to fail. 
But fs.du.interval is 10 minutes by default, so the oversubscription (if any) can 
be corrected within 10 minutes (right, Ravi?), and the chance of this failure is 
relatively low. 

Sounds like we can go with Ravi's solution if the accounting is corrected every 
10 minutes, given that we don't have a perfect solution for this problem and 
DU is an estimation anyway.

What do you guys think?

Thanks.
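
To make the accounting difference concrete, here is a toy calculation (not 
DataNode code; the numbers are only illustrative) contrasting the two choices 
discussed above for frequent small appends:

{code:java}
long blockLen = 60L * 1024 * 1024; // existing finalized block: 60M
long appended = 10;                // bytes actually written by one append
int appends = 1000;                // concurrent appenders, one append each

// Behavior described in the report: reserve the full block length per append.
long duFullBlock = blockLen * appends;   // ~60G of phantom "DFS used"

// Delta-based accounting (Ravi's direction): count only the bytes written.
long duDelta = appended * appends;       // ~10K, matching real disk usage
{code}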

> DFS Used space is not correct computed on frequent append operations
> 
>
> Key: HDFS-6489
> URL: https://issues.apache.org/jira/browse/HDFS-6489
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.2.0, 2.7.1, 2.7.2
>Reporter: stanley shi
>Priority: Major
> Attachments: HDFS-6489.001.patch, HDFS-6489.002.patch, 
> HDFS-6489.003.patch, HDFS-6489.004.patch, HDFS-6489.005.patch, 
> HDFS-6489.006.patch, HDFS-6489.007.patch, HDFS6489.java
>
>
> The current implementation of the Datanode will increase the DFS used space 
> on each block write operation. This is correct in most scenarios (creating a 
> new file), but sometimes it behaves incorrectly (appending small data to a 
> large block).
> For example, I have a file with only one block (say, 60M). Then I try to 
> append to it very frequently, but each time I append only 10 bytes;
> then on each append, dfs used will be increased by the length of the 
> block (60M), not the actual data length (10 bytes).
> Consider a scenario where I use many clients to append concurrently to a large 
> number of files (1000+) and assume the block size is 32M (half of the default 
> value); then the dfs used will be increased by 1000*32M = 32G on each round of 
> appends to the files, but actually I only wrote 10K bytes. This will cause the 
> datanode to report insufficient disk space on data write.
> {quote}2014-06-04 15:27:34,719 INFO 
> org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock  
> BP-1649188734-10.37.7.142-1398844098971:blk_1073742834_45306 received 
> exception org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: 
> Insufficient space for appending to FinalizedReplica, blk_1073742834_45306, 
> FINALIZED{quote}
> But the actual disk usage:
> {quote}
> [root@hdsh143 ~]# df -h
> FilesystemSize  Used Avail Use% Mounted on
> /dev/sda3  16G  2.9G   13G  20% /
> tmpfs 1.9G   72K  1.9G   1% /dev/shm
> /dev/sda1  97M   32M   61M  35% /boot
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13388) RequestHedgingProxyProvider calls multiple configured NNs all the time

2018-05-17 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480073#comment-16480073
 ] 

Yongjun Zhang commented on HDFS-13388:
--

Hi [~elgoiri],

Would you please take a look at my comment above? I wonder what you think.

Thanks.

 

> RequestHedgingProxyProvider calls multiple configured NNs all the time
> --
>
> Key: HDFS-13388
> URL: https://issues.apache.org/jira/browse/HDFS-13388
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: HADOOP-13388.0001.patch, HADOOP-13388.0002.patch, 
> HADOOP-13388.0003.patch, HADOOP-13388.0004.patch, HADOOP-13388.0005.patch, 
> HADOOP-13388.0006.patch, HADOOP-13388.0007.patch, HADOOP-13388.0008.patch, 
> HADOOP-13388.0009.patch, HADOOP-13388.0010.patch, HADOOP-13388.0011.patch, 
> HADOOP-13388.0012.patch, HADOOP-13388.0013.patch, HADOOP-13388.0014.patch
>
>
> In HDFS-7858 RequestHedgingProxyProvider was designed to "first 
> simultaneously call multiple configured NNs to decide which is the active 
> Namenode and then for subsequent calls it will invoke the previously 
> successful NN." But the current code calls multiple configured NNs every 
> time, even when we have already found the successful NN. 
>  That's because in RetryInvocationHandler.java, ProxyDescriptor's member 
> proxyInfo is assigned only when it is constructed or when failover occurs. 
> RequestHedgingProxyProvider.currentUsedProxy is null in both cases, so the 
> only proxy we can get is always a dynamic proxy handled by 
> RequestHedgingInvocationHandler.class, which handles every invoked method 
> by calling multiple configured NNs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13388) RequestHedgingProxyProvider calls multiple configured NNs all the time

2018-05-10 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470764#comment-16470764
 ] 

Yongjun Zhang commented on HDFS-13388:
--

Hi [~elgoiri],

Thanks for staying on top of this. I have been doing Jira gardening and have not 
cut the branch for 3.0.3 yet. Since the Jira is in 3.0.2, it would be nice to 
have it in 3.0.3. Can the issue be fixed and put into 3.0.3?

Thanks.

 

> RequestHedgingProxyProvider calls multiple configured NNs all the time
> --
>
> Key: HDFS-13388
> URL: https://issues.apache.org/jira/browse/HDFS-13388
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: HADOOP-13388.0001.patch, HADOOP-13388.0002.patch, 
> HADOOP-13388.0003.patch, HADOOP-13388.0004.patch, HADOOP-13388.0005.patch, 
> HADOOP-13388.0006.patch, HADOOP-13388.0007.patch, HADOOP-13388.0008.patch, 
> HADOOP-13388.0009.patch, HADOOP-13388.0010.patch, HADOOP-13388.0011.patch, 
> HADOOP-13388.0012.patch, HADOOP-13388.0013.patch, HADOOP-13388.0014.patch
>
>
> In HDFS-7858 RequestHedgingProxyProvider was designed to "first 
> simultaneously call multiple configured NNs to decide which is the active 
> Namenode and then for subsequent calls it will invoke the previously 
> successful NN." But the current code calls multiple configured NNs every 
> time, even when we have already found the successful NN. 
>  That's because in RetryInvocationHandler.java, ProxyDescriptor's member 
> proxyInfo is assigned only when it is constructed or when failover occurs. 
> RequestHedgingProxyProvider.currentUsedProxy is null in both cases, so the 
> only proxy we can get is always a dynamic proxy handled by 
> RequestHedgingInvocationHandler.class, which handles every invoked method 
> by calling multiple configured NNs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13388) RequestHedgingProxyProvider calls multiple configured NNs all the time

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469299#comment-16469299
 ] 

Yongjun Zhang edited comment on HDFS-13388 at 5/9/18 10:10 PM:
---

Hi [~elgoiri],

This Jira was released in 3.0.2 and then got reverted from branch-3.0. It seems 
reasonable to get it into branch-3.0, which is now targeting 3.0.3. Would you 
please do so?

Thanks.


was (Author: yzhangal):
Hi [~elgoiri],

This Jira was released in 3.0.2 and then got reverted from branch-3.0. It seems 
reasonable to get it into branch-3.0, which is now targeting 3.0.3. Would you 
please do so?

Thanks.


> RequestHedgingProxyProvider calls multiple configured NNs all the time
> --
>
> Key: HDFS-13388
> URL: https://issues.apache.org/jira/browse/HDFS-13388
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.3
>
> Attachments: HADOOP-13388.0001.patch, HADOOP-13388.0002.patch, 
> HADOOP-13388.0003.patch, HADOOP-13388.0004.patch, HADOOP-13388.0005.patch, 
> HADOOP-13388.0006.patch, HADOOP-13388.0007.patch, HADOOP-13388.0008.patch, 
> HADOOP-13388.0009.patch, HADOOP-13388.0010.patch, HADOOP-13388.0011.patch, 
> HADOOP-13388.0012.patch, HADOOP-13388.0013.patch, HADOOP-13388.0014.patch
>
>
> In HDFS-7858 RequestHedgingProxyProvider was designed to "first 
> simultaneously call multiple configured NNs to decide which is the active 
> Namenode and then for subsequent calls it will invoke the previously 
> successful NN." But the current code calls multiple configured NNs every 
> time, even when we have already found the successful NN. 
>  That's because in RetryInvocationHandler.java, ProxyDescriptor's member 
> proxyInfo is assigned only when it is constructed or when failover occurs. 
> RequestHedgingProxyProvider.currentUsedProxy is null in both cases, so the 
> only proxy we can get is always a dynamic proxy handled by 
> RequestHedgingInvocationHandler.class, which handles every invoked method 
> by calling multiple configured NNs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13388) RequestHedgingProxyProvider calls multiple configured NNs all the time

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469583#comment-16469583
 ] 

Yongjun Zhang commented on HDFS-13388:
--

You're welcome, and thanks for taking care of that, [~elgoiri]. I see it in 
branch-3.0 now. 

 

> RequestHedgingProxyProvider calls multiple configured NNs all the time
> --
>
> Key: HDFS-13388
> URL: https://issues.apache.org/jira/browse/HDFS-13388
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.3
>
> Attachments: HADOOP-13388.0001.patch, HADOOP-13388.0002.patch, 
> HADOOP-13388.0003.patch, HADOOP-13388.0004.patch, HADOOP-13388.0005.patch, 
> HADOOP-13388.0006.patch, HADOOP-13388.0007.patch, HADOOP-13388.0008.patch, 
> HADOOP-13388.0009.patch, HADOOP-13388.0010.patch, HADOOP-13388.0011.patch, 
> HADOOP-13388.0012.patch, HADOOP-13388.0013.patch, HADOOP-13388.0014.patch
>
>
> In HDFS-7858 RequestHedgingProxyProvider was designed to "first 
> simultaneously call multiple configured NNs to decide which is the active 
> Namenode and then for subsequent calls it will invoke the previously 
> successful NN." But the current code calls multiple configured NNs every 
> time, even when we have already found the successful NN. 
>  That's because in RetryInvocationHandler.java, ProxyDescriptor's member 
> proxyInfo is assigned only when it is constructed or when failover occurs. 
> RequestHedgingProxyProvider.currentUsedProxy is null in both cases, so the 
> only proxy we can get is always a dynamic proxy handled by 
> RequestHedgingInvocationHandler.class, which handles every invoked method 
> by calling multiple configured NNs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13430) Fix TestEncryptionZonesWithKMS failure due to HADOOP-14445

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469573#comment-16469573
 ] 

Yongjun Zhang commented on HDFS-13430:
--

Thanks [~shahrs87].

 

> Fix TestEncryptionZonesWithKMS failure due to HADOOP-14445
> --
>
> Key: HDFS-13430
> URL: https://issues.apache.org/jira/browse/HDFS-13430
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HDFS-13430.01.patch
>
>
> Unfortunately HADOOP-14445 had an HDFS test failure that's not caught in the 
> hadoop-common precommit runs.
> This is caught by our internal pre-commit using dist-test, and appears to be 
> the only failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13434) RBF: Fix dead links in RBF document

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469463#comment-16469463
 ] 

Yongjun Zhang commented on HDFS-13434:
--

Hi [~elgoiri],

Thanks for working on this issue. Would you please update the fix versions 
accordingly?

 

> RBF: Fix dead links in RBF document
> ---
>
> Key: HDFS-13434
> URL: https://issues.apache.org/jira/browse/HDFS-13434
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Akira Ajisaka
>Assignee: Chetna Chaudhari
>Priority: Major
>  Labels: newbie
> Attachments: HDFS-13434.patch
>
>
> There are many dead links in 
> [http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html.]
>  Let's fix them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13430) Fix TestEncryptionZonesWithKMS failure due to HADOOP-14445

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469457#comment-16469457
 ] 

Yongjun Zhang commented on HDFS-13430:
--

Hi [~xiaochen] and [~shahrs87],

Thank you guys for working on HADOOP-14445 and this one here. Are all of them 
reverted from ALL branches?

 

 

> Fix TestEncryptionZonesWithKMS failure due to HADOOP-14445
> --
>
> Key: HDFS-13430
> URL: https://issues.apache.org/jira/browse/HDFS-13430
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HDFS-13430.01.patch
>
>
> Unfortunately HADOOP-14445 had an HDFS test failure that's not caught in the 
> hadoop-common precommit runs.
> This is caught by our internal pre-commit using dist-test, and appears to be 
> the only failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13388) RequestHedgingProxyProvider calls multiple configured NNs all the time

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469299#comment-16469299
 ] 

Yongjun Zhang commented on HDFS-13388:
--

Hi [~elgoiri],

This Jira was released in 3.0.2 and then got reverted from branch-3.0. It seems 
reasonable to get it into branch-3.0, which is now targeting 3.0.3. Would you 
please do so?

Thanks.

> RequestHedgingProxyProvider calls multiple configured NNs all the time
> --
>
> Key: HDFS-13388
> URL: https://issues.apache.org/jira/browse/HDFS-13388
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HADOOP-13388.0001.patch, HADOOP-13388.0002.patch, 
> HADOOP-13388.0003.patch, HADOOP-13388.0004.patch, HADOOP-13388.0005.patch, 
> HADOOP-13388.0006.patch, HADOOP-13388.0007.patch, HADOOP-13388.0008.patch, 
> HADOOP-13388.0009.patch, HADOOP-13388.0010.patch, HADOOP-13388.0011.patch, 
> HADOOP-13388.0012.patch, HADOOP-13388.0013.patch, HADOOP-13388.0014.patch
>
>
> In HDFS-7858 RequestHedgingProxyProvider was designed to "first 
> simultaneously call multiple configured NNs to decide which is the active 
> Namenode and then for subsequent calls it will invoke the previously 
> successful NN." But the current code calls multiple configured NNs every 
> time, even when we have already found the successful NN. 
>  That's because in RetryInvocationHandler.java, ProxyDescriptor's member 
> proxyInfo is assigned only when it is constructed or when failover occurs. 
> RequestHedgingProxyProvider.currentUsedProxy is null in both cases, so the 
> only proxy we can get is always a dynamic proxy handled by 
> RequestHedgingInvocationHandler.class, which handles every invoked method 
> by calling multiple configured NNs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13136) Avoid taking FSN lock while doing group member lookup for FSD permission check

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469290#comment-16469290
 ] 

Yongjun Zhang edited comment on HDFS-13136 at 5/9/18 6:50 PM:
--

Hi [~xyao],

Thanks for your work here. Could it be resolved since it's committed?

I see it's in branch-3.0, which will target 3.0.3.

Thanks.


was (Author: yzhangal):
Hi [~xyao],

Thanks for your work here. Could it be resolved since it's committed?

 

> Avoid taking FSN lock while doing group member lookup for FSD permission check
> --
>
> Key: HDFS-13136
> URL: https://issues.apache.org/jira/browse/HDFS-13136
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoyu Yao
>Assignee: Xiaoyu Yao
>Priority: Major
> Attachments: HDFS-13136-branch-3.0.001.patch, 
> HDFS-13136-branch-3.0.002.patch, HDFS-13136.001.patch, HDFS-13136.002.patch
>
>
> The Namenode has an FSN lock and an FSD lock. Most namenode operations need to 
> take the FSN lock first and then the FSD lock. The permission check is done via 
> FSPermissionChecker at the FSD layer, assuming the FSN lock is taken. 
> The FSPermissionChecker constructor invokes callerUgi.getGroups(), which can 
> sometimes take seconds. There are external cache schemes such as SSSD and 
> internal cache schemes for group lookup. However, the delay could still occur 
> during a cache refresh, which causes severe FSN lock contention and an 
> unresponsive namenode.
> Checking the current code, we found that getBlockLocations(..) did it right, 
> but some methods such as getFileInfo(..) and getContentSummary(..) did it wrong. 
> This ticket is opened to ensure the group lookup for the permission checker 
> happens outside the FSN lock. 
>  
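
A hedged sketch of the locking pattern the ticket asks for (method names follow 
the description above; illustration only, not the committed patch): perform the 
potentially slow group lookup before taking the FSN lock rather than inside it.

{code}
// Construct the FSPermissionChecker -- and thus trigger the potentially
// slow callerUgi.getGroups() lookup -- BEFORE acquiring the FSN read lock.
FSPermissionChecker pc = getPermissionChecker();
readLock();
try {
  // permission check and metadata reads proceed under the lock as usual
  ...
} finally {
  readUnlock();
}
{code}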



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13136) Avoid taking FSN lock while doing group member lookup for FSD permission check

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469290#comment-16469290
 ] 

Yongjun Zhang commented on HDFS-13136:
--

Hi [~xyao],

Thanks for your work here. Could it be resolved since it's committed?

 

> Avoid taking FSN lock while doing group member lookup for FSD permission check
> --
>
> Key: HDFS-13136
> URL: https://issues.apache.org/jira/browse/HDFS-13136
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoyu Yao
>Assignee: Xiaoyu Yao
>Priority: Major
> Attachments: HDFS-13136-branch-3.0.001.patch, 
> HDFS-13136-branch-3.0.002.patch, HDFS-13136.001.patch, HDFS-13136.002.patch
>
>
> The Namenode has an FSN lock and an FSD lock. Most namenode operations need to 
> take the FSN lock first and then the FSD lock. The permission check is done via 
> FSPermissionChecker at the FSD layer, assuming the FSN lock is taken. 
> The FSPermissionChecker constructor invokes callerUgi.getGroups(), which can 
> sometimes take seconds. There are external cache schemes such as SSSD and 
> internal cache schemes for group lookup. However, the delay could still occur 
> during a cache refresh, which causes severe FSN lock contention and an 
> unresponsive namenode.
> Checking the current code, we found that getBlockLocations(..) did it right, 
> but some methods such as getFileInfo(..) and getContentSummary(..) did it wrong. 
> This ticket is opened to ensure the group lookup for the permission checker 
> happens outside the FSN lock. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13062) Provide support for JN to use separate journal disk per namespace

2018-05-09 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13062:
-
Fix Version/s: (was: 3.0.1)
   3.0.3

> Provide support for JN to use separate journal disk per namespace
> -
>
> Key: HDFS-13062
> URL: https://issues.apache.org/jira/browse/HDFS-13062
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: federation, journal-node
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
> Fix For: 3.1.0, 3.0.3
>
> Attachments: HDFS-13062.00.patch, HDFS-13062.01.patch, 
> HDFS-13062.02.patch, HDFS-13062.03.patch, HDFS-13062.04.patch, 
> HDFS-13062.05.patch, HDFS-13062.06.patch
>
>
> In Federated HA setup, provide support for separate journal disk for each 
> namespace.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13048) LowRedundancyReplicatedBlocks metric can be negative

2018-05-09 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469281#comment-16469281
 ] 

Yongjun Zhang commented on HDFS-13048:
--

Hi [~ajisakaa],

FYI, I just updated the fix version to 3.0.3 from 3.0.1, since I don't see it 
in 3.0.1. Thanks.

 

> LowRedundancyReplicatedBlocks metric can be negative
> 
>
> Key: HDFS-13048
> URL: https://issues.apache.org/jira/browse/HDFS-13048
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 3.0.0
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
> Fix For: 3.1.0, 3.0.3
>
> Attachments: HDFS-13048-sample.patch, HDFS-13048.001.patch, 
> HDFS-13048.002.patch
>
>
> I'm seeing {{LowRedundancyReplicatedBlocks}} become negative. This should be 
> 0 or positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13048) LowRedundancyReplicatedBlocks metric can be negative

2018-05-09 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13048:
-
Fix Version/s: (was: 3.0.1)
   3.0.3

> LowRedundancyReplicatedBlocks metric can be negative
> 
>
> Key: HDFS-13048
> URL: https://issues.apache.org/jira/browse/HDFS-13048
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 3.0.0
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
> Fix For: 3.1.0, 3.0.3
>
> Attachments: HDFS-13048-sample.patch, HDFS-13048.001.patch, 
> HDFS-13048.002.patch
>
>
> I'm seeing {{LowRedundancyReplicatedBlocks}} become negative. This should be 
> 0 or positive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13315) Add a test for the issue reported in HDFS-11481 which is fixed by HDFS-10997.

2018-05-09 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13315:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Add a test for the issue reported in HDFS-11481 which is fixed by HDFS-10997.
> -
>
> Key: HDFS-13315
> URL: https://issues.apache.org/jira/browse/HDFS-13315
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Yongjun Zhang
>Assignee: Yongjun Zhang
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: HDFS-13315.001.patch, HDFS-13315.002.patch, 
> TEST-org.apache.hadoop.hdfs.TestEncryptionZones.xml
>
>
> HDFS-11481 reported that hdfs snapshotDiff /.reserved/raw/... fails on 
> snapshottable directories. It turns out that HDFS-10997 fixed the issue as a 
> byproduct. This jira is to add a test for the HDFS-11481 issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.

2018-05-09 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-10453:
-
Fix Version/s: (was: 3.0.1)
   3.0.3

> ReplicationMonitor thread could stuck for long time due to the race between 
> replication and delete of same file in a large cluster.
> ---
>
> Key: HDFS-10453
> URL: https://issues.apache.org/jira/browse/HDFS-10453
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.4.1, 2.5.2, 2.7.1, 2.6.4
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.3
>
> Attachments: HDFS-10453-branch-2.001.patch, 
> HDFS-10453-branch-2.003.patch, HDFS-10453-branch-2.7.004.patch, 
> HDFS-10453-branch-2.7.005.patch, HDFS-10453-branch-2.7.006.patch, 
> HDFS-10453-branch-2.7.007.patch, HDFS-10453-branch-2.7.008.patch, 
> HDFS-10453-branch-2.7.009.patch, HDFS-10453-branch-2.8.001.patch, 
> HDFS-10453-branch-2.8.002.patch, HDFS-10453-branch-2.9.001.patch, 
> HDFS-10453-branch-2.9.002.patch, HDFS-10453-branch-3.0.001.patch, 
> HDFS-10453-branch-3.0.002.patch, HDFS-10453-trunk.001.patch, 
> HDFS-10453-trunk.002.patch, HDFS-10453.001.patch
>
>
> The ReplicationMonitor thread could get stuck for a long time and lose data 
> with low probability. Consider the typical scenario:
> (1) create and close a file with the default replicas(3);
> (2) increase replication (to 10) of the file.
> (3) delete the file while ReplicationMonitor is scheduling blocks belong to 
> that file for replications.
> If the ReplicationMonitor gets stuck again, the NameNode will print logs like:
> {code:xml}
> 2016-04-19 10:20:48,083 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> ..
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) For more information, please enable DEBUG log level on 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough 
> replicas: expected size is 7 but only 0 storage types can be selected 
> (replication=10, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK, 
> DISK, DISK, DISK, DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2016-04-19 10:21:17,184 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in need of 7 to reach 10 
> (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, 
> newBlock=false) All required storage types are unavailable:  
> unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, 
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> This is because two threads (#NameNodeRpcServer and #ReplicationMonitor) 
> process the same block at the same moment:
> (1) ReplicationMonitor#computeReplicationWorkForBlocks gets blocks to 
> replicate and leaves the global lock.
> (2) FSNamesystem#delete is invoked to delete the blocks and clear the 
> references in the blocksmap, neededReplications, etc.; the block's numBytes 
> is set to NO_ACK (Long.MAX_VALUE), which indicates that the block deletion 
> does not need an explicit ACK from the node.
> (3) ReplicationMonitor#computeReplicationWorkForBlocks continues to 
> chooseTargets for the same blocks, and no node is selected even after 
> traversing the whole cluster, because no candidate satisfies the goodness 
> criterion (its remaining space would have to reach the required size of 
> Long.MAX_VALUE).
> During stage (3) the ReplicationMonitor is stuck for a long time, especially 
> in a large cluster; invalidateBlocks and neededReplications keep growing with 
> no consumers, and in the worst case data is lost.
> This can mostly be avoided by skipping chooseTarget for BlockCommand.NO_ACK 
> blocks and removing them from neededReplications, as sketched below.
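
A minimal, runnable sketch of that guard; the queue and Block type here are 
simplified stand-ins for the BlockManager internals, not the actual patch:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

/** Sketch only: skip blocks marked deleted (NO_ACK) during replication
 *  scheduling instead of running chooseTarget for them. */
public class SkipNoAckBlocks {
  static final long NO_ACK = Long.MAX_VALUE; // sentinel set on delete

  static class Block {
    final long id;
    volatile long numBytes;
    Block(long id, long numBytes) { this.id = id; this.numBytes = numBytes; }
  }

  public static void main(String[] args) {
    Queue<Block> neededReplications = new ArrayDeque<>();
    Block live = new Block(1, 128L << 20);
    Block deleted = new Block(2, 64L << 20);
    neededReplications.add(live);
    neededReplications.add(deleted);
    deleted.numBytes = NO_ACK; // a concurrent delete marks the block

    while (!neededReplications.isEmpty()) {
      Block b = neededReplications.poll();
      if (b.numBytes == NO_ACK) {
        // Deleted under us: no datanode can offer Long.MAX_VALUE of free
        // space, so traversing the cluster would always fail. Drop it.
        continue;
      }
      System.out.println("chooseTarget for block " + b.id);
    }
  }
}
{code}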




[jira] [Updated] (HDFS-10183) Prevent race condition during class initialization

2018-05-09 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-10183:
-
Fix Version/s: 3.0.3

> Prevent race condition during class initialization
> --
>
> Key: HDFS-10183
> URL: https://issues.apache.org/jira/browse/HDFS-10183
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.9.0
>Reporter: Pavel Avgustinov
>Assignee: Pavel Avgustinov
>Priority: Minor
> Fix For: 2.8.4, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HADOOP-12944.1.patch, HDFS-10183.2.patch
>
>
> In HADOOP-11969, [~busbey] tracked down a non-deterministic 
> {{NullPointerException}} to an oddity in the Java memory model: When multiple 
> threads trigger the loading of a class at the same time, one of them wins and 
> creates the {{java.lang.Class}} instance; the others block during this 
> initialization, but once it is complete they may obtain a reference to the 
> {{Class}} which has non-{{final}} fields still containing their default (i.e. 
> {{null}}) values. This leads to runtime failures that are hard to debug or 
> diagnose.
> HADOOP-11969 observed that {{ThreadLocal}} fields, by their very nature, are 
> very likely to be accessed from multiple threads, and thus the problem is 
> particularly severe there. Consequently, the patch removed all occurrences of 
> the issue in the code base.
> Unfortunately, since then HDFS-7964 has [reverted one of the fixes during a 
> refactoring|https://github.com/apache/hadoop/commit/2151716832ad14932dd65b1a4e47e64d8d6cd767#diff-0c2e9f7f9e685f38d1a11373b627cfa6R151],
>  and introduced a [new instance of the 
> problem|https://github.com/apache/hadoop/commit/2151716832ad14932dd65b1a4e47e64d8d6cd767#diff-6334d0df7d9aefbccd12b21bb7603169R43].
> The attached patch addresses the issue by adding the missing {{final}} 
> modifier in these two cases.
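
As a rough, self-contained illustration of the pattern behind the fix (a 
sketch under the report's assumptions, not the Hadoop code itself):

{code:java}
/** Sketch only: per the report above, a thread obtaining the Class reference
 *  may see a non-final static field at its default (null) value; declaring
 *  the field final rules that out. */
public class ThreadLocalHolder {
  // Without "final", another thread could observe BUFFER == null.
  private static final ThreadLocal<StringBuilder> BUFFER =
      ThreadLocal.withInitial(StringBuilder::new);

  public static StringBuilder get() {
    return BUFFER.get(); // safe: BUFFER is fully published
  }

  public static void main(String[] args) {
    get().append("ok");
    System.out.println(get()); // prints "ok"
  }
}
{code}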






[jira] [Updated] (HDFS-13428) RBF: Remove LinkedList From StateStoreFileImpl.java

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13428:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Remove LinkedList From StateStoreFileImpl.java
> ---
>
> Key: HDFS-13428
> URL: https://issues.apache.org/jira/browse/HDFS-13428
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation
>Affects Versions: 3.0.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13428.1.patch
>
>
> Replace the {{LinkedList}} with an {{ArrayList}} in the 
> StateStoreFileImpl class.  This is especially advantageous because we can 
> pre-allocate the internal array before any copying occurs.  {{ArrayList}} is 
> faster to iterate over and requires less memory than {{LinkedList}}.  The 
> current code is below; a sketch of the proposed change follows it.
> {code:java}
> protected List<String> getChildren(String path) {
>   List<String> ret = new LinkedList<>();
>   File dir = new File(path);
>   File[] files = dir.listFiles();
>   if (files != null) {
>     for (File file : files) {
>       String filename = file.getName();
>       ret.add(filename);
>     }
>   }
>   return ret;
> }
> {code}
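
A sketch of the proposed replacement (the surrounding class name is 
illustrative): sizing the ArrayList up front from the directory listing avoids 
both LinkedList node overhead and array re-growth.

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch only, not the committed patch. */
public class GetChildrenSketch {
  protected static List<String> getChildren(String path) {
    File[] files = new File(path).listFiles();
    if (files == null) {
      return Collections.emptyList();
    }
    // Pre-allocate to the exact size: no re-growth, cheap iteration.
    List<String> ret = new ArrayList<>(files.length);
    for (File file : files) {
      ret.add(file.getName());
    }
    return ret;
  }

  public static void main(String[] args) {
    System.out.println(getChildren("."));
  }
}
{code}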






[jira] [Updated] (HDFS-13508) RBF: Normalize paths (automatically) when adding, updating, removing or listing mount table entries

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13508:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Normalize paths (automatically) when adding, updating, removing or 
> listing mount table entries
> ---
>
> Key: HDFS-13508
> URL: https://issues.apache.org/jira/browse/HDFS-13508
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ekanth S
>Assignee: Ekanth S
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13508.001.patch, HDFS-13508.002.patch, 
> HDFS-13508.003.patch
>
>
> {noformat}
> me@gateway-hawaii-all:/mnt/host/bin$ hdfs dfsrouteradmin -ls /home/move
> Mount Table Entries:
> Source Destinations Owner Group Mode
> /home/move hdfs-oahu->/home/move me hadoop rwxr-xr-x
> me@gateway-hawaii-all:/mnt/host/bin$ hdfs dfsrouteradmin -ls /home/move/
> Mount Table Entries:
> Source Destinations Owner Group Mode
> me@gateway-hawaii-all:/mnt/host/bin$ hdfs dfsrouteradmin -rm /home/move/
> Cannot remove mount point /home/move/
> me@gateway-hawaii-all:/mnt/host/bin$ hdfs dfsrouteradmin -add /home/move/ 
> hdfs-oahu /home/move/ -readonly
> Cannot add mount point /home/move/
> {noformat}
> The trailing slash '/' should be normalized away before calling the API from 
> the CLI.
> Note: the add command fails with a terminating '/' when the entry already 
> exists (it compares the non-normalized value against the normalized value in 
> the mount table). Adding a new mount point with a trailing '/' works because 
> the CLI normalizes the mount path before calling the API.
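
A minimal sketch of the trailing-slash normalization the report asks for; this 
is illustrative only, the real change belongs in the Router admin RPC layer:

{code:java}
/** Sketch only: normalize mount paths before consulting the mount table. */
public class MountPathNormalizer {
  static String normalize(String path) {
    String p = path;
    // Strip trailing slashes but keep the root "/" intact.
    while (p.length() > 1 && p.endsWith("/")) {
      p = p.substring(0, p.length() - 1);
    }
    return p;
  }

  public static void main(String[] args) {
    System.out.println(normalize("/home/move/")); // -> /home/move
    System.out.println(normalize("/"));           // -> /
  }
}
{code}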






[jira] [Updated] (HDFS-13326) RBF: Improve the interfaces to modify and view mount tables

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13326:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Improve the interfaces to modify and view mount tables
> ---
>
> Key: HDFS-13326
> URL: https://issues.apache.org/jira/browse/HDFS-13326
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Wei Yan
>Assignee: Gang Li
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13326.000.patch, HDFS-13326.001.patch, 
> HDFS-13326.002.patch
>
>
> From the DFSRouterAdmin cmd, the update logic is currently implemented inside 
> the add operation, which has some limitations (e.g., it cannot update 
> "readonly" or remove a destination).  Given that the RPC layer already 
> separates the add and update operations, it would be better to do the same at 
> the cmd level.






[jira] [Updated] (HDFS-13499) RBF: Show disabled name services in the UI

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13499:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Show disabled name services in the UI
> --
>
> Key: HDFS-13499
> URL: https://issues.apache.org/jira/browse/HDFS-13499
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13499.000.patch, disabledUI.png
>
>
> HDFS-13484 exposes the disabled name services. This JIRA should show them in 
> the Web UI.






[jira] [Updated] (HDFS-13402) RBF: Fix java doc for StateStoreFileSystemImpl

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13402:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Fix  java doc for StateStoreFileSystemImpl
> ---
>
> Key: HDFS-13402
> URL: https://issues.apache.org/jira/browse/HDFS-13402
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Affects Versions: 3.0.0
>Reporter: Yiran Wu
>Assignee: Yiran Wu
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13402.001.patch, HDFS-13402.002.patch, 
> HDFS-13402.003.patch
>
>
> {code:java}
> /**
>  *StateStoreDriver}implementation based on a filesystem. The most common uses
>  * HDFS as a backend.
>  */
> {code}
> to
> {code:java}
> /**
>  * {@link StateStoreDriver} implementation based on a filesystem. The common
>  * implementation uses HDFS as a backend. The path can be specified setting
>  * dfs.federation.router.driver.fs.path=hdfs://host:port/path/to/store.
>  */
> {code}






[jira] [Updated] (HDFS-13045) RBF: Improve error message returned from subcluster

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13045:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Improve error message returned from subcluster
> ---
>
> Key: HDFS-13045
> URL: https://issues.apache.org/jira/browse/HDFS-13045
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Wei Yan
>Assignee: Íñigo Goiri
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13045.000.patch, HDFS-13045.001.patch, 
> HDFS-13045.002.patch, HDFS-13045.003.patch, HDFS-13045.004.patch
>
>
> Currently, the Router directly returns the exception response from the 
> subcluster to the client, which may not carry the correct error message, 
> especially when the message contains a path.
> For example, we have a mount path "/a/b" mapped to subclusterA's "/c/d". If 
> user1 does a chown operation on "/a/b" without the corresponding privilege, 
> the error message currently looks like "Permission denied. user=user1 is not 
> the owner of inode=/c/d", which may confuse the user. It would be better to 
> map the path back to the original mount path, as sketched below.
>  
>  
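
A sketch of the suggested message rewrite (the helper is hypothetical; the 
real fix has to locate the matching mount entry first):

{code:java}
/** Sketch only: map a subcluster path inside an error message back to the
 *  mount-table path the client actually used. */
public class ErrorPathRemapper {
  static String remap(String errorMsg, String mountPath, String destPath) {
    // e.g. mountPath = "/a/b", destPath = "/c/d"
    return errorMsg.replace("inode=" + destPath, "inode=" + mountPath);
  }

  public static void main(String[] args) {
    String msg = "Permission denied. user=user1 is not the owner of inode=/c/d";
    System.out.println(remap(msg, "/a/b", "/c/d"));
    // -> Permission denied. user=user1 is not the owner of inode=/a/b
  }
}
{code}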






[jira] [Updated] (HDFS-13384) RBF: Improve timeout RPC call mechanism

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13384:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Improve timeout RPC call mechanism
> ---
>
> Key: HDFS-13384
> URL: https://issues.apache.org/jira/browse/HDFS-13384
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13384.000.patch, HDFS-13384.001.patch, 
> HDFS-13384.002.patch, HDFS-13384.003.patch, HDFS-13384.004.patch
>
>
> When issuing RPC requests to subclusters, we have a timeout mechanism 
> introduced in HDFS-12273. We need to improve how this is handled.






[jira] [Updated] (HDFS-13410) RBF: Support federation with no subclusters

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13410:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Support federation with no subclusters
> ---
>
> Key: HDFS-13410
> URL: https://issues.apache.org/jira/browse/HDFS-13410
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.0
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13410.000.patch, HDFS-13410.001.patch, 
> HDFS-13410.002.patch
>
>
> If the federation has no subclusters, the logs show long stack traces. Even 
> though this is not a regular setup for RBF, we should log a concise message 
> instead (see the sketch after the example).
> An example:
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.LinkedList.checkElementIndex(LinkedList.java:555)
>   at java.util.LinkedList.get(LinkedList.java:476)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeConcurrent(RouterRpcClient.java:1028)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getDatanodeReport(RouterRpcServer.java:1264)
>   at 
> org.apache.hadoop.hdfs.server.federation.metrics.FederationMetrics.getNodeUsage(FederationMetrics.java:424)
> {code}
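
A sketch of the kind of guard the report asks for (names are illustrative): 
check for an empty result set before indexing into it, and log one concise 
line instead of letting an IndexOutOfBoundsException escape.

{code:java}
import java.util.Collections;
import java.util.List;

/** Sketch only: degrade gracefully when no subclusters are registered. */
public class NodeUsageSketch {
  static long firstDatanodeCapacity(List<Long> reports) {
    if (reports.isEmpty()) {
      // No subclusters: log one line rather than throwing from get(0).
      System.err.println("No namespaces registered; reporting 0 usage");
      return 0L;
    }
    return reports.get(0);
  }

  public static void main(String[] args) {
    System.out.println(firstDatanodeCapacity(Collections.emptyList()));
  }
}
{code}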






[jira] [Updated] (HDFS-13466) RBF: Add more router-related information to the UI

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13466:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Add more router-related information to the UI
> --
>
> Key: HDFS-13466
> URL: https://issues.apache.org/jira/browse/HDFS-13466
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13466.001.patch, pic.png
>
>
> Currently in NameNode UI, the Summary part also includes information:
> {noformat}
> Security is off.
> Safemode is off.
>  files and directories, * blocks =  total filesystem object(s).
> Heap Memory used  GB of  GB Heap Memory. Max Heap Memory is  GB.
> Non Heap Memory used  MB of  MB Commited Non Heap Memory. Max Non 
> Heap Memory is .
> {noformat}
> We could add similar information for the Router, for better visibility.






[jira] [Updated] (HDFS-13386) RBF: Wrong date information in list file(-ls) result

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13386:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Wrong date information in list file(-ls) result
> 
>
> Key: HDFS-13386
> URL: https://issues.apache.org/jira/browse/HDFS-13386
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Dibyendu Karmakar
>Assignee: Dibyendu Karmakar
>Priority: Minor
> Fix For: 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13386-002.patch, HDFS-13386-003.patch, 
> HDFS-13386-004.patch, HDFS-13386-005.patch, HDFS-13386-006.patch, 
> HDFS-13386-007.patch, HDFS-13386.000.patch, HDFS-13386.001.patch, 
> image-2018-04-03-11-59-51-623.png
>
>
> # hdfs dfs -ls 
> (see the attached screenshot image-2018-04-03-11-59-51-623.png)
> This is happening because getMountPointDates is not implemented:
> {code:java}
> private Map<String, Long> getMountPointDates(String path) {
>   Map<String, Long> ret = new TreeMap<>();
>   // TODO add when we have a Mount Table
>   return ret;
> }
> {code}
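
A sketch of what an implementation might look like once a mount table exists; 
the Map-based mount table here is a simplified stand-in for the Router's real 
types:

{code:java}
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: derive mount-point dates from mount-table entries keyed by
 *  source path, with values as modification timestamps. */
public class MountPointDatesSketch {
  static Map<String, Long> getMountPointDates(
      String path, Map<String, Long> mountTable) {
    Map<String, Long> ret = new TreeMap<>();
    String prefix = path.endsWith("/") ? path : path + "/";
    for (Map.Entry<String, Long> e : mountTable.entrySet()) {
      if (e.getKey().startsWith(prefix)) {
        // The first path component below "path" is the visible child.
        String child = e.getKey().substring(prefix.length()).split("/")[0];
        ret.merge(child, e.getValue(), Math::max);
      }
    }
    return ret;
  }

  public static void main(String[] args) {
    Map<String, Long> mt = new TreeMap<>();
    mt.put("/home/a", 100L);
    mt.put("/home/b/c", 200L);
    System.out.println(getMountPointDates("/home", mt)); // {a=100, b=200}
  }
}
{code}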






[jira] [Updated] (HDFS-13478) RBF: Disabled Nameservice store API

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13478:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Disabled Nameservice store API
> ---
>
> Key: HDFS-13478
> URL: https://issues.apache.org/jira/browse/HDFS-13478
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13478.000.patch, HDFS-13478.001.patch, 
> HDFS-13478.002.patch, HDFS-13478.003.patch, HDFS-13478.004.patch, 
> HDFS-13478.005.patch
>
>
> We have a subcluster in our federation that is for testing and is 
> misbehaving. This has a negative impact on the performance of operations 
> that go to every subcluster (e.g., renewLease() or setSafeMode()).






[jira] [Updated] (HDFS-13488) RBF: Reject requests when a Router is overloaded

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13488:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Reject requests when a Router is overloaded
> 
>
> Key: HDFS-13488
> URL: https://issues.apache.org/jira/browse/HDFS-13488
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13488.000.patch, HDFS-13488.001.patch, 
> HDFS-13488.002.patch, HDFS-13488.003.patch, HDFS-13488.004.patch
>
>
> A Router might be overloaded when handling special cases (e.g. a slow 
> subcluster). The Router could reject the requests and the client could try 
> with another Router. We should leverage the Standby mechanism for this. 






[jira] [Updated] (HDFS-13525) RBF: Add unit test TestStateStoreDisabledNameservice

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13525:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Add unit test TestStateStoreDisabledNameservice
> 
>
> Key: HDFS-13525
> URL: https://issues.apache.org/jira/browse/HDFS-13525
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13525.001.patch
>
>
> Add a unit test for the DisabledNameservice store.






[jira] [Updated] (HDFS-13503) Fix TestFsck test failures on Windows

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13503:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> Fix TestFsck test failures on Windows
> -
>
> Key: HDFS-13503
> URL: https://issues.apache.org/jira/browse/HDFS-13503
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: hdfs
>Reporter: Xiao Liang
>Assignee: Xiao Liang
>Priority: Major
>  Labels: windows
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13503-branch-2.000.patch, 
> HDFS-13503-branch-2.001.patch, HDFS-13503.000.patch, HDFS-13503.001.patch
>
>
> The test failures on Windows are caused by the same reason as HDFS-13336; a 
> similar fix is needed for TestFsck, based on HDFS-13408.
> MiniDFSCluster also needs a small fix for the getStorageDir() interface, 
> which should use determineDfsBaseDir() to get the correct path of the data 
> directory.






[jira] [Updated] (HDFS-13283) Percentage based Reserved Space Calculation for DataNode

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13283:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> Percentage based Reserved Space Calculation for DataNode
> 
>
> Key: HDFS-13283
> URL: https://issues.apache.org/jira/browse/HDFS-13283
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13283.000.patch, HDFS-13283.001.patch, 
> HDFS-13283.002.patch, HDFS-13283.003.patch, HDFS-13283.004.patch, 
> HDFS-13283.005.patch, HDFS-13283.006.patch, HDFS-13283.007.patch, 
> HDFS-13283_branch-2.000.patch, HDFS-13283_branch-3.0.000.patch
>
>
> Currently, the only way to configure reserved disk space for non-HDFS data on 
> a DataNode is a constant value via {{dfs.datanode.du.reserved}}. This can be 
> an issue in heterogeneous clusters where the sizes of DNs differ. The 
> proposed solution is to allow percentage-based configuration (and 
> combinations of the two; see the sketch below):
>  # ABSOLUTE
>  ** based on an absolute amount of reserved space
>  # PERCENTAGE
>  ** based on a percentage of the total capacity of the storage
>  # CONSERVATIVE
>  ** calculates both of the above and takes the one that yields more reserved 
> space
>  # AGGRESSIVE
>  ** calculates both of the above and takes the one that yields less reserved 
> space
>  
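
A compact sketch of the four calculation modes described above; the enum and 
method names are illustrative, not the committed DataNode code:

{code:java}
/** Sketch only: the four proposed reserved-space calculation modes. */
public enum ReservedSpacePolicy {
  ABSOLUTE, PERCENTAGE, CONSERVATIVE, AGGRESSIVE;

  long reservedBytes(long capacity, long absoluteBytes, double percent) {
    long byPercent = (long) (capacity * percent / 100.0);
    switch (this) {
      case ABSOLUTE:     return absoluteBytes;
      case PERCENTAGE:   return byPercent;
      case CONSERVATIVE: return Math.max(absoluteBytes, byPercent); // more
      case AGGRESSIVE:   return Math.min(absoluteBytes, byPercent); // less
      default:           throw new AssertionError(this);
    }
  }

  public static void main(String[] args) {
    long capacity = 10L * 1024 * 1024 * 1024; // 10 GB disk
    long absolute = 1L * 1024 * 1024 * 1024;  // 1 GB constant reservation
    for (ReservedSpacePolicy p : values()) {
      System.out.println(p + " -> " + p.reservedBytes(capacity, absolute, 5.0));
    }
  }
}
{code}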






[jira] [Updated] (HDFS-13484) RBF: Disable Nameservices from the federation

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13484:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Disable Nameservices from the federation
> -
>
> Key: HDFS-13484
> URL: https://issues.apache.org/jira/browse/HDFS-13484
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13484.000.patch, HDFS-13484.001.patch, 
> HDFS-13484.002.patch, HDFS-13484.003.patch, HDFS-13484.004.patch, 
> HDFS-13484.005.patch, HDFS-13484.006.patch, HDFS-13484.007.patch, 
> HDFS-13484.008.patch, HDFS-13484.009.patch
>
>
> HDFS-13478 introduced the Decommission store. We should disable access to 
> decommissioned subclusters.






[jira] [Updated] (HDFS-13509) Bug fix for breakHardlinks() of ReplicaInfo/LocalReplica, and fix TestFileAppend failures on Windows

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13509:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> Bug fix for breakHardlinks() of ReplicaInfo/LocalReplica, and fix 
> TestFileAppend failures on Windows
> 
>
> Key: HDFS-13509
> URL: https://issues.apache.org/jira/browse/HDFS-13509
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiao Liang
>Assignee: Xiao Liang
>Priority: Major
>  Labels: windows
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13509-branch-2.000.patch, HDFS-13509.000.patch, 
> HDFS-13509.001.patch, HDFS-13509.002.patch
>
>
> breakHardlinks() of ReplicaInfo (branch-2) / LocalReplica (trunk) replaces a 
> file while the source is still open as an input stream, which fails and 
> throws an exception on Windows. This is the cause of the failure of the unit 
> test case org.apache.hadoop.hdfs.TestFileAppend#testBreakHardlinksIfNeeded on 
> Windows.
> Other test cases of TestFileAppend fail randomly on Windows because they 
> share the same test folder; the solution is to use a randomized base dir for 
> MiniDFSCluster via HDFS-13408.






[jira] [Updated] (HDFS-13336) Test cases of TestWriteToReplica failed in windows

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13336:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> Test cases of TestWriteToReplica failed in windows
> --
>
> Key: HDFS-13336
> URL: https://issues.apache.org/jira/browse/HDFS-13336
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Xiao Liang
>Assignee: Xiao Liang
>Priority: Major
>  Labels: windows
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13336.000.patch, HDFS-13336.001.patch, 
> HDFS-13336.002.patch, HDFS-13336.003.patch
>
>
> Test cases of TestWriteToReplica failed on Windows with errors like:
> h4. Error Details
> Could not fully delete 
> F:\short\hadoop-trunk-win\s\hadoop-hdfs-project\hadoop-hdfs\target\test\data\1\dfs\name-0-1
> h4. Stack Trace
> java.io.IOException: Could not fully delete 
> F:\short\hadoop-trunk-win\s\hadoop-hdfs-project\hadoop-hdfs\target\test\data\1\dfs\name-0-1
>  at 
> org.apache.hadoop.hdfs.MiniDFSCluster.configureNameService(MiniDFSCluster.java:1011)
>  at 
> org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:932)
>  at 
> org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:864)
>  at org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:497) at 
> org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:456) 
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestWriteToReplica.testAppend(TestWriteToReplica.java:89)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:309) at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239)
>  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334)
>  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119) 
> at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)






[jira] [Commented] (HDFS-13380) RBF: mv/rm fail after the directory exceeded the quota limit

2018-05-08 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468310#comment-16468310
 ] 

Yongjun Zhang commented on HDFS-13380:
--

Hi [~elgoiri],

Thank you guys for working on this. I found that this Jira is not in 
branch-3.0, yet 3.0.4 is in its Fix Version/s. Would you please put it into 
branch-3.0 if that's intended?

Thanks.

 

> RBF: mv/rm fail after the directory exceeded the quota limit
> 
>
> Key: HDFS-13380
> URL: https://issues.apache.org/jira/browse/HDFS-13380
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Weiwei Wu
>Assignee: Yiqun Lin
>Priority: Major
> Fix For: 2.10.0, 3.2.0
>
> Attachments: HDFS-13380.001.patch, HDFS-13380.002.patch
>
>
> It always fails when I try to mv/rm a directory that has exceeded the 
> quota limit.
> {code:java}
> [hadp@hadoop]$ hdfs dfsrouteradmin -ls
> Mount Table Entries:
> Source Destinations Owner Group Mode Quota/Usage
> /ns10t ns10->/ns10t hadp hadp rwxr-xr-x [NsQuota: 1200/1201, SsQuota: -/-]
> [hadp@hadoop]$ hdfs dfs -rm hdfs://ns-fed/ns10t/ns1mountpoint/aa.99
> rm: Failed to move to trash: hdfs://ns-fed/ns10t/ns1mountpoint/aa.99: 
> The NameSpace quota (directories and files) is exceeded: quota=1200 file 
> count=1201
> [hadp@hadoop]$ hdfs dfs -rm -skipTrash 
> hdfs://ns-fed/ns10t/ns1mountpoint/aa.99
> rm: The NameSpace quota (directories and files) is exceeded: quota=1200 file 
> count=1201
> {code}
> I think we should add a parameter to the method *getLocationsForPath* to 
> determine whether quota verification needs to be performed for the operation, 
> e.g. for the mv source directory and the rm target directory; a sketch 
> follows below.
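
A self-contained sketch of the suggested flag; all names are hypothetical 
stand-ins for the Router's real resolution and quota code:

{code:java}
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Sketch only: optional quota verification during path resolution, so that
 *  rm/mv can still resolve a mount location that is over quota. */
public class LocationResolverSketch {
  static List<String> getLocationsForPath(String path, boolean verifyQuota)
      throws IOException {
    if (verifyQuota && isQuotaExceeded(path)) {
      throw new IOException("The NameSpace quota is exceeded for " + path);
    }
    return Arrays.asList("ns10->" + path); // stand-in for real resolution
  }

  static boolean isQuotaExceeded(String path) {
    return path.startsWith("/ns10t"); // pretend this tree is over quota
  }

  public static void main(String[] args) throws IOException {
    // rm/mv should resolve even when the tree is over quota:
    System.out.println(getLocationsForPath("/ns10t/file", false));
  }
}
{code}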






[jira] [Updated] (HDFS-13380) RBF: mv/rm fail after the directory exceeded the quota limit

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13380:
-
Fix Version/s: (was: 3.0.3)

> RBF: mv/rm fail after the directory exceeded the quota limit
> 
>
> Key: HDFS-13380
> URL: https://issues.apache.org/jira/browse/HDFS-13380
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Weiwei Wu
>Assignee: Yiqun Lin
>Priority: Major
> Fix For: 2.10.0, 3.2.0
>
> Attachments: HDFS-13380.001.patch, HDFS-13380.002.patch
>
>
> It always fails when I try to mv/rm a directory that has exceeded the 
> quota limit.
> {code:java}
> [hadp@hadoop]$ hdfs dfsrouteradmin -ls
> Mount Table Entries:
> Source Destinations Owner Group Mode Quota/Usage
> /ns10t ns10->/ns10t hadp hadp rwxr-xr-x [NsQuota: 1200/1201, SsQuota: -/-]
> [hadp@hadoop]$ hdfs dfs -rm hdfs://ns-fed/ns10t/ns1mountpoint/aa.99
> rm: Failed to move to trash: hdfs://ns-fed/ns10t/ns1mountpoint/aa.99: 
> The NameSpace quota (directories and files) is exceeded: quota=1200 file 
> count=1201
> [hadp@hadoop]$ hdfs dfs -rm -skipTrash 
> hdfs://ns-fed/ns10t/ns1mountpoint/aa.99
> rm: The NameSpace quota (directories and files) is exceeded: quota=1200 file 
> count=1201
> {code}
> I think we should add a parameter to the method *getLocationsForPath* to 
> determine whether quota verification needs to be performed for the operation, 
> e.g. for the mv source directory and the rm target directory.






[jira] [Updated] (HDFS-13490) RBF: Fix setSafeMode in the Router

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13490:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Fix setSafeMode in the Router
> --
>
> Key: HDFS-13490
> URL: https://issues.apache.org/jira/browse/HDFS-13490
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13490.000.patch, HDFS-13490.001.patch
>
>
> RouterRpcServer doesn't handle the isChecked parameter correctly when 
> forwarding setSafeMode to the namenodes.






[jira] [Updated] (HDFS-13380) RBF: mv/rm fail after the directory exceeded the quota limit

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13380:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: mv/rm fail after the directory exceeded the quota limit
> 
>
> Key: HDFS-13380
> URL: https://issues.apache.org/jira/browse/HDFS-13380
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Weiwei Wu
>Assignee: Yiqun Lin
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.0.3
>
> Attachments: HDFS-13380.001.patch, HDFS-13380.002.patch
>
>
> It always fails when I try to mv/rm a directory that has exceeded the 
> quota limit.
> {code:java}
> [hadp@hadoop]$ hdfs dfsrouteradmin -ls
> Mount Table Entries:
> Source Destinations Owner Group Mode Quota/Usage
> /ns10t ns10->/ns10t hadp hadp rwxr-xr-x [NsQuota: 1200/1201, SsQuota: -/-]
> [hadp@hadoop]$ hdfs dfs -rm hdfs://ns-fed/ns10t/ns1mountpoint/aa.99
> rm: Failed to move to trash: hdfs://ns-fed/ns10t/ns1mountpoint/aa.99: 
> The NameSpace quota (directories and files) is exceeded: quota=1200 file 
> count=1201
> [hadp@hadoop]$ hdfs dfs -rm -skipTrash 
> hdfs://ns-fed/ns10t/ns1mountpoint/aa.99
> rm: The NameSpace quota (directories and files) is exceeded: quota=1200 file 
> count=1201
> {code}
> I think we should add a parameter to the method *getLocationsForPath* to 
> determine whether quota verification needs to be performed for the operation, 
> e.g. for the mv source directory and the rm target directory.






[jira] [Updated] (HDFS-13462) Add BIND_HOST configuration for JournalNode's HTTP and RPC Servers

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13462:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> Add BIND_HOST configuration for JournalNode's HTTP and RPC Servers
> --
>
> Key: HDFS-13462
> URL: https://issues.apache.org/jira/browse/HDFS-13462
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, journal-node
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13462.000.patch, HDFS-13462.001.patch, 
> HDFS-13462.002.patch, HDFS-13462_branch-2.000.patch
>
>
> Allow a configurable bind host for the JournalNode's HTTP and RPC servers so 
> that the hostname on which each server accepts connections can be overridden; 
> a sketch of the usual pattern follows.
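
A sketch of the usual Hadoop bind-host resolution pattern this follows; the 
property names below are illustrative and may not match the exact keys the 
patch adds:

{code:java}
import java.util.Properties;

/** Sketch only: resolve the listen host for an RPC/HTTP server. */
public class BindHostSketch {
  static String getBindHost(Properties conf, String bindHostKey,
      String addressKey, String defaultAddress) {
    String bindHost = conf.getProperty(bindHostKey);
    if (bindHost != null && !bindHost.isEmpty()) {
      return bindHost; // explicit override, e.g. "0.0.0.0"
    }
    // Fall back to the host part of the configured address.
    String address = conf.getProperty(addressKey, defaultAddress);
    return address.split(":")[0];
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("dfs.journalnode.rpc-bind-host", "0.0.0.0");
    System.out.println(getBindHost(conf, "dfs.journalnode.rpc-bind-host",
        "dfs.journalnode.rpc-address", "0.0.0.0:8485"));
  }
}
{code}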






[jira] [Commented] (HDFS-13478) RBF: Disabled Nameservice store API

2018-05-08 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467849#comment-16467849
 ] 

Yongjun Zhang commented on HDFS-13478:
--

Hi [~elgoiri], [~linyiqun],

Thanks for your work on RBF.

We don't have a 3.0.3 release yet, so all the jiras whose Fix Version/s is set 
to 3.0.4 should really be 3.0.3. I wonder whether you intended the RBF fixes 
for 3.0.4 instead of 3.0.3?

If you really meant 3.0.3, we will need to change the Fix Version/s field of 
these jiras to 3.0.3. Would you please let me know ASAP?

Thanks.

> RBF: Disabled Nameservice store API
> ---
>
> Key: HDFS-13478
> URL: https://issues.apache.org/jira/browse/HDFS-13478
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.4
>
> Attachments: HDFS-13478.000.patch, HDFS-13478.001.patch, 
> HDFS-13478.002.patch, HDFS-13478.003.patch, HDFS-13478.004.patch, 
> HDFS-13478.005.patch
>
>
> We have a subcluster in our federation that is for testing and is 
> misbehaving. This has a negative impact on the performance of operations 
> that go to every subcluster (e.g., renewLease() or setSafeMode()).






[jira] [Updated] (HDFS-13435) RBF: Improve the error loggings for printing the stack trace

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13435:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: Improve the error loggings for printing the stack trace
> 
>
> Key: HDFS-13435
> URL: https://issues.apache.org/jira/browse/HDFS-13435
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Affects Versions: 3.0.1
>Reporter: Yiqun Lin
>Assignee: Yiqun Lin
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13435.001.patch, HDFS-13435.002.patch, 
> HDFS-13435.003.patch
>
>
> There are many places that use {{Logger.error(String format, Object... 
> arguments)}} incorrectly.
>  An example:
> {code:java}
> LOG.error("Cannot remove {}", path, e);
> {code}
> The exception passed here carries no meaning and won't be printed. It should 
> be updated to
> {code:java}
> LOG.error("Cannot remove {}: {}.", path, e.getMessage());
> {code}
> or 
> {code:java}
> LOG.error("Cannot remove " + path, e);
> {code}






[jira] [Updated] (HDFS-13453) RBF: getMountPointDates should fetch latest subdir time/date when parent dir is not present but /parent/child dirs are present in mount table

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13453:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: getMountPointDates should fetch latest subdir time/date when parent dir 
> is not present but /parent/child dirs are present in mount table
> -
>
> Key: HDFS-13453
> URL: https://issues.apache.org/jira/browse/HDFS-13453
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Dibyendu Karmakar
>Assignee: Dibyendu Karmakar
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13453-000.patch, HDFS-13453-001.patch, 
> HDFS-13453-002.patch, HDFS-13453-003.patch
>
>
> [HDFS-13386|https://issues.apache.org/jira/browse/HDFS-13386] does not handle 
> the case where /parent is not present in the mount table but /parent/subdir 
> is.
> In this case getMountPointDates is not able to fetch the latest time for 
> /parent, because /parent is not present in the mount table.
> For this scenario we will display the latest modified subdir date/time as the 
> /parent modified time.






[jira] [Updated] (HDFS-13353) RBF: TestRouterWebHDFSContractCreate failed

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-13353:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> RBF: TestRouterWebHDFSContractCreate failed
> ---
>
> Key: HDFS-13353
> URL: https://issues.apache.org/jira/browse/HDFS-13353
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: test
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HDFS-13353.1.patch, HDFS-13353.2.patch, 
> HDFS-13353.3.patch
>
>
> {noformat}
> [ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 21.685 s <<< FAILURE! - in 
> org.apache.hadoop.fs.contract.router.web.TestRouterWebHDFSContractCreate
> [ERROR] 
> testCreatedFileIsVisibleOnFlush(org.apache.hadoop.fs.contract.router.web.TestRouterWebHDFSContractCreate)
>   Time elapsed: 0.147 s  <<< ERROR!
> java.io.FileNotFoundException: expected path to be visible before file 
> closed: not found 
> webhdfs://0.0.0.0:43796/test/testCreatedFileIsVisibleOnFlush in 
> webhdfs://0.0.0.0:43796/test
>   at 
> org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:936)
>   at 
> org.apache.hadoop.fs.contract.ContractTestUtils.assertPathExists(ContractTestUtils.java:914)
>   at 
> org.apache.hadoop.fs.contract.AbstractFSContractTestBase.assertPathExists(AbstractFSContractTestBase.java:294)
>   at 
> org.apache.hadoop.fs.contract.AbstractContractCreateTest.testCreatedFileIsVisibleOnFlush(AbstractContractCreateTest.java:254)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
> Caused by: java.io.FileNotFoundException: File does not exist: 
> /test/testCreatedFileIsVisibleOnFlush
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toIOException(WebHdfsFileSystem.java:549)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$800(WebHdfsFileSystem.java:136)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.shouldRetry(WebHdfsFileSystem.java:877)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:843)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:642)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:680)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:676)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:1074)
>   at 
> org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:1085)
>   at 
> org.apache.hadoop.fs.contract.ContractTestUtils.verifyPathExists(ContractTestUtils.java:930)
>   ... 15 more
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): 

[jira] [Updated] (HDFS-12981) renameSnapshot a Non-Existent snapshot to itself should throw error

2018-05-08 Thread Yongjun Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-12981:
-
Fix Version/s: (was: 3.0.4)
   3.0.3

> renameSnapshot a Non-Existent snapshot to itself should throw error
> ---
>
> Key: HDFS-12981
> URL: https://issues.apache.org/jira/browse/HDFS-12981
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 2.6.0
>Reporter: Sailesh Patel
>Assignee: Kitti Nanasi
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 3.1.1, 3.0.3
>
> Attachments: HDFS-12981-branch-2.6.0.001.patch, 
> HDFS-12981-branch-2.6.0.002.patch, HDFS-12981.001.patch, 
> HDFS-12981.002.patch, HDFS-12981.003.patch, HDFS-12981.004.patch
>
>
> When trying to rename a non-existent HDFS snapshot to ITSELF, no error is 
> reported and the command exits with a success code.
> The steps to reproduce this issue are:
> hdfs dfs -mkdir /tmp/dir1
> hdfs dfsadmin -allowSnapshot /tmp/dir1
> hdfs dfs -createSnapshot /tmp/dir1 snap1_dir
> Renaming from non-existent to another_non-existent gives an error and return 
> code 1.  This is correct:
>   hdfs dfs -renameSnapshot /tmp/dir1 nonexist another_nonexist; echo $?
>   renameSnapshot: The snapshot nonexist does not exist for directory /tmp/dir1
> Renaming from non-existent to the same non-existent name gives no error and 
> return code 0, instead of an error and return code 1:
>   hdfs dfs -renameSnapshot /tmp/dir1 nonexist nonexist; echo $?
> Current behavior:   No error and return code 0.
> Expected behavior:  An error returned and return code 1.





[jira] [Comment Edited] (HDFS-13314) NameNode should optionally exit if it detects FsImage corruption

2018-03-30 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420978#comment-16420978
 ] 

Yongjun Zhang edited comment on HDFS-13314 at 3/30/18 9:55 PM:
---

{quote}From: Arpit Agarwal

Hi Yongjun, thanks for looking at the Jira! Please post your comments in the 
Jira also for support. 

 
 # Yes we saw duplicate entries.
 # The crash we saw was a NPE due to the referred INode being absent. The check 
looks for such dangling references. I don’t think we have seen a crash at the 
location you pointed out.

{code:java}
private INodeReference loadINodeReference(
    INodeReferenceSection.INodeReference r) throws IOException {
  long referredId = r.getReferredId();
  INode referred = fsDir.getInode(referredId);
  WithCount withCount = (WithCount) referred.getParentReference();
  // << Crashes here (NPE) because referred is null.
{code}
 # We have not seen misordered entries yet. Also, the *!misordered* check was 
deliberate. Once there is one such entry the whole list is compromised.
 # The Assertion actually results in a runtime exception which fails the 
request. However we suspect that the list was somehow corrupted by other means, 
not the insert call. We are not sure how it happened.

 

Let me know if you have any concerns or ideas for improving the checks. We can 
certainly do a follow up jira.
{quote}



> NameNode should optionally exit if it detects FsImage corruption
> 
>
> Key: HDFS-13314
> URL: https://issues.apache.org/jira/browse/HDFS-13314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.2
>
> Attachments: HDFS-13314.01.patch, HDFS-13314.02.patch, 
> HDFS-13314.03.patch, HDFS-13314.04.patch, HDFS-13314.05.patch
>
>
> The NameNode should optionally exit after writing an FsImage if it detects 
> the following kinds of corruptions:
> # INodeReference pointing to non-existent INode
> # Duplicate entries in snapshot deleted diff list.





