[jira] [Created] (HDFS-13763) HDFS Balancer - include path to only move blocks for a certain directory tree

2018-07-24 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-13763:
--

 Summary: HDFS Balancer - include path to only move blocks for a 
certain directory tree
 Key: HDFS-13763
 URL: https://issues.apache.org/jira/browse/HDFS-13763
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover
Affects Versions: 3.0.3
Reporter: Hari Sekhon


Improvement Request to add a switch to the HDFS Balancer so that it only moves 
blocks belonging to files under the given path(s).

This whitelist behaviour is useful for moving data such as the HBase snapshot 
archive without touching live HBase blocks, where moving blocks would ruin data 
locality.
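
For context, a minimal sketch (not from this JIRA) of how the candidate block set 
for such a path whitelist could be gathered with the existing FileSystem API; the 
path /hbase/archive is only an illustrative placeholder:

{code:java}
// Illustrative only: enumerate the blocks under a directory tree, i.e. the
// set a path-restricted Balancer would be limited to moving.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListBlocksUnderPath {
    public static void main(String[] args) throws Exception {
        // Hypothetical example path, e.g. an HBase snapshot/archive area.
        Path root = new Path(args.length > 0 ? args[0] : "/hbase/archive");
        FileSystem fs = FileSystem.get(new Configuration());

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
        long blocks = 0;
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            for (BlockLocation loc : status.getBlockLocations()) {
                blocks++;
                System.out.println(status.getPath() + " offset=" + loc.getOffset()
                        + " hosts=" + String.join(",", loc.getHosts()));
            }
        }
        System.out.println("blocks under " + root + ": " + blocks);
    }
}
{code}

A path whitelist switch on the Balancer would, in effect, restrict its source-block 
selection to this kind of set instead of considering every block on an 
over-utilised datanode.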






[jira] [Updated] (HDFS-13740) Prometheus /metrics http endpoint for monitoring integration

2018-07-20 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13740:
---
Affects Version/s: 3.0.3

> Prometheus /metrics http endpoint for monitoring integration
> 
>
> Key: HDFS-13740
> URL: https://issues.apache.org/jira/browse/HDFS-13740
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: metrics
>Affects Versions: 2.7.3, 3.0.3
>Reporter: Hari Sekhon
>Priority: Major
>
> Feature Request to add Prometheus /metrics http endpoint for monitoring 
> integration:
> [https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E]
> Prometheus metrics format for that endpoint:
> [https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md]
>  
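
As a rough illustration of what the requested endpoint involves (this is not an 
existing Hadoop API), the sketch below serves a couple of made-up NameNode-style 
gauges in the Prometheus text exposition format described in the second link, 
using only the JDK's built-in HTTP server; the metric names, values and port are 
placeholders for the example:

{code:java}
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class PrometheusMetricsSketch {
    public static void main(String[] args) throws Exception {
        // Arbitrary port for the sketch; a real integration would hang the
        // handler off the daemon's existing HTTP server instead.
        HttpServer server = HttpServer.create(new InetSocketAddress(9099), 0);
        server.createContext("/metrics", exchange -> {
            // Hypothetical metric names/values; a real integration would map
            // the Hadoop metrics registry into this text format.
            String body =
                "# HELP hdfs_namenode_capacity_used_bytes Used cluster capacity.\n"
                + "# TYPE hdfs_namenode_capacity_used_bytes gauge\n"
                + "hdfs_namenode_capacity_used_bytes 123456789\n"
                + "# HELP hdfs_namenode_files_total Total files and directories.\n"
                + "# TYPE hdfs_namenode_files_total gauge\n"
                + "hdfs_namenode_files_total 42\n";
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type",
                    "text/plain; version=0.0.4; charset=utf-8");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
    }
}
{code}

Prometheus would then scrape such an endpoint with an ordinary scrape job pointed 
at the host and port, per the scrape_config documentation in the first link.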






[jira] [Updated] (HDFS-13740) Prometheus /metrics http endpoint for monitoring integration

2018-07-20 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13740:
---
Component/s: metrics

> Prometheus /metrics http endpoint for monitoring integration
> 
>
> Key: HDFS-13740
> URL: https://issues.apache.org/jira/browse/HDFS-13740
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: metrics
>Affects Versions: 2.7.3, 3.0.3
>Reporter: Hari Sekhon
>Priority: Major
>
> Feature Request to add Prometheus /metrics http endpoint for monitoring 
> integration:
> [https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E]
> Prometheus metrics format for that endpoint:
> [https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md]
>  






[jira] [Updated] (HDFS-13740) Prometheus /metrics http endpoint for monitoring integration

2018-07-19 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13740:
---
Description: 
Feature Request to add Prometheus /metrics http endpoint for monitoring 
integration:

[https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E]

Prometheus metrics format for that endpoint:

[https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md]

 

  was:
Feature Request to add Prometheus /metrics http endpoint for monitoring 
integration:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E


> Prometheus /metrics http endpoint for monitoring integration
> 
>
> Key: HDFS-13740
> URL: https://issues.apache.org/jira/browse/HDFS-13740
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Affects Versions: 2.7.3
>Reporter: Hari Sekhon
>Priority: Major
>
> Feature Request to add Prometheus /metrics http endpoint for monitoring 
> integration:
> [https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E]
> Prometheus metrics format for that endpoint:
> [https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md]
>  






[jira] [Updated] (HDFS-13742) HDFS Upgrade Domains dynamic placement policy using dfsadmin commands and scripting API to be automation friendly for scripted datanode additions to clusters

2018-07-18 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13742:
---
Summary: HDFS Upgrade Domains dynamic placement policy using dfsadmin 
commands and scripting API to be automation friendly for scripted datanode 
additions to clusters  (was: HDFS Upgrade Domains dynamic policy with dfsadmin 
commands and scripting API to be automation friendly for scripted datanode 
additions to clusters)

> HDFS Upgrade Domains dynamic placement policy using dfsadmin commands and 
> scripting API to be automation friendly for scripted datanode additions to 
> clusters
> -
>
> Key: HDFS-13742
> URL: https://issues.apache.org/jira/browse/HDFS-13742
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, hdfs, namenode, rolling upgrades, 
> scripts, shell, tools
>Affects Versions: 3.1.0
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to change HDFS Upgrade Domains placement policy to be 
> dynamically configurable.
> Instead of using a basic static JSON file, use dfsadmin commands and 
> script-able REST API based management for better automation when scripting 
> datanode additions to a cluster.
> [http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]
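
For context, the static configuration this request wants to make dynamic is a JSON 
hosts file (see the linked HdfsUpgradeDomain page). Below is a minimal sketch of 
the kind of scripted file rewrite that automation has to do today when a datanode 
is added; the file path, host name and upgrade-domain id are placeholders, and the 
hostName/upgradeDomain field names are assumed from that documentation (Jackson is 
used purely for illustration):

{code:java}
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.File;

public class AddHostToUpgradeDomainFile {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the combined hosts file referenced by dfs.hosts.
        File hostsJson = new File("/etc/hadoop/conf/dfs.hosts.json");
        ObjectMapper mapper = new ObjectMapper();

        // The file is expected to contain a JSON array of host entries.
        ArrayNode hosts = hostsJson.exists()
                ? (ArrayNode) mapper.readTree(hostsJson)
                : mapper.createArrayNode();

        // Append the new datanode with its upgrade domain, then rewrite the
        // file; the NameNode still has to be told to re-read it afterwards.
        ObjectNode entry = hosts.addObject();
        entry.put("hostName", "new-datanode-042.example.com");
        entry.put("upgradeDomain", "ud_03");

        mapper.writerWithDefaultPrettyPrinter().writeValue(hostsJson, hosts);
    }
}
{code}

The request is essentially that a dfsadmin subcommand or REST call would replace 
this file juggling and the follow-up refresh step.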






[jira] [Updated] (HDFS-13742) HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API to be automation friendly for scripted datanode additions to clusters

2018-07-18 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13742:
---
Description: 
Improvement Request to change HDFS Upgrade Domains from using a basic 
JSON file to dfsadmin commands and script-able REST API based 
management, for better automation when scripting datanode additions to a cluster.

[http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]

  was:
Improvement Request to change HDFS Upgrade Domains from basic JSON file to 
online command and script-able Rest API based management for better automation 
when scripting datanode additions to a cluster.

[http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]


> HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API 
> to be automation friendly for scripted datanode additions to clusters
> --
>
> Key: HDFS-13742
> URL: https://issues.apache.org/jira/browse/HDFS-13742
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, hdfs, namenode, rolling upgrades, 
> scripts, shell, tools
>Affects Versions: 3.1.0
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to change HDFS Upgrade Domains from using a basic 
> JSON file to dfsadmin commands and script-able REST API based management, 
> for better automation when scripting datanode additions to a 
> cluster.
> [http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]






[jira] [Updated] (HDFS-13742) HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API to be automation friendly for scripted datanode additions to clusters

2018-07-18 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13742:
---
Description: 
Improvement Request to change HDFS Upgrade Domains placement policy to be 
dynamically configurable.

Instead of using a basic static JSON file, use dfsadmin commands and 
script-able REST API based management for better automation when scripting 
datanode additions to a cluster.

[http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]

  was:
Improvement Request to change HDFS Upgrade Domains from using a basic 
JSON file to dfsadmin commands and script-able REST API based 
management, for better automation when scripting datanode additions to a cluster.

[http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]


> HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API 
> to be automation friendly for scripted datanode additions to clusters
> --
>
> Key: HDFS-13742
> URL: https://issues.apache.org/jira/browse/HDFS-13742
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, hdfs, namenode, rolling upgrades, 
> scripts, shell, tools
>Affects Versions: 3.1.0
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to change HDFS Upgrade Domains placement policy to be 
> dynamically configurable.
> Instead of using a basic static JSON file, use dfsadmin commands and 
> script-able REST API based management for better automation when scripting 
> datanode additions to a cluster.
> [http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]






[jira] [Updated] (HDFS-13742) HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API to be automation friendly for scripted datanode additions etc

2018-07-18 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13742:
---
Summary: HDFS Upgrade Domains dynamic policy with dfsadmin commands and 
scripting API to be automation friendly for scripted datanode additions etc  
(was: HDFS Upgrade Domains dynamic policy with commands and scripting API to be 
automation friendly for scripted datanodes additions etc)

> HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API 
> to be automation friendly for scripted datanode additions etc
> --
>
> Key: HDFS-13742
> URL: https://issues.apache.org/jira/browse/HDFS-13742
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, hdfs, namenode, rolling upgrades, 
> scripts, shell, tools
>Affects Versions: 3.1.0
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to change HDFS Upgrade Domains from basic JSON file to 
> online command and script-able Rest API based management for better 
> automation when scripting datanode additions to a cluster.
> [http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]






[jira] [Updated] (HDFS-13742) HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API to be automation friendly for scripted datanode additions to clusters

2018-07-18 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13742:
---
Summary: HDFS Upgrade Domains dynamic policy with dfsadmin commands and 
scripting API to be automation friendly for scripted datanode additions to 
clusters  (was: HDFS Upgrade Domains dynamic policy with dfsadmin commands and 
scripting API to be automation friendly for scripted datanode additions etc)

> HDFS Upgrade Domains dynamic policy with dfsadmin commands and scripting API 
> to be automation friendly for scripted datanode additions to clusters
> --
>
> Key: HDFS-13742
> URL: https://issues.apache.org/jira/browse/HDFS-13742
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, hdfs, namenode, rolling upgrades, 
> scripts, shell, tools
>Affects Versions: 3.1.0
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to change HDFS Upgrade Domains from basic JSON file to 
> online command and script-able Rest API based management for better 
> automation when scripting datanode additions to a cluster.
> [http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]






[jira] [Updated] (HDFS-13742) HDFS Upgrade Domains dynamic policy with commands and scripting API to be automation friendly for scripted datanodes additions etc

2018-07-18 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13742:
---
Summary: HDFS Upgrade Domains dynamic policy with commands and scripting 
API to be automation friendly for scripted datanodes additions etc  (was: Make 
HDFS Upgrade Domains more dynamic / automation friendly for scripted additions 
when adding datanodes etc)

> HDFS Upgrade Domains dynamic policy with commands and scripting API to be 
> automation friendly for scripted datanodes additions etc
> --
>
> Key: HDFS-13742
> URL: https://issues.apache.org/jira/browse/HDFS-13742
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, hdfs, namenode, rolling upgrades, 
> scripts, shell, tools
>Affects Versions: 3.1.0
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to change HDFS Upgrade Domains from basic JSON file to 
> online command and script-able Rest API based management for better 
> automation when scripting datanode additions to a cluster.
> [http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]






[jira] [Created] (HDFS-13742) Make HDFS Upgrade Domains more dynamic / automation friendly for scripted additions when adding datanodes etc

2018-07-18 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-13742:
--

 Summary: Make HDFS Upgrade Domains more dynamic / automation 
friendly for scripted additions when adding datanodes etc
 Key: HDFS-13742
 URL: https://issues.apache.org/jira/browse/HDFS-13742
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover, hdfs, namenode, rolling upgrades, 
scripts, shell, tools
Affects Versions: 3.1.0
Reporter: Hari Sekhon


Improvement Request to change HDFS Upgrade Domains from basic JSON file to 
online command and script-able Rest API based management for better automation 
when scripting datanode additions to a cluster.

[http://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html]






[jira] [Commented] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across

2018-07-18 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547572#comment-16547572
 ] 

Hari Sekhon commented on HDFS-13739:


HDFS-7541 helps with the first issue, although not the second, and isn't as 
simple a solution as disabling the rack-local write preference and enforcing 
rack-remote writes for the 2nd and 3rd replicas.

> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
> Key: HDFS-13739
> URL: https://issues.apache.org/jira/browse/HDFS-13739
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Request to be able to disable Rack Local Write preference / Write All 
> Replicas to different Racks.
> Current HDFS write pattern of "local node, rack local node, other rack node" 
> is good for most purposes but there are at least 2 scenarios where this is 
> not ideal:
>  # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
> replica. If a single datanode failed it would likely cause some data outage 
> or even data loss if the rack is lost or an upgrade fails (or perhaps it's a 
> rack rebuild). Setting replicas to 4 would reduce write performance and waste 
> storage which is currently the only workaround to that issue.
>  # Major Storage Imbalance across datanodes when there is an uneven layout of 
> datanodes across racks - some nodes fill up while others are half empty.
> I have observed this storage imbalance on a cluster where half the nodes were 
> 85% full and the other half were only 50% full.
> Rack layouts like the following illustrate this - the nodes in the same rack 
> will only choose to send half their block replicas to each other, so they 
> will fill up first, while other nodes will receive far fewer replica blocks:
> {code:java}
> NumNodes - Rack 
> 2 - rack 1
> 2 - rack 2
> 1 - rack 3
> 1 - rack 4 
> 1 - rack 5
> 1 - rack 6{code}
> In this case if I reduce the number of replicas to 2 then I get an almost 
> perfect spread of blocks across all datanodes because HDFS has no choice but 
> to maintain the only 2nd replica on a different rack. If I increase the 
> replicas back to 3 it goes back to 85% on half the nodes and 50% on the other 
> half, because the extra replicas choose to replicate only to rack local nodes.
> Why not just run the HDFS balancer to fix it you might say? This is a heavily 
> loaded HBase cluster - aside from destroying HBase's data locality and 
> performance by moving blocks out from underneath RegionServers - as soon as 
> an HBase major compaction occurs (at least weekly), all blocks will get 
> re-written by HBase and the HDFS client will again write to local node, rack 
> local node, other rack node - resulting in the same storage imbalance again. 
> Hence this cannot be solved by running HDFS balancer on HBase clusters - or 
> for any application sitting on top of HDFS that has any HDFS block churn.
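
To make the imbalance mechanism concrete, here is a small self-contained 
simulation (not from the JIRA) of the "local node, rack local node, other rack 
node" pattern described above, run over the 2/2/1/1/1/1 rack layout from the 
example; the node names, block count and random seed are arbitrary, but the skew 
in the output is the effect being reported:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Predicate;

public class RackLocalWriteSkewSim {
    public static void main(String[] args) {
        // Rack layout from the description: 2,2,1,1,1,1 datanodes per rack.
        int[] rackSizes = {2, 2, 1, 1, 1, 1};
        List<String> nodes = new ArrayList<>();
        Map<String, Integer> rackOf = new HashMap<>();
        for (int r = 0; r < rackSizes.length; r++) {
            for (int n = 0; n < rackSizes[r]; n++) {
                String node = "rack" + (r + 1) + "-node" + (n + 1);
                nodes.add(node);
                rackOf.put(node, r);
            }
        }

        Map<String, Integer> blocksPerNode = new HashMap<>();
        Random rnd = new Random(42);
        int blocksPerWriter = 10_000;

        for (String writer : nodes) {
            for (int b = 0; b < blocksPerWriter; b++) {
                // Replica 1: the local (writer) node.
                String r1 = writer;
                // Replica 2: a rack-local node if one exists, per the pattern
                // described above; otherwise fall back to another rack.
                String r2 = pick(nodes, rnd, n -> !n.equals(r1)
                        && rackOf.get(n).equals(rackOf.get(r1)));
                if (r2 == null) {
                    r2 = pick(nodes, rnd,
                            n -> !rackOf.get(n).equals(rackOf.get(r1)));
                }
                // Replica 3: a random node on a different rack.
                String finalR2 = r2;
                String r3 = pick(nodes, rnd, n -> !n.equals(finalR2)
                        && !rackOf.get(n).equals(rackOf.get(r1)));
                for (String n : new String[] {r1, r2, r3}) {
                    blocksPerNode.merge(n, 1, Integer::sum);
                }
            }
        }
        nodes.forEach(n -> System.out.println(n + " " + blocksPerNode.get(n)));
    }

    private static String pick(List<String> nodes, Random rnd,
                               Predicate<String> ok) {
        List<String> candidates = new ArrayList<>();
        for (String n : nodes) {
            if (ok.test(n)) {
                candidates.add(n);
            }
        }
        return candidates.isEmpty() ? null
                : candidates.get(rnd.nextInt(candidates.size()));
    }
}
{code}

In this toy run the nodes in the two-node racks accumulate markedly more block 
replicas than the single-node racks, which is the direction of the 85%/50% fill 
pattern reported above; placing the 2nd and 3rd replicas on distinct racks instead 
removes that bias.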






[jira] [Updated] (HDFS-13740) Prometheus /metrics http endpoint for monitoring integration

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13740:
---
Summary: Prometheus /metrics http endpoint for monitoring integration  
(was: Prometheus /metrics http endpoint for metrics monitoring integration)

> Prometheus /metrics http endpoint for monitoring integration
> 
>
> Key: HDFS-13740
> URL: https://issues.apache.org/jira/browse/HDFS-13740
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Affects Versions: 2.7.3
>Reporter: Hari Sekhon
>Priority: Major
>
> Feature Request to add Prometheus /metrics http endpoint for monitoring 
> integration:
> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E






[jira] [Created] (HDFS-13740) Prometheus /metrics http endpoint for metrics monitoring integration

2018-07-17 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-13740:
--

 Summary: Prometheus /metrics http endpoint for metrics monitoring 
integration
 Key: HDFS-13740
 URL: https://issues.apache.org/jira/browse/HDFS-13740
 Project: Hadoop HDFS
  Issue Type: New Feature
Affects Versions: 2.7.3
Reporter: Hari Sekhon


Feature Request to add Prometheus /metrics http endpoint for monitoring 
integration:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cscrape_config%3E






[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Request to be able to disable Rack Local Write preference / Send All Replicas 
to different Racks.

Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single datanode failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node - resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single datanode failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node - resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Request to be able to disable Rack Local Write preference / Write All Replicas 
to different Racks.

Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single datanode failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node - resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Request to be able to disable Rack Local Write preference / Send All Replicas 
to different Racks.

Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single datanode failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node - resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single datanode failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node - resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node - resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
> Key: 

[jira] [Comment Edited] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance ac

2018-07-17 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546265#comment-16546265
 ] 

Hari Sekhon edited comment on HDFS-13739 at 7/17/18 10:54 AM:
--

There are a couple of other good examples, which I am linking, of where data 
placement control is of benefit - definitely something major that can be 
improved upon in HDFS.

Perhaps this is something that needs to be solved once, more generically, as 
overall data placement control.

However, for the scenarios described above, it is sufficient to just ensure all 
replicas go to different racks.


was (Author: harisekhon):
There are a couple of other good examples, which I am linking, of where data 
placement control is of benefit - definitely something major that can be 
improved upon in HDFS.

Perhaps this is something that needs to be solved once, more generically.

> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
> Key: HDFS-13739
> URL: https://issues.apache.org/jira/browse/HDFS-13739
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Current HDFS write pattern of "local node, rack local node, other rack node" 
> is good for most purposes but there are at least 2 scenarios where this is 
> not ideal:
>  # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
> replica. If a single data node failed it would likely cause some data outage 
> or even data loss if the rack is lost or an upgrade fails (or perhaps it's a 
> rack rebuild). Setting replicas to 4 would reduce write performance and waste 
> storage which is currently the only workaround to that issue.
>  # Major Storage Imbalance across datanodes when there is an uneven layout of 
> datanodes across racks - some nodes fill up while others are half empty.
> I have observed this storage imbalance on a cluster where half the nodes were 
> 85% full and the other half were only 50% full.
> Rack layouts like the following illustrate this - the nodes in the same rack 
> will only choose to send half their block replicas to each other, so they 
> will fill up first, while other nodes will receive far fewer replica blocks:
> {code:java}
> NumNodes - Rack 
> 2 - rack 1
> 2 - rack 2
> 1 - rack 3
> 1 - rack 4 
> 1 - rack 5
> 1 - rack 6{code}
> In this case if I reduce the number of replicas to 2 then I get an almost 
> perfect spread of blocks across all datanodes because HDFS has no choice but 
> to maintain the only 2nd replica on a different rack. If I increase the 
> replicas back to 3 it goes back to 85% on half the nodes and 50% on the other 
> half, because the extra replicas choose to replicate only to rack local nodes.
> Why not just run the HDFS balancer to fix it you might say? This is a heavily 
> loaded HBase cluster - aside from destroying HBase's data locality and 
> performance by moving blocks out from underneath RegionServers - as soon as 
> an HBase major compaction occurs (at least weekly), all blocks will get 
> re-written by HBase and the HDFS client will again write to local node, rack 
> local node, other rack node - resulting in the same storage imbalance again. 
> Hence this cannot be solved by running HDFS balancer on HBase clusters - or 
> for any application sitting on top of HDFS that has any HDFS block churn.
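
Regarding the comment above about ensuring all replicas go to different racks: as 
a hedged pointer rather than a confirmed fix, newer Hadoop releases are believed 
to allow plugging an alternative block placement policy, including a 
rack-fault-tolerant one that puts every replica on a distinct rack. Both the 
configuration key and the class name in the sketch below are assumptions that 
should be verified against the target Hadoop version:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RackFaultTolerantPlacementConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed key/class (normally set in hdfs-site.xml rather than in
        // code): swap the default placement policy for one that spreads
        // all replicas across distinct racks. Verify both names before use.
        conf.set("dfs.block.replicator.classname",
                "org.apache.hadoop.hdfs.server.blockmanagement."
                        + "BlockPlacementPolicyRackFaultTolerant");
        System.out.println(conf.get("dfs.block.replicator.classname"));
    }
}
{code}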






[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node - resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node and resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node and resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node and resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (or perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node and resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node and resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing the last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (perhaps it's a rack 
rebuild). Setting replicas to 4 would reduce write performance and waste 
storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.
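
As an aside, the replication-factor observation above is easy to reproduce on a throwaway path with the standard CLI (a sketch only - not a suggestion to run production data at 2 replicas):
{code:java}
# Temporarily drop a test dataset to 2 replicas and let placement spread out
hdfs dfs -setrep -w 2 /tmp/replication-test

# Check which racks the replicas actually landed on
hdfs fsck /tmp/replication-test -files -blocks -locations -racks

# Restore the default replication factor afterwards
hdfs dfs -setrep -w 3 /tmp/replication-test
{code}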

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing the last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (perhaps it's a complete 
rebuild upgrade). Setting replicas to 4 would reduce write performance and 
waste storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
>

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance leaves data at risk of losing the last remaining 
replica. If a single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or an upgrade fails (perhaps it's a complete 
rebuild upgrade). Setting replicas to 4 would reduce write performance and 
waste storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance / Upgrades leave data at risk of losing the last 
remaining replica. If a single data node failed it would likely cause some data 
outage or even data loss if the rack is lost or the upgrade fails (perhaps it's 
a complete rebuild upgrade). Setting replicas to 4 would reduce write 
performance and waste storage which is currently the only workaround to that 
issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Maintenance / Upgrades leave data at risk of losing the last 
remaining replica. If a single data node failed it would likely cause some data 
outage or even data loss if the rack is lost or the upgrade fails (perhaps it's 
a complete rebuild upgrade). Setting replicas to 4 would reduce write 
performance and waste storage which is currently the only workaround to that 
issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node and resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Upgrades pose a risk of losing the last remaining replica. If a 
single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or the upgrade fails (perhaps it's a 
complete rebuild upgrade). Setting replicas to 4 would reduce write performance 
and waste storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node and resulting in the same storage imbalance again. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major Storage Imbalance across Da

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Summary: Option to disable Rack Local Write Preference to avoid 2 issues - 
1. Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
Storage Imbalance across DataNodes caused by uneven spread of Datanodes across 
Racks  (was: Option to disable Rack Local Write Preference to avoid 2 issues - 
1. Rack-by-Rack Maintenance risks losing last remaining replica, 2. avoid Major 
Storage Imbalance across DataNodes caused by uneven spread of Datanodes across 
Racks)

> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance leaves last data replica at risk, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> ---
>
> Key: HDFS-13739
> URL: https://issues.apache.org/jira/browse/HDFS-13739
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Current HDFS write pattern of "local node, rack local node, other rack node" 
> is good for most purposes but there are at least 2 scenarios where this is 
> not ideal:
>  # Rack-by-Rack Upgrades pose a risk of losing the last remaining replica. If a 
> single data node failed it would likely cause some data outage or 
> even data loss if the rack is lost or the upgrade fails (perhaps it's a 
> complete rebuild upgrade). Setting replicas to 4 would reduce write 
> performance and waste storage which is currently the only workaround to that 
> issue.
>  # Major Storage Imbalance across datanodes when there is an uneven layout of 
> datanodes across racks - some nodes fill up while others are half empty.
> I have observed this storage imbalance on a cluster where half the nodes were 
> 85% full and the other half were only 50% full.
> Rack layouts like the following illustrate this - the nodes in the same rack 
> will only choose to send half their block replicas to each other, so they 
> will fill up first, while other nodes will receive far fewer replica blocks:
> {code:java}
> NumNodes - Rack 
> 2 - rack 1
> 2 - rack 2
> 1 - rack 3
> 1 - rack 4 
> 1 - rack 5
> 1 - rack 6{code}
> In this case if I reduce the number of replicas to 2 then I get an almost 
> perfect spread of blocks across all datanodes because HDFS has no choice but 
> to maintain the only 2nd replica on a different rack. If I increase the 
> replicas back to 3 it goes back to 85% on half the nodes and 50% on the other 
> half, because the extra replicas choose to replicate only to rack local nodes.
> Why not just run the HDFS balancer to fix it you might say? This is a heavily 
> loaded HBase cluster - aside from destroying HBase's data locality and 
> performance by moving blocks out from underneath RegionServers - as soon as 
> an HBase major compaction occurs (at least weekly), all blocks will get 
> re-written by HBase and the HDFS client will again write to local node, rack 
> local node, other rack node, resulting in the same storage imbalance. Hence 
> this cannot be solved by running HDFS balancer on HBase clusters 
> - or for any application sitting on top of HDFS that has any HDFS block churn.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance risks losing last remaining replica, 2. avoid Major Storage Imbalance across

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Upgrades pose a risk of losing the last remaining replica. If a 
single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or the upgrade fails (perhaps it's a 
complete rebuild upgrade). Setting replicas to 4 would reduce write performance 
and waste storage which is currently the only workaround to that issue.
 # Major Storage Imbalance across datanodes when there is an uneven layout of 
datanodes across racks - some nodes fill up while others are half empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Upgrades pose a risk of losing the last remaining replica. If a 
single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or the upgrade fails (perhaps it's a 
complete rebuild upgrade). Setting replicas to 4 would reduce write performance 
and waste storage which is currently the only workaround to that issue.
 # when there is an uneven layout of datanodes across racks it can cause major 
storage imbalance across nodes with some nodes filling up and others being half 
empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance risks losing last remaining replica, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> 

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - 1. Rack-by-Rack Maintenance risks losing last remaining replica, 2. avoid Major Storage Imbalance across

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Summary: Option to disable Rack Local Write Preference to avoid 2 issues - 
1. Rack-by-Rack Maintenance risks losing last remaining replica, 2. avoid Major 
Storage Imbalance across DataNodes caused by uneven spread of Datanodes across 
Racks  (was: Option to disable Rack Local Write Preference to avoid 2 issues - 
Whole Rack Maintenance without risk of only 1 remaining replica, and avoid 
Major Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
across Racks)

> Option to disable Rack Local Write Preference to avoid 2 issues - 1. 
> Rack-by-Rack Maintenance risks losing last remaining replica, 2. avoid Major 
> Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
> across Racks
> --
>
> Key: HDFS-13739
> URL: https://issues.apache.org/jira/browse/HDFS-13739
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Current HDFS write pattern of "local node, rack local node, other rack node" 
> is good for most purposes but there are at least 2 scenarios where this is 
> not ideal:
>  # Rack-by-Rack Upgrades pose a risk of losing the last remaining replica. If a 
> single data node failed it would likely cause some data outage or 
> even data loss if the rack is lost or the upgrade fails (perhaps it's a 
> complete rebuild upgrade). Setting replicas to 4 would reduce write 
> performance and waste storage which is currently the only workaround to that 
> issue.
>  # when there is an uneven layout of datanodes across racks it can cause 
> major storage imbalance across nodes with some nodes filling up and others 
> being half empty.
> I have observed this storage imbalance on a cluster where half the nodes were 
> 85% full and the other half were only 50% full.
> Rack layouts like the following illustrate this - the nodes in the same rack 
> will only choose to send half their block replicas to each other, so they 
> will fill up first, while other nodes will receive far fewer replica blocks:
> {code:java}
> NumNodes - Rack 
> 2 - rack 1
> 2 - rack 2
> 1 - rack 3
> 1 - rack 4 
> 1 - rack 5
> 1 - rack 6{code}
> In this case if I reduce the number of replicas to 2 then I get an almost 
> perfect spread of blocks across all datanodes because HDFS has no choice but 
> to maintain the only 2nd replica on a different rack. If I increase the 
> replicas back to 3 it goes back to 85% on half the nodes and 50% on the other 
> half, because the extra replicas choose to replicate only to rack local nodes.
> Why not just run the HDFS balancer to fix it you might say? This is a heavily 
> loaded HBase cluster - aside from destroying HBase's data locality and 
> performance by moving blocks out from underneath RegionServers - as soon as 
> an HBase major compaction occurs (at least weekly), all blocks will get 
> re-written by HBase and the HDFS client will again write to local node, rack 
> local node, other rack node, resulting in the same storage imbalance. Hence 
> this cannot be solved by running HDFS balancer on HBase clusters 
> - or for any application sitting on top of HDFS that has any HDFS block churn.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - Whole Rack Maintenance without risk of only 1 remaining replica, and avoid Major Storage Imbalance acros

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-Rack Upgrades pose a risk of losing the last remaining replica. If a 
single data node failed it would likely cause some data outage or 
even data loss if the rack is lost or the upgrade fails (perhaps it's a 
complete rebuild upgrade). Setting replicas to 4 would reduce write performance 
and waste storage which is currently the only workaround to that issue.
 # when there is an uneven layout of datanodes across racks it can cause major 
storage imbalance across nodes with some nodes filling up and others being half 
empty.

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.

Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but when there is an uneven layout of datanodes across 
racks it can cause major storage imbalance across nodes with some nodes filling 
up and others being half empty.

I have observed this on a cluster where half the nodes were 85% full and the 
other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.


 Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - Whole Rack 
> Maintenance without risk of only 1 remaining replica, and avoid Major Storage 
> Imbalance across DataNodes caused by uneven spread of Datanodes across Racks
> ---
>
> Key: HDFS-13739
> URL: https://issues.apache.org/jira/browse/HDFS-13739
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>

[jira] [Updated] (HDFS-13739) Option to disable Rack Local Write Preference to avoid 2 issues - Whole Rack Maintenance without risk of only 1 remaining replica, and avoid Major Storage Imbalance acros

2018-07-17 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13739:
---
Summary: Option to disable Rack Local Write Preference to avoid 2 issues - 
Whole Rack Maintenance without risk of only 1 remaining replica, and avoid 
Major Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
across Racks  (was: Option to disable Rack Local Write Preference to avoid 
Major Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
across Racks)

> Option to disable Rack Local Write Preference to avoid 2 issues - Whole Rack 
> Maintenance without risk of only 1 remaining replica, and avoid Major Storage 
> Imbalance across DataNodes caused by uneven spread of Datanodes across Racks
> ---
>
> Key: HDFS-13739
> URL: https://issues.apache.org/jira/browse/HDFS-13739
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Current HDFS write pattern of "local node, rack local node, other rack node" 
> is good for most purposes but when there is an uneven layout of datanodes 
> across racks it can cause major storage imbalance across nodes with some 
> nodes filling up and others being half empty.
> I have observed this on a cluster where half the nodes were 85% full and the 
> other half were only 50% full.
> Rack layouts like the following illustrate this - the nodes in the same rack 
> will only choose to send half their block replicas to each other, so they 
> will fill up first, while other nodes will receive far fewer replica blocks:
> {code:java}
> NumNodes - Rack 
> 2 - rack 1
> 2 - rack 2
> 1 - rack 3
> 1 - rack 4 
> 1 - rack 5
> 1 - rack 6{code}
> In this case if I reduce the number of replicas to 2 then I get an almost 
> perfect spread of blocks across all datanodes because HDFS has no choice but 
> to maintain the only 2nd replica on a different rack. If I increase the 
> replicas back to 3 it goes back to 85% on half the nodes and 50% on the other 
> half, because the extra replicas choose to replicate only to rack local nodes.
>  Why not just run the HDFS balancer to fix it you might say? This is a 
> heavily loaded HBase cluster - aside from destroying HBase's data locality 
> and performance by moving blocks out from underneath RegionServers - as soon 
> as an HBase major compaction occurs (at least weekly), all blocks will get 
> re-written by HBase and the HDFS client will again write to local node, rack 
> local node, other rack node, resulting in the same storage imbalance. Hence 
> this cannot be solved by running HDFS balancer on HBase clusters 
> - or for any application sitting on top of HDFS that has any HDFS block churn.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13739) Option to disable Rack Local Write Preference to avoid Major Storage Imbalance across DataNodes caused by uneven spread of Datanodes across Racks

2018-07-17 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546265#comment-16546265
 ] 

Hari Sekhon commented on HDFS-13739:


There are a couple of other good examples of where data placement control is of 
benefit, which I am linking - this is definitely something major that can be 
improved upon in HDFS.

Perhaps this is something that needs to be solved once, more generically.
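
One generic direction (a sketch, not something validated here) is HDFS's pluggable block placement policy - the NameNode's placement class is configurable, and newer Hadoop releases ship a rack-fault-tolerant variant that spreads replicas across racks:
{code:java}
# Show which block placement policy the NameNode is configured with
# (the key may not be defined in hdfs-default.xml on older releases)
hdfs getconf -confKey dfs.block.replicator.classname

# Hypothetical hdfs-site.xml override - verify the class exists in your Hadoop version first:
#   dfs.block.replicator.classname =
#     org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant
{code}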

> Option to disable Rack Local Write Preference to avoid Major Storage 
> Imbalance across DataNodes caused by uneven spread of Datanodes across Racks
> -
>
> Key: HDFS-13739
> URL: https://issues.apache.org/jira/browse/HDFS-13739
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Current HDFS write pattern of "local node, rack local node, other rack node" 
> is good for most purposes but when there is an uneven layout of datanodes 
> across racks it can cause major storage imbalance across nodes with some 
> nodes filling up and others being half empty.
> I have observed this on a cluster where half the nodes were 85% full and the 
> other half were only 50% full.
> Rack layouts like the following illustrate this - the nodes in the same rack 
> will only choose to send half their block replicas to each other, so they 
> will fill up first, while other nodes will receive far fewer replica blocks:
> {code:java}
> NumNodes - Rack 
> 2 - rack 1
> 2 - rack 2
> 1 - rack 3
> 1 - rack 4 
> 1 - rack 5
> 1 - rack 6{code}
> In this case if I reduce the number of replicas to 2 then I get an almost 
> perfect spread of blocks across all datanodes because HDFS has no choice but 
> to maintain the only 2nd replica on a different rack. If I increase the 
> replicas back to 3 it goes back to 85% on half the nodes and 50% on the other 
> half, because the extra replicas choose to replicate only to rack local nodes.
>  Why not just run the HDFS balancer to fix it you might say? This is a 
> heavily loaded HBase cluster - aside from destroying HBase's data locality 
> and performance by moving blocks out from underneath RegionServers - as soon 
> as an HBase major compaction occurs (at least weekly), all blocks will get 
> re-written by HBase and the HDFS client will again write to local node, rack 
> local node, other rack node, resulting in the same storage imbalance. Hence 
> this cannot be solved by running HDFS balancer on HBase clusters 
> - or for any application sitting on top of HDFS that has any HDFS block churn.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-13739) Option to disable Rack Local Write Preference to avoid Major Storage Imbalance across DataNodes caused by uneven spread of Datanodes across Racks

2018-07-17 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-13739:
--

 Summary: Option to disable Rack Local Write Preference to avoid 
Major Storage Imbalance across DataNodes caused by uneven spread of Datanodes 
across Racks
 Key: HDFS-13739
 URL: https://issues.apache.org/jira/browse/HDFS-13739
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover, block placement, datanode, fs, 
hdfs, hdfs-client, namenode, nn, performance
Affects Versions: 2.7.3
 Environment: Hortonworks HDP 2.6
Reporter: Hari Sekhon


Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but when there is an uneven layout of datanodes across 
racks it can cause major storage imbalance across nodes with some nodes filling 
up and others being half empty.

I have observed this on a cluster where half the nodes were 85% full and the 
other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes because HDFS has no choice but to 
maintain the only 2nd replica on a different rack. If I increase the replicas 
back to 3 it goes back to 85% on half the nodes and 50% on the other half, 
because the extra replicas choose to replicate only to rack local nodes.


 Why not just run the HDFS balancer to fix it you might say? This is a heavily 
loaded HBase cluster - aside from destroying HBase's data locality and 
performance by moving blocks out from underneath RegionServers - as soon as an 
HBase major compaction occurs (at least weekly), all blocks will get re-written 
by HBase and the HDFS client will again write to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this 
cannot be solved by running HDFS balancer on HBase clusters - or for any 
application sitting on top of HDFS that has any HDFS block churn.
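
For completeness, the balancer invocation argued against above would look roughly like this (the threshold value is illustrative); on an HBase-heavy cluster its effect is undone at the next major compaction for the reasons given:
{code:java}
# Rebalance until every DataNode is within 10% of the average cluster utilisation
hdfs balancer -threshold 10
{code}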



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13724) Storage Tiering Show Paths with Policies applied

2018-07-09 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536726#comment-16536726
 ] 

Hari Sekhon edited comment on HDFS-13724 at 7/9/18 9:36 AM:


I would guess that something along the lines of this would be more intuitive to 
users:
{code:java}
hdfs storagepolicies -listPaths{code}
as that would be more in line with the already existing command, which lists just 
the policies without paths:
{code:java}
hdfs storagepolicies -listPolicies{code}
Alternatively it could be a switch to -listPolicies, such as (in a similar 
fashion to the hdfs fsck -files -blocks -locations switches):
{code:java}
hdfs storagepolicies -listPolicies -paths{code}
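
Purely as an illustration of the request, output along these lines (format invented here, not an existing feature) would close the gap:
{code:java}
hdfs storagepolicies -listPaths
/data/blah      COLD
/archive/2017   COLD
/hot/serving    ALL_SSD
{code}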
 

[~brahmareddy]
{code:java}
grep -i storagepolicy fsimage.xml{code}
returns no hits. I actually had a colleague double check this for me last week 
too, dumped all xml tags to sort uniq and there was no such tag or anything 
that looked related, but there are definitely storage policies applied as my 
hortonworks colleague who configured this one told me the path so I ran the 
following command which showed the policy:
{code:java}
hdfs storagepolicies -getStoragePolicy -path /data/blah
The storage policy of /data/blah:
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], 
replicationFallbacks=[]}
{code}
It was actually that discussion with him that made me realise that if he wasn't 
around I wouldn't have been able to find the paths which had storage policies 
applied to them, something that hadn't occurred to me the last time I was 
configuring this myself.
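
For context, this is roughly how a policy gets applied and verified today (illustrative path - the gap this ticket describes is that nothing enumerates such paths afterwards):
{code:java}
# Apply a policy to a directory
hdfs storagepolicies -setStoragePolicy -path /data/blah -policy COLD

# Verify a single, already-known path
hdfs storagepolicies -getStoragePolicy -path /data/blah

# Migrate existing blocks to match the policy
hdfs mover -p /data/blah
{code}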


was (Author: harisekhon):
I would guess that something along the lines of this would be more intuitive to 
users:
{code:java}
hdfs storagepolicies -listPaths{code}
as that would be more in line with the already existing command, which lists just 
the policies without paths:
{code:java}
hdfs storagepolicies -listPolicies{code}
Alternatively it could be a switch to -listPolicies, such as (in a similar 
fashion to the hdfs fsck -files -blocks -locations switches):
{code:java}
hdfs storagepolicies -listPolicies -paths{code}
 

[~brahmareddy]
{code:java}
grep -i storagepolicy fsimage.xml{code}
returns no hits. I actually had a colleague double check this for me last week 
too, dumped all xml tags to sort uniq and there was no such tag or anything 
that looked related, but there are definitely storage policies applied as my 
hortonworks colleague who configured this one told me the path and the 
following command returns the policy:
{code:java}
hdfs storagepolicies -getStoragePolicy -path /data/blah
The storage policy of /data/blah:
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], 
replicationFallbacks=[]}
{code}

> Storage Tiering Show Paths with Policies applied
> 
>
> Key: HDFS-13724
> URL: https://issues.apache.org/jira/browse/HDFS-13724
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Hari Sekhon
>Assignee: Yuanbo Liu
>Priority: Major
>
> Improvement Request to add an hdfs storagepolicies command to find paths for 
> which storage policies have been applied.
> Right now you must explicitly query a single directory to get its policy to 
> determine if one has been applied, but if another hadoop admin has configured 
> policies on anything but trivially obvious paths such as /archive then there 
> is no way to find which paths have policies applied to them other than by 
> querying every single directory and subdirectory one by one which might 
> potentially have a policy, eg:
> {code:java}
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir2
> hdfs storagepolicies -getStoragePolicy -path /dir3
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
> ...{code}
> In my current environment for example, a policy was configured for /data/blah 
> which doesn't show when trying
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data{code}
> and I had no way of knowing that I had to do:
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
> other than trial and error of trying every directory and every subdirectory 
> in hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13724) Storage Tiering Show Paths with Policies applied

2018-07-09 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536726#comment-16536726
 ] 

Hari Sekhon edited comment on HDFS-13724 at 7/9/18 9:34 AM:


I would guess that something along the lines of this would be more intuitive to 
users:
{code:java}
hdfs storagepolicies -listPaths{code}
as that would be more in line with the already existing command, which lists just 
the policies without paths:
{code:java}
hdfs storagepolicies -listPolicies{code}
Alternatively it could be a switch to -listPolicies, such as (in a similar 
fashion to the hdfs fsck -files -blocks -locations switches):
{code:java}
hdfs storagepolicies -listPolicies -paths{code}
 

[~brahmareddy]
{code:java}
grep -i storagepolicy fsimage.xml{code}
returns no hits. I actually had a colleague double check this for me last week 
too, dumped all xml tags to sort uniq and there was no such tag or anything 
that looked related, but there are definitely storage policies applied as my 
hortonworks colleague who configured this one told me the path and the 
following command returns the policy:
{code:java}
hdfs storagepolicies -getStoragePolicy -path /data/blah
The storage policy of /data/blah:
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], 
replicationFallbacks=[]}
{code}


was (Author: harisekhon):
I would guess that something along the lines of this would be more intuitive to 
users:
{code:java}
hdfs storagepolicies -listPaths{code}
as that would be more in line with the already existing command, which lists just 
the policies without paths:
{code:java}
hdfs storagepolicies -listPolicies{code}
Alternatively it could be a switch to -listPolicies, such as (in a similar 
fashion to the hdfs fsck -files -blocks -locations switches):
{code:java}
hdfs storagepolicies -listPolicies -paths{code}
 

[~brahmareddy] grep -i storagepolicy fsimage.xml returns no hits. I actually 
had a colleague double check this for me last week too, dumped all xml tags to 
sort uniq and there was no such tag or anything that looked related, but there 
are definitely storage policies applied as my hortonworks colleague who 
configured this one told me the path and the following returns the policy
{code:java}
hdfs storagepolicies -getStoragePolicy -path /data/blah
The storage policy of /data/blah:
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], 
replicationFallbacks=[]}
{code}

> Storage Tiering Show Paths with Policies applied
> 
>
> Key: HDFS-13724
> URL: https://issues.apache.org/jira/browse/HDFS-13724
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Hari Sekhon
>Assignee: Yuanbo Liu
>Priority: Major
>
> Improvement Request to add an hdfs storagepolicies command to find paths for 
> which storage policies have been applied.
> Right now you must explicitly query a single directory to get its policy to 
> determine if one has been applied, but if another hadoop admin has configured 
> policies on anything but trivially obvious paths such as /archive then there 
> is no way to find which paths have policies applied to them other than by 
> querying every single directory and subdirectory one by one which might 
> potentially have a policy, eg:
> {code:java}
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir2
> hdfs storagepolicies -getStoragePolicy -path /dir3
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
> ...{code}
> In my current environment for example, a policy was configured for /data/blah 
> which doesn't show when trying
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data{code}
> and I had no way of knowing that I had to do:
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
> other than trial and error of trying every directory and every subdirectory 
> in hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13724) Storage Tiering Show Paths with Policies applied

2018-07-09 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536726#comment-16536726
 ] 

Hari Sekhon commented on HDFS-13724:


I would guess that something along the lines of this would be more intuitive to 
users:
{code:java}
hdfs storagepolicies -listPaths{code}
as that would be more in line with the already existing command, which lists just 
the policies without paths:
{code:java}
hdfs storagepolicies -listPolicies{code}
Alternatively it could be a switch to -listPolicies, such as (in a similar 
fashion to the hdfs fsck -files -blocks -locations switches):
{code:java}
hdfs storagepolicies -listPolicies -paths{code}
 

[~brahmareddy] grep -i storagepolicy fsimage.xml returns no hits. I actually 
had a colleague double check this for me last week too, dumped all xml tags to 
sort uniq and there was no such tag or anything that looked related, but there 
are definitely storage policies applied as my hortonworks colleague who 
configured this one told me the path and the following returns the policy
{code:java}
hdfs storagepolicies -getStoragePolicy -path /data/blah
The storage policy of /data/blah:
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], 
replicationFallbacks=[]}
{code}

> Storage Tiering Show Paths with Policies applied
> 
>
> Key: HDFS-13724
> URL: https://issues.apache.org/jira/browse/HDFS-13724
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Hari Sekhon
>Assignee: Yuanbo Liu
>Priority: Major
>
> Improvement Request to add an hdfs storagepolicies command to find paths for 
> which storage policies have been applied.
> Right now you must explicitly query a single directory to get its policy to 
> determine if one has been applied, but if another hadoop admin has configured 
> policies on anything but trivially obvious paths such as /archive then there 
> is no way to find which paths have policies applied to them other than by 
> querying every single directory and subdirectory one by one which might 
> potentially have a policy, eg:
> {code:java}
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir2
> hdfs storagepolicies -getStoragePolicy -path /dir3
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
> ...{code}
> In my current environment for example, a policy was configured for /data/blah 
> which doesn't show when trying
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data{code}
> and I had no way of knowing that I had to do:
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
> other than trial and error of trying every directory and every subdirectory 
> in hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13724) Storage Tiering Show Paths with Policies applied

2018-07-06 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534859#comment-16534859
 ] 

Hari Sekhon edited comment on HDFS-13724 at 7/6/18 2:26 PM:


I tried a workaround of dumping the fsimage to xml and grepping for info:
{code:java}
su - hdfs
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
hdfs dfsadmin -fetchImage .
# this step might take a long time on big clusters (eg. 20 mins for 12GB 
fsimage.xml result file from a moderate 600TB cluster)
hadoop oiv -i $(ls -tr fsimage_* | tail -n1) -p XML -o fsimage.xml
grep ... fsimage.xml{code}
but I can't find anything relating to 'policy' or the name of our storage 
policy or the directory I know it's applied to.
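
For anyone repeating this, one way to survey which tags the OIV XML dump actually contains is a simple one-liner (a sketch; expect it to be slow on a multi-GB fsimage.xml):
{code:java}
# List distinct opening tags in the OIV XML dump, most frequent first
grep -o '<[^/!][^ >]*' fsimage.xml | sort | uniq -c | sort -rn | head -50
{code}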


was (Author: harisekhon):
I tried a workaround for now: do the following as the hdfs superuser - dump the 
fsimage, convert it to XML and then grep for the tiering path info:
{code:java}
su - hdfs
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
hdfs dfsadmin -fetchImage .
# this step might take a long time on big clusters (eg. 20 mins for 12GB 
fsimage.xml result file from a moderate 600TB cluster)
hadoop oiv -i $(ls -tr fsimage_* | tail -n1) -p XML -o fsimage.xml
grep ... fsimage.xml{code}
but I can't find anything relating to 'policy' or the name of our storage 
policy or the directory I know it's applied to.

> Storage Tiering Show Paths with Policies applied
> 
>
> Key: HDFS-13724
> URL: https://issues.apache.org/jira/browse/HDFS-13724
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to add an hdfs storagepolicies command to find paths for 
> which storage policies have been applied.
> Right now you must explicitly query a single directory to get its policy to 
> determine if one has been applied, but if another hadoop admin has configured 
> policies on anything but trivially obvious paths such as /archive then there 
> is no way to find which paths have policies applied to them other than by 
> querying every single directory and subdirectory one by one which might 
> potentially have a policy, eg:
> {code:java}
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir2
> hdfs storagepolicies -getStoragePolicy -path /dir3
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
> ...{code}
> In my current environment for example, a policy was configured for /data/blah 
> which doesn't show when trying
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data{code}
> and I had no way of knowing that I had to do:
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
> other than trial and error of trying every directory and every subdirectory 
> in hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13724) Storage Tiering Show Paths with Policies applied

2018-07-06 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534859#comment-16534859
 ] 

Hari Sekhon edited comment on HDFS-13724 at 7/6/18 2:26 PM:


I tried a workaround for now: as the hdfs superuser, dump the fsimage, convert 
it to XML and then grep for the tiering path info:
{code:java}
su - hdfs
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
hdfs dfsadmin -fetchImage .
# this step might take a long time on big clusters (e.g. 20 mins for a 12GB fsimage.xml from a moderate 600TB cluster)
hadoop oiv -i $(ls -tr fsimage_* | tail -n1) -p XML -o fsimage.xml
grep ... fsimage.xml{code}
but I can't find anything relating to 'policy' or the name of our storage 
policy or the directory I know it's applied to.


was (Author: harisekhon):
I tried a workaround for now: as the hdfs superuser, dump the fsimage, convert 
it to XML and then grep for the tiering path info:
{code:java}
su - hdfs
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
hdfs dfsadmin -fetchImage .
# this step might take a long time on big clusters (e.g. 20 mins for a 12GB fsimage.xml from a moderate 600TB cluster)
hadoop oiv -i $(ls -tr fsimage_* | tail -n1) -p XML -o fsimage.xml
grep ...{code}
but I can't find anything relating to 'policy' or the name of our storage 
policy or the directory I know it's applied to.

> Storage Tiering Show Paths with Policies applied
> 
>
> Key: HDFS-13724
> URL: https://issues.apache.org/jira/browse/HDFS-13724
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to add an hdfs storagepolicies command to find paths for 
> which storage policies have been applied.
> Right now you must explicitly query a single directory to get its policy to 
> determine if one has been applied, but if another hadoop admin has configured 
> policies on anything but trivially obvious paths such as /archive then there 
> is no way to find which paths have policies applied to them other than by 
> querying every single directory and subdirectory one by one which might 
> potentially have a policy, eg:
> {code:java}
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir2
> hdfs storagepolicies -getStoragePolicy -path /dir3
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
> ...{code}
> In my current environment for example, a policy was configured for /data/blah 
> which doesn't show when trying
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data{code}
> and I had no way of knowing that I had to do:
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
> other than trial and error of trying every directory and every subdirectory 
> in hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13724) Storage Tiering Show Paths with Policies applied

2018-07-06 Thread Hari Sekhon (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534859#comment-16534859
 ] 

Hari Sekhon commented on HDFS-13724:


I tried a workaround for now: as the hdfs superuser, dump the fsimage, convert 
it to XML and then grep for the tiering path info:
{code:java}
su - hdfs
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs
hdfs dfsadmin -fetchImage .
# this step might take a long time on big clusters (e.g. 20 mins for a 12GB fsimage.xml from a moderate 600TB cluster)
hadoop oiv -i $(ls -tr fsimage_* | tail -n1) -p XML -o fsimage.xml
grep ...{code}
but I can't find anything relating to 'policy' or the name of our storage 
policy or the directory I know it's applied to.

> Storage Tiering Show Paths with Policies applied
> 
>
> Key: HDFS-13724
> URL: https://issues.apache.org/jira/browse/HDFS-13724
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to add an hdfs storagepolicies command to find paths for 
> which storage policies have been applied.
> Right now you must explicitly query a single directory to get its policy to 
> determine if one has been applied, but if another hadoop admin has configured 
> policies on anything but trivially obvious paths such as /archive then there 
> is no way to find which paths have policies applied to them other than by 
> querying every single directory and subdirectory one by one which might 
> potentially have a policy, eg:
> {code:java}
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir2
> hdfs storagepolicies -getStoragePolicy -path /dir3
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
> ...{code}
> In my current environment for example, a policy was configured for /data/blah 
> which doesn't show when trying
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data{code}
> and I had no way of knowing that I had to do:
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
> other than trial and error of trying every directory and every subdirectory 
> in hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-13724) Storage Tiering Show Policies Paths

2018-07-06 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-13724:
--

 Summary: Storage Tiering Show Policies Paths
 Key: HDFS-13724
 URL: https://issues.apache.org/jira/browse/HDFS-13724
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Hari Sekhon


Improvement Request to add an hdfs storagepolicies command to find paths for 
which storage policies have been applied.

Right now you must explicitly query a single directory to determine whether a 
policy has been applied to it. If another Hadoop admin has configured policies 
on anything but trivially obvious paths such as /archive, the only way to find 
which paths have policies applied is to query every single directory and 
subdirectory that might potentially have one, e.g.:
{code:java}
hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
hdfs storagepolicies -getStoragePolicy -path /dir2
hdfs storagepolicies -getStoragePolicy -path /dir3
hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
...
hdfs storagepolicies -getStoragePolicy -path /dirN
...
hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
...{code}
In my current environment, for example, a policy was configured for /data/blah, 
which does not show up when running
{code:java}
 hdfs storagepolicies -getStoragePolicy -path /data{code}
and I had no way of knowing that I had to do:
{code:java}
 hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
other than trial and error across every directory and subdirectory in HDFS.
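
A brute-force workaround is possible with just the existing CLI, as in the 
sketch below (assuming directory paths contain no spaces), but it issues one 
RPC per directory and is slow on large namespaces, which is why a built-in 
listing command would help:
{code:java}
# Brute-force sketch: list every directory, then query each one's policy.
# Assumes paths without spaces; O(number of directories) calls, so slow.
hdfs dfs -ls -R / | awk '$1 ~ /^d/ {print $NF}' | while read -r dir; do
  hdfs storagepolicies -getStoragePolicy -path "$dir"
done{code}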



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13724) Storage Tiering Show Paths with Policies applied

2018-07-06 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13724:
---
Summary: Storage Tiering Show Paths with Policies applied  (was: Storage 
Tiering Show Policies Paths)

> Storage Tiering Show Paths with Policies applied
> 
>
> Key: HDFS-13724
> URL: https://issues.apache.org/jira/browse/HDFS-13724
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request to add an hdfs storagepolicies command to find paths for 
> which storage policies have been applied.
> Right now you must explicitly query a single directory to get its policy to 
> determine if one has been applied, but if another hadoop admin has configured 
> policies on anything but trivially obvious paths such as /archive then there 
> is no way to find which paths have policies applied to them other than by 
> querying every single directory and subdirectory one by one which might 
> potentially have a policy, eg:
> {code:java}
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir2
> hdfs storagepolicies -getStoragePolicy -path /dir3
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir1
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir2
> hdfs storagepolicies -getStoragePolicy -path /dir3/subdir3
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN
> ...
> hdfs storagepolicies -getStoragePolicy -path /dirN/subdirN/subsubdirN
> ...{code}
> In my current environment for example, a policy was configured for /data/blah 
> which doesn't show when trying
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data{code}
> and I had no way of knowing that I had to do:
> {code:java}
>  hdfs storagepolicies -getStoragePolicy -path /data/blah{code}
> other than trial and error of trying every directory and every subdirectory 
> in hdfs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13720) HDFS dataset Anti-Affinity Block Placement across all DataNodes for data local task optimization (improve Spark executor utilization & performance)

2018-07-05 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13720:
---
Description: 
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

Methods to be implemented:
 # dfs balancer command switch plus a target path of files / directories 
containing the blocks to be rebalanced
 # dfs client side write flag

Both options should proactively (re)distribute the given data set as evenly as 
possible across all datanodes in the cluster.

See the following Spark issue, which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors were doing literally nothing while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to produce 30,000 tasks, was churning through slowly on only 
a subset of the available executors. The workaround in the end was to disable 
data-local tasks in Spark, but if everyone did that the bottleneck would go 
back to being the network, and it undermines Hadoop's first premise of moving 
the compute to the data rather than the data to the compute. For 
performance-critical jobs, returning containers to Yarn because they cannot 
find any data to execute on locally isn't a good idea either: users want the 
jobs to use all the resources available and allocated to the job, not just the 
resources on the subset of nodes that hold a given dataset, and they don't 
want to disable data-local task execution and pull half the blocks across the 
network just to make use of the other half of the nodes.

https://issues.apache.org/jira/browse/SPARK-24474

  was:
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

Methods to be implemented:
 # balancer command switch plus a target path of files / directories containing 
the blocks to be rebalanced
 # client side write flag

Both options should proactively (re)distribute the given data set as evenly as 
possible across all datanodes in the cluster.

See this following Spark issue which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors doing literally nothing, while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to have 30,000 tasks churning though slowly on only a subset 
of the available executors. The workaround in the end was to disable data local 
tasks in Spark, but if everyone did that the bottleneck would go back to being 
the network and it undermines Hadoop's first premise of don't move the data to 
compute. For performance critical jobs, returning containers to Yarn because 
they cannot find any data to execute on locally isn't a good idea either, they 
want the jobs to use all the resources available and allocated to the job, not 
just the resources on a subset of nodes that hold a given dataset or disabling 
data local task execution to pull half the blocks across the network to make 
use of the other half of the nodes.

https://issues.apache.org/jira/browse/SPARK-24474


> HDFS dataset Anti-Affinity Block Placement across all DataNodes for data 
> local task optimization (improve Spark executor utilization & performance)
> ---
>
> Key: HDFS-13720
> URL: https://issues.apache.org/jira/browse/HDFS-13720
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, block placement, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request for Anti-Affinity Block Placement across datanodes such 
> that for a given data set the blocks are distributed evenly across all 
> available datanodes in order to improve task scheduling while maintaining 
> data locality.
> Methods to be implemented:
>  # dfs balancer command switch plus a target path of files / directories 
> containing the blocks to be rebalanced
>  # dfs client side write flag
> Both options should proactively (re)distribute the given data set as evenly 
> as possible across all datanodes in the cluster.
> See this following Spark issue which causes massive under-utilisation across 
> jobs. Only 30-50% of executor cores were being used for tasks due to data 
> locality targeting. Many executors doing literally nothing, while holding 
> 

[jira] [Updated] (HDFS-13720) HDFS dataset Anti-Affinity Block Placement across all DataNodes for data local task optimization (improve Spark executor utilization & performance)

2018-07-05 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13720:
---
Description: 
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

Methods to be implemented:
 # balancer command switch plus a target path of files / directories containing 
the blocks to be rebalanced
 # client side write flag

Both options should proactively (re)distribute the given data set as evenly as 
possible across all datanodes in the cluster.

See the following Spark issue, which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors were doing literally nothing while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to produce 30,000 tasks, was churning through slowly on only 
a subset of the available executors. The workaround in the end was to disable 
data-local tasks in Spark, but if everyone did that the bottleneck would go 
back to being the network, and it undermines Hadoop's first premise of moving 
the compute to the data rather than the data to the compute. For 
performance-critical jobs, returning containers to Yarn because they cannot 
find any data to execute on locally isn't a good idea either: users want the 
jobs to use all the resources available and allocated to the job, not just the 
resources on the subset of nodes that hold a given dataset, and they don't 
want to disable data-local task execution and pull half the blocks across the 
network just to make use of the other half of the nodes.

https://issues.apache.org/jira/browse/SPARK-24474

  was:
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

Methods to be implemented:
 # balancer command switch combined with a target path to files or directories
 # client side write flag

Both options should proactively (re)distribute the given data set as evenly as 
possible across all datanodes in the cluster.

See this following Spark issue which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors doing literally nothing, while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to have 30,000 tasks churning though slowly on only a subset 
of the available executors. The workaround in the end was to disable data local 
tasks in Spark, but if everyone did that the bottleneck would go back to being 
the network and it undermines Hadoop's first premise of don't move the data to 
compute. For performance critical jobs, returning containers to Yarn because 
they cannot find any data to execute on locally isn't a good idea either, they 
want the jobs to use all the resources available and allocated to the job, not 
just the resources on a subset of nodes that hold a given dataset or disabling 
data local task execution to pull half the blocks across the network to make 
use of the other half of the nodes.

https://issues.apache.org/jira/browse/SPARK-24474


> HDFS dataset Anti-Affinity Block Placement across all DataNodes for data 
> local task optimization (improve Spark executor utilization & performance)
> ---
>
> Key: HDFS-13720
> URL: https://issues.apache.org/jira/browse/HDFS-13720
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, block placement, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request for Anti-Affinity Block Placement across datanodes such 
> that for a given data set the blocks are distributed evenly across all 
> available datanodes in order to improve task scheduling while maintaining 
> data locality.
> Methods to be implemented:
>  # balancer command switch plus a target path of files / directories 
> containing the blocks to be rebalanced
>  # client side write flag
> Both options should proactively (re)distribute the given data set as evenly 
> as possible across all datanodes in the cluster.
> See this following Spark issue which causes massive under-utilisation across 
> jobs. Only 30-50% of executor cores were being used for tasks due to data 
> locality targeting. Many executors doing literally nothing, while holding 
> significant cluster resources, because the data set, 

[jira] [Updated] (HDFS-13720) HDFS dataset Anti-Affinity Block Placement across all DataNodes for data local task optimization (improve Spark executor utilization & performance)

2018-07-05 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13720:
---
Summary: HDFS dataset Anti-Affinity Block Placement across all DataNodes 
for data local task optimization (improve Spark executor utilization & 
performance)  (was: HDFS dataset Anti-Affinity Block Placement across DataNodes 
for data local task optimization (improve Spark executor utilization & 
performance))

> HDFS dataset Anti-Affinity Block Placement across all DataNodes for data 
> local task optimization (improve Spark executor utilization & performance)
> ---
>
> Key: HDFS-13720
> URL: https://issues.apache.org/jira/browse/HDFS-13720
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, block placement, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request for Anti-Affinity Block Placement across datanodes such 
> that for a given data set the blocks are distributed evenly across all 
> available datanodes in order to improve task scheduling while maintaining 
> data locality.
> Methods to be implemented:
>  # balancer command switch combined with a target path to files or directories
>  # client side write flag
> Both options should proactively (re)distribute the given data set as evenly 
> as possible across all datanodes in the cluster.
> See this following Spark issue which causes massive under-utilisation across 
> jobs. Only 30-50% of executor cores were being used for tasks due to data 
> locality targeting. Many executors doing literally nothing, while holding 
> significant cluster resources, because the data set, which in at least one 
> job was large enough to have 30,000 tasks churning though slowly on only a 
> subset of the available executors. The workaround in the end was to disable 
> data local tasks in Spark, but if everyone did that the bottleneck would go 
> back to being the network and it undermines Hadoop's first premise of don't 
> move the data to compute. For performance critical jobs, returning containers 
> to Yarn because they cannot find any data to execute on locally isn't a good 
> idea either, they want the jobs to use all the resources available and 
> allocated to the job, not just the resources on a subset of nodes that hold a 
> given dataset or disabling data local task execution to pull half the blocks 
> across the network to make use of the other half of the nodes.
> https://issues.apache.org/jira/browse/SPARK-24474



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13720) HDFS dataset Anti-Affinity Block Placement across DataNodes for data local task optimization (improve Spark executor utilization & performance)

2018-07-05 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13720:
---
Description: 
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

Methods to be implemented:
 # balancer command switch combined with a target path to files or directories
 # client side write flag

Both options should proactively (re)distribute the given data set as evenly as 
possible across all datanodes in the cluster.

See the following Spark issue, which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors were doing literally nothing while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to produce 30,000 tasks, was churning through slowly on only 
a subset of the available executors. The workaround in the end was to disable 
data-local tasks in Spark, but if everyone did that the bottleneck would go 
back to being the network, and it undermines Hadoop's first premise of moving 
the compute to the data rather than the data to the compute. For 
performance-critical jobs, returning containers to Yarn because they cannot 
find any data to execute on locally isn't a good idea either: users want the 
jobs to use all the resources available and allocated to the job, not just the 
resources on the subset of nodes that hold a given dataset, and they don't 
want to disable data-local task execution and pull half the blocks across the 
network just to make use of the other half of the nodes.

https://issues.apache.org/jira/browse/SPARK-24474

  was:
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

Methods to be implemented:
 # balancer command switch combined with a target path to files or directories
 # client side write flag

Both options should proactively (re)distribute the given data set as evenly as 
possible across all datanodes in the cluster.

See this following Spark issue which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors doing literally nothing, while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to have 30,000 tasks churning though slowly on only a subset 
of the available executors. The workaround in the end was to disable data local 
tasks in Spark, but if everyone did that the bottleneck would go back to being 
the network and it undermines Hadoop's first premise of don't move the data to 
compute. For performance critical jobs, returning tasks to Yarn isn't a good 
idea either, they want the jobs to use all the resources available, not just 
the resources on a subset of nodes that hold a given dataset or pulling half 
the blocks across the network.

https://issues.apache.org/jira/browse/SPARK-24474


> HDFS dataset Anti-Affinity Block Placement across DataNodes for data local 
> task optimization (improve Spark executor utilization & performance)
> ---
>
> Key: HDFS-13720
> URL: https://issues.apache.org/jira/browse/HDFS-13720
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, block placement, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request for Anti-Affinity Block Placement across datanodes such 
> that for a given data set the blocks are distributed evenly across all 
> available datanodes in order to improve task scheduling while maintaining 
> data locality.
> Methods to be implemented:
>  # balancer command switch combined with a target path to files or directories
>  # client side write flag
> Both options should proactively (re)distribute the given data set as evenly 
> as possible across all datanodes in the cluster.
> See this following Spark issue which causes massive under-utilisation across 
> jobs. Only 30-50% of executor cores were being used for tasks due to data 
> locality targeting. Many executors doing literally nothing, while holding 
> significant cluster resources, because the data set, which in at least one 
> job was large enough to have 30,000 tasks churning though slowly on only a 
> subset of the available executors. The workaround in the end was to disable 
> data local tasks in Spark, but if everyone did that the 

[jira] [Updated] (HDFS-13720) HDFS dataset Anti-Affinity Block Placement across DataNodes for data local task optimization (improve Spark executor utilization & performance)

2018-07-05 Thread Hari Sekhon (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-13720:
---
Description: 
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

Methods to be implemented:
 # balancer command switch combined with a target path to files or directories
 # client side write flag

Both options should proactively (re)distribute the given data set as evenly as 
possible across all datanodes in the cluster.

See the following Spark issue, which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors were doing literally nothing while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to produce 30,000 tasks, was churning through slowly on only 
a subset of the available executors. The workaround in the end was to disable 
data-local tasks in Spark, but if everyone did that the bottleneck would go 
back to being the network, and it undermines Hadoop's first premise of moving 
the compute to the data rather than the data to the compute. For 
performance-critical jobs, returning tasks to Yarn isn't a good idea either: 
users want the jobs to use all the resources available, not just the resources 
on a subset of nodes that hold a given dataset, and they don't want to pull 
half the blocks across the network.

https://issues.apache.org/jira/browse/SPARK-24474

  was:
Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

This could be done via a client side write flag as well as via a balancer 
command switch combined with giving a target path to files or directories to 
redistributed as evenly as possible across all datanodes in the cluster.

See this following Spark issue which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors doing literally nothing, while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to have 30,000 tasks churning though slowly on only a subset 
of the available executors. The workaround in the end was to disable data local 
tasks in Spark, but if everyone did that the bottleneck would go back to being 
the network and it undermines Hadoop's first premise of don't move the data to 
compute. For performance critical jobs, returning tasks to Yarn isn't a good 
idea either, they want the jobs to use all the resources available, not just 
the resources on a subset of nodes that hold a given dataset or pulling half 
the blocks across the network.

https://issues.apache.org/jira/browse/SPARK-24474


> HDFS dataset Anti-Affinity Block Placement across DataNodes for data local 
> task optimization (improve Spark executor utilization & performance)
> ---
>
> Key: HDFS-13720
> URL: https://issues.apache.org/jira/browse/HDFS-13720
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, block placement, performance
>Affects Versions: 2.7.3
> Environment: Hortonworks HDP 2.6
>Reporter: Hari Sekhon
>Priority: Major
>
> Improvement Request for Anti-Affinity Block Placement across datanodes such 
> that for a given data set the blocks are distributed evenly across all 
> available datanodes in order to improve task scheduling while maintaining 
> data locality.
> Methods to be implemented:
>  # balancer command switch combined with a target path to files or directories
>  # client side write flag
> Both options should proactively (re)distribute the given data set as evenly 
> as possible across all datanodes in the cluster.
> See this following Spark issue which causes massive under-utilisation across 
> jobs. Only 30-50% of executor cores were being used for tasks due to data 
> locality targeting. Many executors doing literally nothing, while holding 
> significant cluster resources, because the data set, which in at least one 
> job was large enough to have 30,000 tasks churning though slowly on only a 
> subset of the available executors. The workaround in the end was to disable 
> data local tasks in Spark, but if everyone did that the bottleneck would go 
> back to being the network and it undermines Hadoop's first premise of don't 
> move the data to compute. For performance critical jobs, returning tasks to 
> Yarn isn't a good 

[jira] [Created] (HDFS-13720) HDFS dataset Anti-Affinity Block Placement across DataNodes for data local task optimization (improve Spark executor utilization & performance)

2018-07-05 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-13720:
--

 Summary: HDFS dataset Anti-Affinity Block Placement across 
DataNodes for data local task optimization (improve Spark executor utilization 
& performance)
 Key: HDFS-13720
 URL: https://issues.apache.org/jira/browse/HDFS-13720
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover, block placement, performance
Affects Versions: 2.7.3
 Environment: Hortonworks HDP 2.6
Reporter: Hari Sekhon


Improvement Request for Anti-Affinity Block Placement across datanodes such 
that for a given data set the blocks are distributed evenly across all 
available datanodes in order to improve task scheduling while maintaining data 
locality.

This could be done via a client side write flag as well as via a balancer 
command switch combined with giving a target path to files or directories to 
redistributed as evenly as possible across all datanodes in the cluster.

See the following Spark issue, which causes massive under-utilisation across 
jobs. Only 30-50% of executor cores were being used for tasks due to data 
locality targeting. Many executors were doing literally nothing while holding 
significant cluster resources, because the data set, which in at least one job 
was large enough to produce 30,000 tasks, was churning through slowly on only 
a subset of the available executors. The workaround in the end was to disable 
data-local tasks in Spark, but if everyone did that the bottleneck would go 
back to being the network, and it undermines Hadoop's first premise of moving 
the compute to the data rather than the data to the compute. For 
performance-critical jobs, returning tasks to Yarn isn't a good idea either: 
users want the jobs to use all the resources available, not just the resources 
on a subset of nodes that hold a given dataset, and they don't want to pull 
half the blocks across the network.

https://issues.apache.org/jira/browse/SPARK-24474
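
For reference, the Spark-side workaround mentioned above boils down to 
reducing the locality wait; a sketch (class and jar names are placeholders):
{code:java}
# Spark-side workaround sketch: stop waiting for data-local executor slots.
# spark.locality.wait=0s trades network traffic for executor utilisation.
# Class and jar names below are placeholders.
spark-submit \
  --conf spark.locality.wait=0s \
  --class com.example.MyJob \
  myjob.jar{code}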



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-13312) NameNode High Availability ZooKeeper based discovery rather than explicit nn1,nn2 configs

2018-03-19 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-13312:
--

 Summary: NameNode High Availability ZooKeeper based discovery 
rather than explicit nn1,nn2 configs
 Key: HDFS-13312
 URL: https://issues.apache.org/jira/browse/HDFS-13312
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha, hdfs, namenode, nn
Affects Versions: 2.9.1
Reporter: Hari Sekhon


Improvement Request for HDFS NameNode HA to use ZooKeeper-based dynamic 
discovery rather than explicitly setting the NameNode addresses via nn1,nn2 in 
the configs.

One proprietary Hadoop vendor already uses ZK for Resource Manager HA discovery 
- it makes sense that the open source core should do this for both Yarn and 
HDFS.
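
For context, this is the static per-nameservice wiring that clients currently 
have to carry; a sketch using getconf (the nameservice "mycluster" and 
NameNode IDs nn1/nn2 are illustrative):
{code:java}
# Today's explicit HA wiring, read back via getconf.
# "mycluster", "nn1" and "nn2" are placeholder names.
hdfs getconf -confKey dfs.nameservices                        # mycluster
hdfs getconf -confKey dfs.ha.namenodes.mycluster              # nn1,nn2
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1  # host1:8020
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn2  # host2:8020
hdfs getconf -confKey ha.zookeeper.quorum  # ZK already required for automatic failover{code}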



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2017-08-02 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108851#comment-16108851
 ] 

Hari Sekhon edited comment on HDFS-8298 at 8/2/17 2:30 PM:
---

[~qwertymaniac] I understand that this is current design - but this doesn't 
mean it couldn't be improved - hence why I filed this as an improvement and not 
a bug, although in common sense terms it is a bug of design - traditional High 
Availability solutions don't shut down permanently for temporary network 
outages.

The specific idea for improvement is simply to drop to Standby mode, allow no 
more edits and then retry every 30 secs (configurable) to try to regain QJM 
quorum and re-promote one of the NameNodes to Active once quorum is 
re-established. If you can't do this because of the write behind log then at 
least only kill the Active Namenode and allow the standby to stay online and be 
promoted to Active once quorum is re-established. Ideally if necessary improve 
the Active NameNode to be able to discard the transactions that cannot be 
committed to the edits log without the quorum and then drop to standby read 
only mode without shutting down the whole process. I can't see any reason why 
this wouldn't be possible even if it required more code change to fix this 
behaviour.

In this case the edits logs would be protected from diverging and there is no 
reason not to keep the process alive as it then makes it possible to re-elect 
an active namenode once the quorum is re-established and would give more 
availability, which is really the point of HA.

Right now customers are working around a flawed design by restarting things 
whenever there is any minor temporary network interruption.

[~andrew.wang] I've just had another large customer encounter the same issue 
and of course they just started the cluster again, carry on and live with it - 
they don't even bother raising it to the vendors to debug it since it works 
again after a restart, but it's still broken behaviour. Even on site I only 
hear about these things in passing conversation. Temporary network problems are 
more common than you'd think as anybody who has been an industrial level 
networking specialist will know. The fact that today customers are simply 
restarting their clusters and living with it whenever this crops up doesn't 
make it uncommon, their system administrators simply don't understand the 
design enough to understand that this could have been improved. I've personally 
seen this more times than I've reported and I know other people don't even 
bother taking the time to report these things, either because they don't 
understand what could be improved or because they can't be bothered to use 
their time to help vendors improve their product, it's quicker to just start 
the cluster again, it works and they want to forget about it and move on.

Also consider weekend planned maintenance network outages, this has also 
happened to me before and there is no reason I should be coming in Monday 
mornings every few months to a cluster that is down because the design didn't 
get fixed (yes you could argue the network team should have notified us of 
quarterly maintenance windows, maybe they did and we missed the email or 
forgot, perhaps everybody should script cluster shutdown and startup around 
maintenance windows and have monitoring that tries to auto-restart the cluster 
if it's down - but this is all a plaster to the symptoms rather than a cure to 
the tech design - and other times it's not planned maintenance but actual 
unpredictable network faults).
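
A crude external check of the kind described above might look like the 
following sketch (NameNode IDs nn1/nn2 and the alert address are placeholders; 
remediation is deliberately left to the operator):
{code:java}
# Cron-able sketch: alert when neither NameNode reports "active".
# nn1/nn2 and the alert address are placeholders; remediation is manual.
active=0
for nn in nn1 nn2; do
  state=$(hdfs haadmin -getServiceState "$nn" 2>/dev/null)
  [ "$state" = "active" ] && active=1
done
if [ "$active" -eq 0 ]; then
  echo "No active NameNode - possible JournalNode quorum loss" \
    | mail -s "HDFS HA alert" ops@example.com
fi{code}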


was (Author: harisekhon):
[~qwertymaniac] I understand that this is current design - but this doesn't 
mean it couldn't be improved - hence why I filed this as an improvement and not 
a bug, although in common sense terms it is a bug of design - traditional High 
Availability solutions don't shut down permanently for temporary network 
outages.

The specific idea for improvement is simply to drop to Standby mode, allow no 
more edits and then retry every 30 secs (configurable) to try to regain QJM 
quorum and re-promote one of the NameNodes to Active once quorum is 
re-established. If you can't do this because of the write behind log then at 
least only kill the Active Namenode and allow the standby to stay online and be 
promoted to Active once quorum is re-established. Ideally if necessary improve 
the Active NameNode to be able to discard the transactions that cannot be 
committed to the edits log without the quorum and then drop to standby read 
only mode without shutting down the whole process. I can't see any reason why 
this wouldn't be possible even if it required more code change to fix this 
behaviour.

In this case the edits logs would be protected from diverging and there is no 
reason not to keep the process alive as it then makes it possible to re-elect 
an active namenode once the 

[jira] [Updated] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2017-08-01 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8298:
--
Environment: multiple clients, HDP 2.2, HDP 2.5, CDH etc

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode, qjm
>Affects Versions: 2.6.0, 2.7.3
> Environment: multiple clients, HDP 2.2, HDP 2.5, CDH etc
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
> respond.
> at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
> at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
> txid 54270281
> 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
> Exiting with status 1
> 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
> SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at /
> /{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2017-08-01 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8298:
--
Affects Version/s: 2.7.3

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode, qjm
>Affects Versions: 2.6.0, 2.7.3
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
> respond.
> at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
> at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
> txid 54270281
> 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
> Exiting with status 1
> 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
> SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at /
> /{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2017-08-01 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108851#comment-16108851
 ] 

Hari Sekhon commented on HDFS-8298:
---

[~qwertymaniac] I understand that this is current design - but this doesn't 
mean it couldn't be improved - hence why I filed this as an improvement and not 
a bug, although in common sense terms it is a bug of design - traditional High 
Availability solutions don't shut down permanently for temporary network 
outages.

The specific idea for improvement is simply to drop to Standby mode, allow no 
more edits and then retry every 30 secs (configurable) to try to regain QJM 
quorum and re-promote one of the NameNodes to Active once quorum is 
re-established. If you can't do this because of the write behind log then at 
least only kill the Active Namenode and allow the standby to stay online and be 
promoted to Active once quorum is re-established. Ideally if necessary improve 
the Active NameNode to be able to discard the transactions that cannot be 
committed to the edits log without the quorum and then drop to standby read 
only mode without shutting down the whole process. I can't see any reason why 
this wouldn't be possible even if it required more code change to fix this 
behaviour.

In this case the edits logs would be protected from diverging and there is no 
reason not to keep the process alive as it then makes it possible to re-elect 
an active namenode once the quorum is re-established and would give more 
availability, which is really the point of HA.

Right now customers are working around a flawed design by restarting things 
whenever there are any minor temporary network interruptions.

[~andrew.wang] I've just had another large customer encounter the same issue 
and of course they just started the cluster again, carry on and live with it - 
they don't even bother raising it to the vendors to debug it since it works 
again after a restart, but it's still broken behaviour. Even on site I only 
hear about these things in passing conversation. Temporary network problems are 
more common than you'd think as anybody who has been an industrial level 
networking specialist will know. The fact that today customers are simply 
restarting their clusters and living with it whenever this crops up doesn't 
make it uncommon, their system administrators simply don't understand the 
design enough to understand that this could have been improved. I've personally 
seen this more times than I've reported and I know other people don't even 
bother taking the time to report these things, either because they don't 
understand what could be improved or because they can't be bothered to use 
their time to help vendors improve their product, it's quicker to just start 
the cluster again, it works and they want to forget about it and move on.

Also consider weekend planned maintenance network outages, this has also 
happened to me before and there is no reason I should be coming in Monday 
mornings every few months to a cluster that is down because the design didn't 
get fixed (yes you could argue the network team should have notified us of 
quarterly maintenance windows, maybe they did and we missed the email or 
forgot, perhaps everybody should script cluster shutdown and startup around 
maintenance windows and have monitoring that tries to auto-restart the cluster 
if it's down - but this is all a plaster to the symptoms rather than a cure to 
the tech design - and other times it's not planned maintenance but actual 
unpredictable network faults).

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode, qjm
>Affects Versions: 2.6.0
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted 

[jira] [Reopened] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2017-08-01 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon reopened HDFS-8298:
---

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode, qjm
>Affects Versions: 2.6.0
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
> respond.
> at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
> at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
> txid 54270281
> 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
> Exiting with status 1
> 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
> SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at /
> /{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-2542) Transparent compression storage in HDFS

2017-05-11 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006648#comment-16006648
 ] 

Hari Sekhon commented on HDFS-2542:
---

I recall looking for this feature 2-3 years ago while at a large financial 
institution, and it's just come up again with another large financial client 
I'm working for right now.

I see I actually already upvoted this jira the last time I looked at it but 
there has been no movement on this in years.

Having transparent compression on a directory tree would be a very useful 
feature.

Is there any chance of this being implemented?

> Transparent compression storage in HDFS
> ---
>
> Key: HDFS-2542
> URL: https://issues.apache.org/jira/browse/HDFS-2542
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: jinglong.liujl
> Attachments: tranparent compress storage.docx
>
>
> As in HDFS-2115, we want to provide a mechanism to improve storage usage in 
> hdfs by compression. Different from HDFS-2115, this issue focuses on 
> compressed storage. The idea is outlined below:
> To do:
> 1. Compress cold data.
>    Cold data: after writing (or the last read), the data has not been touched 
> by anyone for a long time.
>    Hot data: after writing, many clients will read it, and maybe it'll be 
> deleted soon.
>    Because hot data compression is not cost-effective, we only compress cold 
> data.
>    In some cases some data in a file is accessed at high frequency while 
> other data in the same file is cold. To distinguish them, we compress at 
> block level.
> 2. Compress only data with a high compression ratio.
>    To tell a high compression ratio from a low one, we try compressing the 
> data; if the ratio is too low, we never compress it.
> 3. Forward compatibility.
>    After compression the data format on the datanode has changed, so old 
> clients cannot access it. To solve this, we provide a mechanism that 
> decompresses on the datanode.
> 4. Support random access and append.
>    As in HDFS-2115, random access can be supported by an index. We split the 
> data into fixed-length pieces before compression (we call these fixed-length 
> pieces "chunks"), and every chunk has an index entry.
>    For random access we seek to the nearest indexed chunk and read that chunk 
> to reach the precise position.
> 5. Async compression, so compression doesn't slow down running jobs.
>    In practice we found cluster CPU usage is not uniform: some clusters are 
> idle at night, others in the afternoon. Compression tasks should run at full 
> speed when the cluster is idle and at low speed when it is busy.
> Will do:
> 1. Client-specific codecs and support for compressed transmission.
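
To make the chunk/index idea above concrete, here is a minimal illustrative 
sketch (plain shell on local files; the 4 MB chunk size, gzip codec and file 
names are assumptions for the example, not the design proposed in this jira):
{code}
#!/usr/bin/env bash
# Sketch only: compress fixed-length chunks and keep an offset index so a reader
# can later seek to the nearest chunk instead of decompressing the whole file.
set -euo pipefail

input=data.bin
chunk_size=$((4 * 1024 * 1024))   # 4 MB chunks - arbitrary choice for the example

rm -f chunk.* "$input.idx" "$input.cgz"
split -b "$chunk_size" -d "$input" chunk.

offset=0
i=0
for c in chunk.*; do
    gzip -c "$c" > "$c.gz"
    # index line: chunk number, uncompressed start offset, compressed start offset
    echo "$i $((i * chunk_size)) $offset" >> "$input.idx"
    offset=$((offset + $(stat -c %s "$c.gz")))
    i=$((i + 1))
done
cat chunk.*.gz > "$input.cgz"

# Random read at uncompressed offset X: look up chunk X / chunk_size in the
# index, seek to its compressed offset in "$input.cgz" and gunzip that one chunk.
{code}
Concatenated gzip members can be decompressed independently from a member 
boundary, which is what makes a per-chunk offset index sufficient for random 
access.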



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11400) Automatic HDFS Home Directory Creation

2017-02-10 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861450#comment-15861450
 ] 

Hari Sekhon edited comment on HDFS-11400 at 2/10/17 4:05 PM:
-

[~aw]

bq. If I access a home dir as a privileged user (e.g., hdfs) then I'm not sure 
why there would be a validation made against an individual user's external 
existence.

That's not the use case - it only applies when an actual user tries to do 
something in hdfs and no home directory is detected for that same user. This 
does not apply to hdfs superuser operations at all - in fact, validating 
"against an external user's existence" when touching a home directory is a 
check in the wrong direction entirely.

This is more for jobs run by a user for whom a home dir wasn't set up. In 
large enterprises such users just pop up and start using the cluster: they sit 
in some other part of the enterprise that you never see but are added to an AD 
group that is allowed on the cluster - they could be new hires or just someone 
you've never met because it's a big company.

bq. Whoever is building this on a per client basis ...

Ever tried copying your pre-written code from your github or private machine 
to banks, government environments and large traditional enterprises where 
everything is firewalled off, the internet is blocked from server networks and 
nothing is allowed in or out? Write it again :-/ Most people in those types of 
places just have a dumb sheet that they have to follow for every single person 
who requests to use the cluster, because those users' jobs fail otherwise... 
they're lucky if somebody even scripts it for them.

Yes, it's only a couple of commands, but people in those types of environments 
don't know anything - it may be hard to appreciate how bad it is if you're used 
to working for tech startups with smart techies and little security - so you 
have to script it for them yet again so it happens behind the scenes.

bq. Also, doesn't the NN plugin system already give one a way to implement this 
feature without clogging up the rest of the code base?

If such a plugin is bundled and available in core hdfs and enabled with a 
simple config change then ok but otherwise that idea is Dead-on-Arrival in a 
large chunk of verticals which do not allow downloading and installing random 
things from the internet, which includes pretty much all banks in the world, 
government departments and large traditional enterprises.

FYI, in large environments account validation and group memberships are 
handled through internal request systems by people you never see. Hadoop 
administrators never touch those things beyond the initial setup of which 
groups are allowed on the cluster; from then onwards all new users, group 
memberships etc. are handled by Active Directory teams that you never see 
because they're in some other part of the large organization, possibly in 
different geographic locations.


was (Author: harisekhon):
bq. If I access a home dir as a privileged user (e.g., hdfs) then I'm not sure 
why there would be a validation made against an individual user's external 
existence.

That's not the use case - it's only when an actual user tries to do something 
in hdfs and there is no home directory detected for that same user - this does 
not apply to hdfs superuser operations at all - in fact validating "against an 
external user's existence" when touching a home directory is the check in the 
wrong direction entirely.

This is more for jobs run by a user for which a home dir wasn't set up (the 
users just pop up and start using the cluster in large enterprises as they're 
in some other part of the enterprise that you never see but are added in an AD 
group that is allowed on the cluster - they could be new guys or just someone 
you just never met because it's a big company).

bq. Whoever is building this on a per client basis ...

Ever tried copying your pre-written code from your github or private machine to 
Banks, government environments and large traditional enterprises where 
everything is firewalled off, the internet is blocked to server networks and 
nothing is allowed in or out? Write it again :-/ . Most people in those types 
of places just have a dumb sheet that they have to follow for every single 
person who requests to use the cluster as their jobs fail otherwise... they're 
lucky if somebody even scripts it for them.

Yes it's only a couple of commands but people in those types of environments 
don't know anything - which may be hard to understand how bad it is if you're 
used to working for tech startups with smart techies and little security - so 
you have to script it again for them to happen behind the scenes.

bq. Also, doesn't the NN plugin system already give one a way to implement this 
feature without clogging up the rest of the code base?

If such a plugin is bundled and available in core hdfs 

[jira] [Comment Edited] (HDFS-11400) Automatic HDFS Home Directory Creation

2017-02-10 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861456#comment-15861456
 ] 

Hari Sekhon edited comment on HDFS-11400 at 2/10/17 4:04 PM:
-

[~cheersyang]

bq. ambari, handles the creation of necessary HDFS dirs, that works nicely and 
gives admin full control of file system layout

Ambari does not handle auto-creation of user home directories - I've run it 
for several years for several clients at this point and customers still run 
into this little bugbear - otherwise I wouldn't have raised this ticket.

bq. Similarly on linux, user home dirs were not created automatically, that is 
admin's job.

When you SSH into a Linux system, /etc/skel is copied to instantiate 
/home/<username> if it doesn't already exist - this is where I got the idea 
from; it's been this way for many years and has been a widely useful paradigm 
for millions of people already.


was (Author: harisekhon):
bq. ambari, handles the creation of necessary HDFS dirs, that works nicely and 
gives admin full control of file system layout

Ambari does not handle auto creation of user home directories - I've run it for 
several years for several clients at this point and customers still run in to 
this little bug bear - otherwise I wouldn't have raised this ticket.

bq. Similarly on linux, user home dirs were not created automatically, that is 
admin's job.

When you SSH in to Linux systems /etc/skel is copied to instantiate 
/home/ if it doesn't already exist - this is where I got the idea 
from, it's been this way for many years and been a widely useful paradigm for 
millions of people already.

> Automatic HDFS Home Directory Creation
> --
>
> Key: HDFS-11400
> URL: https://issues.apache.org/jira/browse/HDFS-11400
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs, namenode
>Affects Versions: 2.7.1
> Environment: HDP 2.4.2
>Reporter: Hari Sekhon
>
> Feature Request to add automatic home directory creation for HDFS users when 
> they are first resolved by the NameNode if their home directory does not 
> already exist, using configurable umask defaulting to 027.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11400) Automatic HDFS Home Directory Creation

2017-02-10 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861456#comment-15861456
 ] 

Hari Sekhon commented on HDFS-11400:


bq. ambari, handles the creation of necessary HDFS dirs, that works nicely and 
gives admin full control of file system layout

Ambari does not handle auto-creation of user home directories - I've run it 
for several years for several clients at this point and customers still run 
into this little bugbear - otherwise I wouldn't have raised this ticket.

bq. Similarly on linux, user home dirs were not created automatically, that is 
admin's job.

When you SSH into a Linux system, /etc/skel is copied to instantiate 
/home/<username> if it doesn't already exist - this is where I got the idea 
from; it's been this way for many years and has been a widely useful paradigm 
for millions of people already.

> Automatic HDFS Home Directory Creation
> --
>
> Key: HDFS-11400
> URL: https://issues.apache.org/jira/browse/HDFS-11400
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs, namenode
>Affects Versions: 2.7.1
> Environment: HDP 2.4.2
>Reporter: Hari Sekhon
>
> Feature Request to add automatic home directory creation for HDFS users when 
> they are first resolved by the NameNode if their home directory does not 
> already exist, using configurable umask defaulting to 027.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11400) Automatic HDFS Home Directory Creation

2017-02-10 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861450#comment-15861450
 ] 

Hari Sekhon commented on HDFS-11400:


bq. If I access a home dir as a privileged user (e.g., hdfs) then I'm not sure 
why there would be a validation made against an individual user's external 
existence.

That's not the use case - it only applies when an actual user tries to do 
something in hdfs and no home directory is detected for that same user. This 
does not apply to hdfs superuser operations at all - in fact, validating 
"against an external user's existence" when touching a home directory is a 
check in the wrong direction entirely.

This is more for jobs run by a user for whom a home dir wasn't set up. In 
large enterprises such users just pop up and start using the cluster: they sit 
in some other part of the enterprise that you never see but are added to an AD 
group that is allowed on the cluster - they could be new hires or just someone 
you've never met because it's a big company.

bq. Whoever is building this on a per client basis ...

Ever tried copying your pre-written code from your github or private machine 
to banks, government environments and large traditional enterprises where 
everything is firewalled off, the internet is blocked from server networks and 
nothing is allowed in or out? Write it again :-/ Most people in those types of 
places just have a dumb sheet that they have to follow for every single person 
who requests to use the cluster, because those users' jobs fail otherwise... 
they're lucky if somebody even scripts it for them.

Yes, it's only a couple of commands, but people in those types of environments 
don't know anything - it may be hard to appreciate how bad it is if you're used 
to working for tech startups with smart techies and little security - so you 
have to script it for them yet again so it happens behind the scenes.

bq. Also, doesn't the NN plugin system already give one a way to implement this 
feature without clogging up the rest of the code base?

If such a plugin is bundled and available in core hdfs and enabled with a 
simple config change then ok but otherwise that idea is Dead-on-Arrival in a 
large chunk of verticals which do not allow downloading and installing random 
things from the internet, which includes pretty much all banks in the world, 
government departments and large traditional enterprises.

FYI, in large environments account validation and group memberships are 
handled through internal request systems by people you never see. Hadoop 
administrators never touch those things beyond the initial setup of which 
groups are allowed on the cluster; from then onwards all new users, group 
memberships etc. are handled by Active Directory teams that you never see 
because they're in some other part of the large organization, possibly in 
different geographic locations.

> Automatic HDFS Home Directory Creation
> --
>
> Key: HDFS-11400
> URL: https://issues.apache.org/jira/browse/HDFS-11400
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs, namenode
>Affects Versions: 2.7.1
> Environment: HDP 2.4.2
>Reporter: Hari Sekhon
>
> Feature Request to add automatic home directory creation for HDFS users when 
> they are first resolved by the NameNode if their home directory does not 
> already exist, using configurable umask defaulting to 027.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11400) Automatic HDFS Home Directory Creation

2017-02-10 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861287#comment-15861287
 ] 

Hari Sekhon commented on HDFS-11400:


[~aw] Good question. Where are such fake users coming from? Given NN resolves 
users from OS / Kerberos, this would mean the OS / Kerberos systems have 
already been compromised to have had fake users added?

Putting in a configurable user/group filter, so that home directories are only 
automatically created for a whitelisted regex of users/groups, could form a 
layer of protection. For example, in a cluster integrated with Active Directory 
which might have 20,000 users, you may only want 100 of those users actually 
using the Hadoop cluster. In practice, though, this filtering is usually 
already done at the OS level via SSSD etc.

Another layer of protection could be a cap on the number of enumerated users 
for which home directories would be automatically created, or on the number of 
home directories already in existence: if either is too high, e.g. 1000, log 
it and disable auto-creation until resolved, to prevent the memory explosion 
mentioned. Really the second idea (counting existing home directories before 
disabling auto-creation) is better, since this shouldn't be enumerating users 
at all but rather creating the home directory on the fly the first time a new 
user is seen on the cluster without a home directory.

How about these ideas?

This would stop various jobs from breaking when they try to put staging files 
etc. in home directories that don't exist because they haven't been manually 
created or scripted yet (in retrospect it seems silly for admins to keep 
writing scripts to do this for every client when it could be solved once and 
for all via NN logic).
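
For illustration, here is a minimal sketch of the kind of wrapper that keeps 
getting re-scripted per cluster (shell, run as the hdfs superuser; the 
/user/<username> layout, the whitelist regex and the 027-umask-to-750 
translation are assumptions for the example, not a spec for the NN feature):
{code}
#!/usr/bin/env bash
# Hypothetical admin workaround: create a missing HDFS home directory for a user.
# Assumes it runs as the hdfs superuser and the user already exists in the OS/AD.
set -euo pipefail

user="$1"
whitelist='^[a-z][a-z0-9._-]{2,31}$'   # example whitelist regex - an assumption

if ! [[ "$user" =~ $whitelist ]]; then
    echo "refusing to create home dir for non-whitelisted user: $user" >&2
    exit 1
fi

if ! hdfs dfs -test -d "/user/$user"; then
    hdfs dfs -mkdir -p "/user/$user"
    hdfs dfs -chown "$user" "/user/$user"
    hdfs dfs -chmod 750 "/user/$user"   # a 027 umask on a directory yields 750
fi
{code}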

> Automatic HDFS Home Directory Creation
> --
>
> Key: HDFS-11400
> URL: https://issues.apache.org/jira/browse/HDFS-11400
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: hdfs, namenode
>Affects Versions: 2.7.1
> Environment: HDP 2.4.2
>Reporter: Hari Sekhon
>
> Feature Request to add automatic home directory creation for HDFS users when 
> they are first resolved by the NameNode if their home directory does not 
> already exist, using configurable umask defaulting to 027.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-11400) Automatic HDFS Home Directory Creation

2017-02-09 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-11400:
--

 Summary: Automatic HDFS Home Directory Creation
 Key: HDFS-11400
 URL: https://issues.apache.org/jira/browse/HDFS-11400
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: hdfs, namenode
Affects Versions: 2.7.1
 Environment: HDP 2.4.2
Reporter: Hari Sekhon


Feature Request to add automatic home directory creation for HDFS users when 
they are first resolved by the NameNode if their home directory does not 
already exist, using configurable umask defaulting to 027.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2015-11-16 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006885#comment-15006885
 ] 

Hari Sekhon commented on HDFS-8298:
---

Hi [~qwertymaniac], I understand your points, but this is anti-resilience. I'd 
understand rolling back the last transaction if it is incomplete, but shutting 
down the whole NN and never retrying from the last safe transaction is broken 
behaviour from an operational standpoint and makes the HDFS HA setup more 
brittle than the original SPOF HDFS! If you're going to close this ticket, 
please raise the more specific improvements in separate tickets referencing 
this one for history, such as: 1. rollback of the incomplete transaction on 
loss of quorum, 2. transition to standby for later retries instead of shutdown 
on loss of quorum.
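
In the meantime operators are left with external watchdogs. Purely as an 
illustration (the JournalNode hostnames, port and restart command below are 
assumptions about a typical 2.x deployment, not a recommendation):
{code}
#!/usr/bin/env bash
# Hypothetical cron-run watchdog: if the local NameNode has exited (e.g. after a
# JournalNode quorum blip) and a JN quorum is reachable again, start it back up.
set -u

jns=(journalnode1 journalnode2 journalnode3)   # assumed hostnames
port=8485
reachable=0

for jn in "${jns[@]}"; do
    if (echo > "/dev/tcp/$jn/$port") 2>/dev/null; then
        reachable=$((reachable + 1))
    fi
done

if ! pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode' > /dev/null \
        && [ "$reachable" -gt $(( ${#jns[@]} / 2 )) ]; then
    "$HADOOP_HOME/sbin/hadoop-daemon.sh" start namenode
fi
{code}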

> HA: NameNode should not shut down completely without quorum, doesn't recover 
> from temporary network outages
> ---
>
> Key: HDFS-8298
> URL: https://issues.apache.org/jira/browse/HDFS-8298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, HDFS, namenode, qjm
>Affects Versions: 2.6.0
>Reporter: Hari Sekhon
>
> In an HDFS HA setup if there is a temporary problem with contacting journal 
> nodes (eg. network interruption), the NameNode shuts down entirely, when it 
> should instead go in to a standby mode so that it can stay online and retry 
> to achieve quorum later.
> If both NameNodes shut themselves off like this then even after the temporary 
> network outage is resolved, the entire cluster remains offline indefinitely 
> until operator intervention, whereas it could have self-repaired after 
> re-contacting the journalnodes and re-achieving quorum.
> {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
> (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
> required journal (JournalAndStre
> am(mgr=QJM to [:8485, :8485, :8485], stream=QuorumOutputStream 
> starting at txid 54270281))
> java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
> respond.
> at 
> org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
> at 
> org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
> at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
> at java.lang.Thread.run(Thread.java:745)
> 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
> (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
> txid 54270281
> 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
> Exiting with status 1
> 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
> SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at /
> /{code}
> Hari Sekhon
> http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8341) HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't move rest of other data blocks

2015-09-18 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805196#comment-14805196
 ] 

Hari Sekhon commented on HDFS-8341:
---

All I can say is what I observed: as per the original code paste, it looped on 
the same block number on each run and never got past it.

I've moved on since then and that storage tier was decommissioned anyway, so I 
don't have a way to reproduce it right now. Perhaps a new cluster with storage 
tiering and replication factor 1, where the block is intentionally corrupted, 
might be able to reproduce this.
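
A rough sketch of such a repro on a test cluster (shell; the test path, storage 
policy names and the manual corruption step are assumptions, and this is 
untested):
{code}
# Hypothetical repro with storage tiering and a single replica:
hdfs dfs -mkdir -p /tier-test
hdfs storagepolicies -setStoragePolicy -path /tier-test -policy COLD
hdfs dfs -D dfs.replication=1 -put big-file.dat /tier-test/

# Locate the single replica's block and the datanode holding it:
hdfs fsck /tier-test/big-file.dat -files -blocks -locations

# On that datanode, flip some bytes in the block file under the ARCHIVE storage
# directory, then ask the mover to migrate the path back to DISK and watch it:
hdfs storagepolicies -setStoragePolicy -path /tier-test -policy HOT
hdfs mover -p /tier-test
{code}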

> HDFS mover stuck in loop trying to move corrupt block with no other valid 
> replicas, doesn't move rest of other data blocks
> --
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.6.0
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> 
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8341) HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't move rest of other data blocks

2015-09-18 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875917#comment-14875917
 ] 

Hari Sekhon commented on HDFS-8341:
---

[~surendrasingh] it doesn't matter whether the block replica locations are 
shuffled or not when there is only one replica.

The crux of the problem is not the locations; it's that the mover exits and 
retries the same block, which has no uncorrupted replicas, so it never 
progresses to the next block.

> HDFS mover stuck in loop trying to move corrupt block with no other valid 
> replicas, doesn't move rest of other data blocks
> --
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.6.0
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> 
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8341) HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't move rest of other data blocks

2015-09-18 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875829#comment-14875829
 ] 

Hari Sekhon commented on HDFS-8341:
---

[~szetszwo] No, I meant the original log in the description at the top, which 
shows
{code}balancer.Dispatcher: Failed to move blk_1075156654_1438349{code} 
repeating over and over in the output, which is what made me think it was 
looping on the same block.

There's only 1 replica of each block... so it's not iterating over locations as 
the code snippet you are pointing to suggests, since there are no other 
locations to try; it exits, then restarts at the same block which still has no 
uncorrupted replicas available, exits again, restarts at the same block again, 
etc.

> HDFS mover stuck in loop trying to move corrupt block with no other valid 
> replicas, doesn't move rest of other data blocks
> --
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.6.0
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> 
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8341) (Summary & Description may be invalid) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning externa

2015-09-17 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802746#comment-14802746
 ] 

Hari Sekhon commented on HDFS-8341:
---

[~szetszwo] I believe this ticket is still valid:

There were holes in the data because that storage tier had replication factor 
1: replication was supposed to be handled within the proprietary hyperscale 
storage solution underpinning that tier, so there was no point in storing 
multiple HDFS replicas there. So if a given block's checksum failed, HDFS Mover 
looped on that block (probably hoping to find another valid replica to use, but 
there were no other replicas, so it was stuck looping on the one corrupt 
replica) and never got past it, so it didn't transfer the rest of the data's 
blocks.

The same problem would occur if all replicas were corrupt, or if a block was 
under-replicated (which happens often) and its only existing replica was 
corrupt.

So this jira is still valid: if HDFS Mover can't find a valid, non-corrupt 
replica it doesn't proceed to move the rest of the blocks, which prevented 
decommissioning of this storage tier. This is why I scripted a custom recovery 
job under the hood of Hadoop, since the other blocks were fine and the mover 
was leaving a lot of data behind on the external storage tier.
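
For illustration only, one way such a recovery could be scripted at the HDFS 
level rather than under the hood (shell; the path, the fsck/ls output parsing 
and the per-file mover invocation are assumptions, not the actual recovery job 
referred to above):
{code}
#!/usr/bin/env bash
# Hypothetical workaround: skip files that fsck reports as corrupt and run the
# mover on everything else, so one bad replica can't stall the whole migration.
set -u

path=/apps/hive/warehouse

# Files with corrupt blocks under the path (illustrative parsing of fsck output).
hdfs fsck "$path" -list-corruptfileblocks | awk '/^blk_/ {print $2}' | sort -u > corrupt_files.txt

# All regular files under the path, minus the corrupt ones.
hdfs dfs -ls -R "$path" | awk '$1 !~ /^d/ {print $NF}' | grep -v -x -F -f corrupt_files.txt > movable_files.txt

# Move healthy files one at a time so a single failure can't block the rest.
while IFS= read -r f; do
    hdfs mover -p "$f" || echo "mover failed for $f, continuing" >&2
done < movable_files.txt
{code}
Since "hdfs mover -p" accepts a space-separated list of paths, the healthy 
files could also be batched rather than moved one at a time.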

> (Summary & Description may be invalid) HDFS mover stuck in loop after failing 
> to move block, doesn't move rest of blocks, can't get data back off 
> decommissioning external storage tier as a result
> ---
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.6.0
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> 
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-8341) (Summary & Description may be invalid) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external

2015-09-17 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon reopened HDFS-8341:
---

> (Summary & Description may be invalid) HDFS mover stuck in loop after failing 
> to move block, doesn't move rest of blocks, can't get data back off 
> decommissioning external storage tier as a result
> ---
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.6.0
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> 
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8341) HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't move rest of other data blocks

2015-09-17 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8341:
--
Summary: HDFS mover stuck in loop trying to move corrupt block with no 
other valid replicas, doesn't move rest of other data blocks  (was: HDFS mover 
stuck in loop trying to move corrupt block with no other valid replicas, 
doesn't move rest of other data blocks, can't get data back off decommissioning 
external storage tier as a result)

> HDFS mover stuck in loop trying to move corrupt block with no other valid 
> replicas, doesn't move rest of other data blocks
> --
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.6.0
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> 
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8341) HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't move rest of other data blocks, can't get data back off decommissioning externa

2015-09-17 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8341:
--
Summary: HDFS mover stuck in loop trying to move corrupt block with no 
other valid replicas, doesn't move rest of other data blocks, can't get data 
back off decommissioning external storage tier as a result  (was: (Summary & 
Description may be invalid) HDFS mover stuck in loop after failing to move 
block, doesn't move rest of blocks, can't get data back off decommissioning 
external storage tier as a result)

> HDFS mover stuck in loop trying to move corrupt block with no other valid 
> replicas, doesn't move rest of other data blocks, can't get data back off 
> decommissioning external storage tier as a result
> -
>
> Key: HDFS-8341
> URL: https://issues.apache.org/jira/browse/HDFS-8341
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.6.0
> Environment: HDP 2.2
>Reporter: Hari Sekhon
>Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't 
> migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage 
> tier used for archive (we've had problems with that proprietary "hyperscale" 
> storage product which is why a couple blocks here and there have checksum 
> problems or premature eof as shown below), but this should not prevent moving 
> all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
> {hdfs://nameservice1=[/apps/hive/warehouse/]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
> NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
> 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> 
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
> /default-rack/:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
> blk_1075156654_1438349 with size=134217728 from :1019:ARCHIVE to 
> :1019:DISK through :1019: block move is failed: opReplaceBlock 
> BP-120244285--1417023863606:blk_1075156654_1438349 received exception 
> java.io.EOFException: Premature EOF: no length prefix available
> ..
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8341) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result

2015-05-11 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14537733#comment-14537733
 ] 

Hari Sekhon commented on HDFS-8341:
---

I had to move several thousand blocks by hand via scripting, which was a 
significant proportion of the total blocks given that I had only put a limited 
amount of expendable data on the archive tier for testing. Given the dimensions 
of the data, I'm certain it wasn't only successive blocks of one given file.

The command was looping on the same block, which also implies it never 
proceeded to try to move the blocks of the other files, hence the large number 
of blocks left behind and not moved back to the regular disk tier.

 HDFS mover stuck in loop after failing to move block, doesn't move rest of 
 blocks, can't get data back off decommissioning external storage tier as a 
 result
 

 Key: HDFS-8341
 URL: https://issues.apache.org/jira/browse/HDFS-8341
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer & mover
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Assignee: surendra singh lilhore
Priority: Blocker

 HDFS mover gets stuck looping on a block that fails to move and doesn't 
 migrate the rest of the blocks.
 This is preventing recovery of data from a decommissioning external storage 
 tier used for archive (we've had problems with that proprietary hyperscale 
 storage product which is why a couple blocks here and there have checksum 
 problems or premature eof as shown below), but this should not prevent moving 
 all the other blocks to recover our data:
 {code}hdfs mover -p /apps/hive/warehouse/custom_scrubbed
 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
 {hdfs://nameservice1=[/apps/hive/warehouse/custom_scrubbed]}
 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
 NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
 30mins, 0sec
 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 NOW IT STARTS LOOPING ON SAME BLOCK
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 ...repeat indefinitely...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8341) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result

2015-05-07 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-8341:
-

 Summary: HDFS mover stuck in loop after failing to move block, 
doesn't move rest of blocks, can't get data back off decommissioning external 
storage tier as a result
 Key: HDFS-8341
 URL: https://issues.apache.org/jira/browse/HDFS-8341
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer & mover
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Blocker


HDFS mover gets stuck looping on a block that fails to move and doesn't migrate 
the rest of the blocks.

This is preventing recovery of data from a decommissioning external storage tier 
used for archive (we've had problems with that proprietary hyperscale storage 
product which is why a couple blocks here and there have checksum problems or 
premature eof as shown below), but this should not prevent moving all the other 
blocks to recover our data:
{code}hdfs mover -p /apps/hive/warehouse/custom_scrubbed
15/05/07 14:52:50 INFO mover.Mover: namenodes = 
{hdfs://nameservice1=[/apps/hive/warehouse/custom_scrubbed]}
15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
java.io.EOFException: Premature EOF: no length prefix available
NOW IT STARTS LOOPING ON SAME BLOCK
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
/default-rack/ip:1019
15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
java.io.EOFException: Premature EOF: no length prefix available
...repeat indefinitely...
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8341) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result

2015-05-07 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532730#comment-14532730
 ] 

Hari Sekhon commented on HDFS-8341:
---

I've worked around this with a manually scripted move of the blocks under the 
hood in this scenario (see the sketch below)... but this missing functionality 
is effectively a blocker for HDFS storage tiering.
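
For reference, the workaround looked roughly like the sketch below. It is only a 
sketch: the data dir paths, block pool ID and block ID are illustrative 
assumptions (following the log format above), not a supported procedure, and it 
assumes the block and its .meta file are still readable on the archive volume.
{code}
# Illustrative values - substitute your own block pool ID, block ID and
# dfs.datanode.data.dir locations.
BP=BP-120244285-ip-1417023863606      # block pool ID (as shown in the mover output)
BLK=blk_1075156654                    # block the mover keeps failing on
SRC=/archive1/dn                      # [ARCHIVE] data dir holding the stuck replica
DST=/data1/dn                         # [DISK] data dir on the same DataNode

# Locate the block file and its .meta companion on the archive volume
cd "$SRC/current/$BP/current/finalized"
find . -name "${BLK}*"

# Copy both files into the DISK volume, preserving the subdir layout
find . -name "${BLK}*" | rsync -a --files-from=- . "$DST/current/$BP/current/finalized/"
chown -R hdfs:hadoop "$DST/current/$BP"

# Restart the DataNode (Ambari / hadoop-daemon.sh) so the relocated replica is
# rescanned and block-reported, then clean up the copy on the archive volume
# once fsck reports the file healthy.
{code}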

 HDFS mover stuck in loop after failing to move block, doesn't move rest of 
 blocks, can't get data back off decommissioning external storage tier as a 
 result
 

 Key: HDFS-8341
 URL: https://issues.apache.org/jira/browse/HDFS-8341
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer  mover
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Critical

 HDFS mover gets stuck looping on a block that fails to move and doesn't 
 migrate the rest of the blocks.
 This is preventing recovery of data from a decommissioning external storage 
 tier used for archive (we've had problems with that proprietary hyperscale 
 storage product which is why a couple blocks here and there have checksum 
 problems or premature eof as shown below), but this should not prevent moving 
 all the other blocks to recover our data:
 {code}hdfs mover -p /apps/hive/warehouse/custom_scrubbed
 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
 {hdfs://nameservice1=[/apps/hive/warehouse/custom_scrubbed]}
 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
 NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
 30mins, 0sec
 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 NOW IT STARTS LOOPING ON SAME BLOCK
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 ...repeat indefinitely...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8341) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result

2015-05-07 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8341:
--
Priority: Critical  (was: Blocker)

 HDFS mover stuck in loop after failing to move block, doesn't move rest of 
 blocks, can't get data back off decommissioning external storage tier as a 
 result
 

 Key: HDFS-8341
 URL: https://issues.apache.org/jira/browse/HDFS-8341
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer  mover
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Critical

 HDFS mover gets stuck looping on a block that fails to move and doesn't 
 migrate the rest of the blocks.
 This is preventing recovery of data from a decommissioning external storage 
 tier used for archive (we've had problems with that proprietary hyperscale 
 storage product which is why a couple blocks here and there have checksum 
 problems or premature eof as shown below), but this should not prevent moving 
 all the other blocks to recover our data:
 {code}hdfs mover -p /apps/hive/warehouse/custom_scrubbed
 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
 {hdfs://nameservice1=[/apps/hive/warehouse/custom_scrubbed]}
 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
 NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
 30mins, 0sec
 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 NOW IT STARTS LOOPING ON SAME BLOCK
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 ...repeat indefinitely...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8341) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result

2015-05-07 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8341:
--
Priority: Blocker  (was: Critical)

 HDFS mover stuck in loop after failing to move block, doesn't move rest of 
 blocks, can't get data back off decommissioning external storage tier as a 
 result
 

 Key: HDFS-8341
 URL: https://issues.apache.org/jira/browse/HDFS-8341
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer  mover
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Blocker

 HDFS mover gets stuck looping on a block that fails to move and doesn't 
 migrate the rest of the blocks.
 This is preventing recovery of data from a decommissioning external storage 
 tier used for archive (we've had problems with that proprietary hyperscale 
 storage product which is why a couple blocks here and there have checksum 
 problems or premature eof as shown below), but this should not prevent moving 
 all the other blocks to recover our data:
 {code}hdfs mover -p /apps/hive/warehouse/custom_scrubbed
 15/05/07 14:52:50 INFO mover.Mover: namenodes = 
 {hdfs://nameservice1=[/apps/hive/warehouse/custom_scrubbed]}
 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from 
 NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 
 30mins, 0sec
 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 NOW IT STARTS LOOPING ON SAME BLOCK
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: 
 /default-rack/ip:1019
 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move 
 blk_1075156654_1438349 with size=134217728 from ip:1019:ARCHIVE to 
 ip:1019:DISK through ip:1019: block move is failed: opReplaceBlock 
 BP-120244285-ip-1417023863606:blk_1075156654_1438349 received exception 
 java.io.EOFException: Premature EOF: no length prefix available
 ...repeat indefinitely...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8299) HDFS reporting missing blocks when they are actually present due to read-only filesystem

2015-05-01 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522917#comment-14522917
 ] 

Hari Sekhon commented on HDFS-8299:
---

To clarify, a read-only filesystem should not prevent the blocks from being 
included in the block report to the NameNode and reported as existing; it 
should merely prevent new block writes to that partition until the problem is 
resolved.

 HDFS reporting missing blocks when they are actually present due to read-only 
 filesystem
 

 Key: HDFS-8299
 URL: https://issues.apache.org/jira/browse/HDFS-8299
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Critical
 Attachments: datanode.log


 Fsck shows missing blocks when the blocks can be found on a datanode's 
 filesystem and the datanode has been restarted to try to get it to recognize 
 that the blocks are indeed present and hence report them to the NameNode in a 
 block report.
 Fsck output showing an example missing block:
 {code}/apps/hive/warehouse/custom_scrubbed.db/someTable/00_0: CORRUPT 
 blockpool BP-120244285-ip-1417023863606 block blk_1075202330
  MISSING 1 blocks of total size 3260848 B
 0. BP-120244285-ip-1417023863606:blk_1075202330_1484191 len=3260848 
 MISSING!{code}
 The block is definitely present on more than one datanode however, here is 
 the output from one of them that I restarted to try to get it to report the 
 block to the NameNode:
 {code}# ll 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330*
 -rw-r--r-- 1 hdfs 499 3260848 Apr 27 15:02 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330
 -rw-r--r-- 1 hdfs 499   25483 Apr 27 15:02 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330_1484191.meta{code}
 It's worth noting that this is on HDFS tiered storage on an archive tier 
 going to a networked block device that may have become temporarily 
 unavailable but is available now. See also feature request HDFS-8297 for 
 online rescan to not have to go around restarting datanodes.
 It turns out in the datanode log (that I am attaching) this is because the 
 datanode fails to get a write lock on the filesystem. I think it would be 
 better to be able to read-only those blocks however, since this way causes 
 client visible data unavailability when the data could in fact be read.
 {code}2015-04-30 14:11:08,235 WARN  datanode.DataNode 
 (DataNode.java:checkStorageLocations(2284)) - Invalid dfs.datanode.data.dir 
 /archive1/dn :
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
 writable: /archive1/dn
 at 
 org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:193)
 at 
 org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:157)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2239)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2281)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2263)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
 at 
 org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.start(SecureDataNodeStarter.java:78)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:243)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8299) HDFS reporting missing blocks when they are actually present due to read-only filesystem

2015-05-01 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14522913#comment-14522913
 ] 

Hari Sekhon commented on HDFS-8299:
---

Yes, that's what I figured too, but I'm suggesting that just because a write 
lock cannot be obtained doesn't mean the blocks can't be read when they are 
clearly there.

Instead of causing user-visible data unavailability, the DataNode should still 
serve the data, with any new writes going to other nodes/partitions. It would 
also need to report that the partition is in a read-only state (due to some 
underlying ext4 filesystem issue) in the NameNode JSP / dfsadmin -report etc.
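
As an interim operational step on the affected node (not a fix for the DataNode 
behaviour itself), something like the following can confirm the mount has gone 
read-only and put it back; a rough sketch that assumes /archive1 is the mount 
backing the data dir from the report and that the underlying device is healthy 
again:
{code}
# Has the archive mount been remounted read-only? Look for "ro" in the options.
grep ' /archive1 ' /proc/mounts

# If the underlying device is reachable again, remount read-write
mount -o remount,rw /archive1

# Confirm the DataNode user can write to the data dir again
sudo -u hdfs touch /archive1/dn/.rw-test && sudo -u hdfs rm /archive1/dn/.rw-test

# Restart the DataNode so /archive1/dn passes the DiskChecker test and its
# blocks are included in the next block report.
{code}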

 HDFS reporting missing blocks when they are actually present due to read-only 
 filesystem
 

 Key: HDFS-8299
 URL: https://issues.apache.org/jira/browse/HDFS-8299
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Critical
 Attachments: datanode.log


 Fsck shows missing blocks when the blocks can be found on a datanode's 
 filesystem and the datanode has been restarted to try to get it to recognize 
 that the blocks are indeed present and hence report them to the NameNode in a 
 block report.
 Fsck output showing an example missing block:
 {code}/apps/hive/warehouse/custom_scrubbed.db/someTable/00_0: CORRUPT 
 blockpool BP-120244285-ip-1417023863606 block blk_1075202330
  MISSING 1 blocks of total size 3260848 B
 0. BP-120244285-ip-1417023863606:blk_1075202330_1484191 len=3260848 
 MISSING!{code}
 The block is definitely present on more than one datanode however, here is 
 the output from one of them that I restarted to try to get it to report the 
 block to the NameNode:
 {code}# ll 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330*
 -rw-r--r-- 1 hdfs 499 3260848 Apr 27 15:02 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330
 -rw-r--r-- 1 hdfs 499   25483 Apr 27 15:02 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330_1484191.meta{code}
 It's worth noting that this is on HDFS tiered storage on an archive tier 
 going to a networked block device that may have become temporarily 
 unavailable but is available now. See also feature request HDFS-8297 for 
 online rescan to not have to go around restarting datanodes.
 It turns out in the datanode log (that I am attaching) this is because the 
 datanode fails to get a write lock on the filesystem. I think it would be 
 better to be able to read-only those blocks however, since this way causes 
 client visible data unavailability when the data could in fact be read.
 {code}2015-04-30 14:11:08,235 WARN  datanode.DataNode 
 (DataNode.java:checkStorageLocations(2284)) - Invalid dfs.datanode.data.dir 
 /archive1/dn :
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
 writable: /archive1/dn
 at 
 org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:193)
 at 
 org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:157)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2239)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2281)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2263)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
 at 
 org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.start(SecureDataNodeStarter.java:78)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:243)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8297) Ability to online trigger data dir rescan for blocks

2015-04-30 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-8297:
-

 Summary: Ability to online trigger data dir rescan for blocks
 Key: HDFS-8297
 URL: https://issues.apache.org/jira/browse/HDFS-8297
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon


Feature request: add the ability to trigger an online data dir rescan for 
available blocks without having to restart the datanode.

Motivation: when using HDFS storage tiering with an archive tier on a separate 
hyperscale storage device over the network (Hedvig in this case), that device 
may go away and then return due to, say, a network interruption or other 
temporary error. This leaves HDFS fsck declaring missing blocks that are 
clearly visible on the mount point of the node's archive directory. An online 
trigger for a data dir rescan of available blocks would avoid having to do a 
rolling restart of all datanodes across a cluster (see the sketch below). I did 
try sending a kill -HUP to the datanode process (both the SecureDataNodeStarter 
parent and child) while tailing the log, hoping this might trigger a rescan, 
but nothing happened in the log.
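
Until such a trigger exists, the options appear to be a rolling DataNode restart 
or, on later releases only, the dfsadmin block report trigger; a sketch, with 
the hostname and IPC port below being placeholder assumptions:
{code}
# Hadoop 2.7+ only: ask a specific DataNode to send a full block report now
# (dfs.datanode.ipc.address defaults to port 50020; host is an example)
hdfs dfsadmin -triggerBlockReport datanode01.example.com:50020

# On 2.6 the fallback is a rolling restart of the DataNodes so each volume is
# rescanned, e.g. per node (full path to hadoop-daemon.sh depends on the distro):
su - hdfs -c "hadoop-daemon.sh stop datanode && hadoop-daemon.sh start datanode"
{code}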

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8299) HDFS reporting missing blocks when they are actually present due to read-only filesystem

2015-04-30 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8299:
--
Description: 
Fsck shows missing blocks when the blocks can be found on a datanode's 
filesystem and the datanode has been restarted to try to get it to recognize 
that the blocks are indeed present and hence report them to the NameNode in a 
block report.

Fsck output showing an example missing block:
{code}/apps/hive/warehouse/custom_scrubbed.db/someTable/00_0: CORRUPT 
blockpool BP-120244285-ip-1417023863606 block blk_1075202330
 MISSING 1 blocks of total size 3260848 B
0. BP-120244285-ip-1417023863606:blk_1075202330_1484191 len=3260848 
MISSING!{code}
The block is definitely present on more than one datanode however, here is the 
output from one of them that I restarted to try to get it to report the block 
to the NameNode:
{code}# ll 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330*
-rw-r--r-- 1 hdfs 499 3260848 Apr 27 15:02 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330
-rw-r--r-- 1 hdfs 499   25483 Apr 27 15:02 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330_1484191.meta{code}
It's worth noting that this is on HDFS tiered storage on an archive tier going 
to a networked block device that may have become temporarily unavailable but is 
available now. See also feature request HDFS-8297 for online rescan to not have 
to go around restarting datanodes.

It turns out from the datanode log (attached) that this is because the 
datanode fails to get a write lock on the filesystem. I think it would be 
better to serve those blocks read-only, however, since the current behaviour 
causes client-visible data unavailability when the data could in fact be read.

{code}2015-04-30 14:11:08,235 WARN  datanode.DataNode 
(DataNode.java:checkStorageLocations(2284)) - Invalid dfs.datanode.data.dir 
/archive1/dn :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
writable: /archive1/dn
at 
org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:193)
at 
org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:157)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2239)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2281)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2263)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
at 
org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.start(SecureDataNodeStarter.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:243)
{code}

Hari Sekhon
http://www.linkedin.com/in/harisekhon

 HDFS reporting missing blocks when they are actually present due to read-only 
 filesystem
 

 Key: HDFS-8299
 URL: https://issues.apache.org/jira/browse/HDFS-8299
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Critical
 Attachments: datanode.log


 Fsck shows missing blocks when the blocks can be found on a datanode's 
 filesystem and the datanode has been restarted to try to get it to recognize 
 that the blocks are indeed present and hence report them to the NameNode in a 
 block report.
 Fsck output showing an example missing block:
 {code}/apps/hive/warehouse/custom_scrubbed.db/someTable/00_0: CORRUPT 
 blockpool BP-120244285-ip-1417023863606 block blk_1075202330
  MISSING 1 blocks of total size 3260848 B
 0. BP-120244285-ip-1417023863606:blk_1075202330_1484191 len=3260848 
 MISSING!{code}
 The block is definitely present on more than one datanode however, here is 
 the output from one of them that I restarted to try to get it to report the 
 block to the NameNode:
 {code}# ll 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330*
 

[jira] [Updated] (HDFS-8299) HDFS reporting missing blocks when they are actually present due to read-only filesystem

2015-04-30 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8299:
--
Attachment: datanode.log

 HDFS reporting missing blocks when they are actually present due to read-only 
 filesystem
 

 Key: HDFS-8299
 URL: https://issues.apache.org/jira/browse/HDFS-8299
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
 Environment: Fsck shows missing blocks when the blocks can be found 
 on a datanode's filesystem and the datanode has been restarted to try to get 
 it to recognize that the blocks are indeed present and hence report them to 
 the NameNode in a block report.
 Fsck output showing an example missing block:
 {code}/apps/hive/warehouse/custom_scrubbed.db/someTable/00_0: CORRUPT 
 blockpool BP-120244285-ip-1417023863606 block blk_1075202330
  MISSING 1 blocks of total size 3260848 B
 0. BP-120244285-ip-1417023863606:blk_1075202330_1484191 len=3260848 
 MISSING!{code}
 The block is definitely present on more than one datanode however, here is 
 the output from one of them that I restarted to try to get it to report the 
 block to the NameNode:
 {code}# ll 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330*
 -rw-r--r-- 1 hdfs 499 3260848 Apr 27 15:02 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330
 -rw-r--r-- 1 hdfs 499   25483 Apr 27 15:02 
 /archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330_1484191.meta{code}
 It's worth noting that this is on HDFS tiered storage on an archive tier 
 going to a networked block device that may have become temporarily 
 unavailable but is available now. See also feature request HDFS-8297 for 
 online rescan to not have to go around restarting datanodes.
 It turns out in the datanode log (that I am attaching) this is because the 
 datanode fails to get a write lock on the filesystem. I think it would be 
 better to be able to read-only those blocks however, since this way causes 
 client visible data unavailability when the data could in fact be read.
 {code}2015-04-30 14:11:08,235 WARN  datanode.DataNode 
 (DataNode.java:checkStorageLocations(2284)) - Invalid dfs.datanode.data.dir 
 /archive1/dn :
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
 writable: /archive1/dn
 at 
 org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:193)
 at 
 org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
 at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:157)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2239)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2281)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2263)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
 at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
 at 
 org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.start(SecureDataNodeStarter.java:78)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:243)
 {code}
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon
Reporter: Hari Sekhon
Priority: Critical
 Attachments: datanode.log






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8299) HDFS reporting missing blocks when they are actually present due to read-only filesystem

2015-04-30 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8299:
--
Environment: HDP 2.2  (was: Fsck shows missing blocks when the blocks can 
be found on a datanode's filesystem and the datanode has been restarted to try 
to get it to recognize that the blocks are indeed present and hence report them 
to the NameNode in a block report.

Fsck output showing an example missing block:
{code}/apps/hive/warehouse/custom_scrubbed.db/someTable/00_0: CORRUPT 
blockpool BP-120244285-ip-1417023863606 block blk_1075202330
 MISSING 1 blocks of total size 3260848 B
0. BP-120244285-ip-1417023863606:blk_1075202330_1484191 len=3260848 
MISSING!{code}
The block is definitely present on more than one datanode however, here is the 
output from one of them that I restarted to try to get it to report the block 
to the NameNode:
{code}# ll 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330*
-rw-r--r-- 1 hdfs 499 3260848 Apr 27 15:02 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330
-rw-r--r-- 1 hdfs 499   25483 Apr 27 15:02 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330_1484191.meta{code}
It's worth noting that this is on HDFS tiered storage on an archive tier going 
to a networked block device that may have become temporarily unavailable but is 
available now. See also feature request HDFS-8297 for online rescan to not have 
to go around restarting datanodes.

It turns out in the datanode log (that I am attaching) this is because the 
datanode fails to get a write lock on the filesystem. I think it would be 
better to be able to read-only those blocks however, since this way causes 
client visible data unavailability when the data could in fact be read.

{code}2015-04-30 14:11:08,235 WARN  datanode.DataNode 
(DataNode.java:checkStorageLocations(2284)) - Invalid dfs.datanode.data.dir 
/archive1/dn :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
writable: /archive1/dn
at 
org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:193)
at 
org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:157)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2239)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2281)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2263)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
at 
org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.start(SecureDataNodeStarter.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:243)
{code}

Hari Sekhon
http://www.linkedin.com/in/harisekhon)

 HDFS reporting missing blocks when they are actually present due to read-only 
 filesystem
 

 Key: HDFS-8299
 URL: https://issues.apache.org/jira/browse/HDFS-8299
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Critical
 Attachments: datanode.log






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8299) HDFS reporting missing blocks when they are actually present due to read-only filesystem

2015-04-30 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-8299:
-

 Summary: HDFS reporting missing blocks when they are actually 
present due to read-only filesystem
 Key: HDFS-8299
 URL: https://issues.apache.org/jira/browse/HDFS-8299
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.6.0
 Environment: Fsck shows missing blocks when the blocks can be found on 
a datanode's filesystem and the datanode has been restarted to try to get it to 
recognize that the blocks are indeed present and hence report them to the 
NameNode in a block report.

Fsck output showing an example missing block:
{code}/apps/hive/warehouse/custom_scrubbed.db/someTable/00_0: CORRUPT 
blockpool BP-120244285-ip-1417023863606 block blk_1075202330
 MISSING 1 blocks of total size 3260848 B
0. BP-120244285-ip-1417023863606:blk_1075202330_1484191 len=3260848 
MISSING!{code}
The block is definitely present on more than one datanode however, here is the 
output from one of them that I restarted to try to get it to report the block 
to the NameNode:
{code}# ll 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330*
-rw-r--r-- 1 hdfs 499 3260848 Apr 27 15:02 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330
-rw-r--r-- 1 hdfs 499   25483 Apr 27 15:02 
/archive1/dn/current/BP-120244285-ip-1417023863606/current/finalized/subdir22/subdir73/blk_1075202330_1484191.meta{code}
It's worth noting that this is on HDFS tiered storage on an archive tier going 
to a networked block device that may have become temporarily unavailable but is 
available now. See also feature request HDFS-8297 for online rescan to not have 
to go around restarting datanodes.

It turns out in the datanode log (that I am attaching) this is because the 
datanode fails to get a write lock on the filesystem. I think it would be 
better to be able to read-only those blocks however, since this way causes 
client visible data unavailability when the data could in fact be read.

{code}2015-04-30 14:11:08,235 WARN  datanode.DataNode 
(DataNode.java:checkStorageLocations(2284)) - Invalid dfs.datanode.data.dir 
/archive1/dn :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
writable: /archive1/dn
at 
org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:193)
at 
org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:157)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataNodeDiskChecker.checkDir(DataNode.java:2239)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.checkStorageLocations(DataNode.java:2281)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2263)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
at 
org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter.start(SecureDataNodeStarter.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.commons.daemon.support.DaemonLoader.start(DaemonLoader.java:243)
{code}

Hari Sekhon
http://www.linkedin.com/in/harisekhon
Reporter: Hari Sekhon
Priority: Critical
 Attachments: datanode.log





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary failures

2015-04-30 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-8298:
-

 Summary: HA: NameNode should not shut down completely without 
quorum, doesn't recover from temporary failures
 Key: HDFS-8298
 URL: https://issues.apache.org/jira/browse/HDFS-8298
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha, HDFS, namenode, qjm
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon


In an HDFS HA setup, if there is a temporary problem contacting the journal 
nodes (e.g. a network interruption), the NameNode shuts down entirely, when it 
should instead drop into a standby/recovery state so that it can stay online 
and retry to achieve quorum later.

If both NameNodes shut themselves down like this, then even after the temporary 
network outage is resolved the entire cluster remains offline indefinitely 
until operator intervention (see the recovery sketch after the log below), 
whereas it could have self-repaired by re-contacting the journalnodes and 
re-achieving quorum.

{code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
(JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
required journal (JournalAndStre
am(mgr=QJM to [ip:8485, ip:8485, ip:8485], stream=QuorumOutputStream 
starting at txid 54270281))
java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
respond.
at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
at 
org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at 
org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
at 
org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
at java.lang.Thread.run(Thread.java:745)
2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
(QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
txid 54270281
2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
Exiting with status 1
2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at custom_scrubbed/ip
/{code}
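
Until the NameNode can ride out such an outage on its own, recovery is manual 
once the JournalNodes are reachable again; a rough operational sketch (script 
locations vary by distro, and nn1/nn2 stand in for whatever dfs.ha.namenodes 
IDs are configured):
{code}
# On each NameNode host, once the JournalNode quorum is reachable again
# (full path to hadoop-daemon.sh depends on the distro)
su - hdfs -c "hadoop-daemon.sh start namenode"

# Verify which NameNode came back active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
{code}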

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8298) HA: NameNode should not shut down completely without quorum, doesn't recover from temporary network outages

2015-04-30 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8298:
--
Summary: HA: NameNode should not shut down completely without quorum, 
doesn't recover from temporary network outages  (was: HA: NameNode should not 
shut down completely without quorum, doesn't recover from temporary failures)

 HA: NameNode should not shut down completely without quorum, doesn't recover 
 from temporary network outages
 ---

 Key: HDFS-8298
 URL: https://issues.apache.org/jira/browse/HDFS-8298
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ha, HDFS, namenode, qjm
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon

 In an HDFS HA setup if there is a temporary problem with contacting journal 
 nodes (eg. network interruption), the NameNode shuts down entirely, when it 
 should instead go in to a standby mode so that it can stay online and retry 
 to achieve quorum later.
 If both NameNodes shut themselves off like this then even after the temporary 
 network outage is resolved, the entire cluster remains offline indefinitely 
 until operator intervention, whereas it could have self-repaired after 
 re-contacting the journalnodes and re-achieving quorum.
 {code}2015-04-15 15:59:26,900 FATAL namenode.FSEditLog 
 (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for 
 required journal (JournalAndStre
 am(mgr=QJM to [ip:8485, ip:8485, ip:8485], stream=QuorumOutputStream 
 starting at txid 54270281))
 java.io.IOException: Interrupted waiting 2ms for a quorum of nodes to 
 respond.
 at 
 org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:134)
 at 
 org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
 at 
 org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
 at 
 org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
 at 
 org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
 at 
 org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:639)
 at 
 org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:388)
 at java.lang.Thread.run(Thread.java:745)
 2015-04-15 15:59:26,901 WARN  client.QuorumJournalManager 
 (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at 
 txid 54270281
 2015-04-15 15:59:26,904 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
 Exiting with status 1
 2015-04-15 15:59:27,001 INFO  namenode.NameNode (StringUtils.java:run(659)) - 
 SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NameNode at custom_scrubbed/ip
 /{code}
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8277) Safemode enter fails when Standby NameNode is down

2015-04-28 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-8277:
-

 Summary: Safemode enter fails when Standby NameNode is down
 Key: HDFS-8277
 URL: https://issues.apache.org/jira/browse/HDFS-8277
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, HDFS, namenode
Affects Versions: 2.6.0
 Environment: HDP 2.2.0
Reporter: Hari Sekhon


HDFS fails to enter safemode when the Standby NameNode is down (eg. due to 
AMBARI-10536).
{code}hdfs dfsadmin -safemode enter
safemode: Call From nn2/x.x.x.x to nn1:8020 failed on connection exception: 
java.net.ConnectException: Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused{code}
This appears to be a bug in that it's not trying both NameNodes like the 
standard hdfs client code does, and is instead stopping after getting a 
connection refused from nn1, which was down. I verified that normal hadoop fs 
writes and reads via the CLI did work at this time, using nn2. I happened to 
run this command as the hdfs user on nn2, which was the surviving active 
NameNode.

After I re-bootstrapped the Standby NN to fix it, the command worked as 
expected again.
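
In the meantime, a workaround is to point dfsadmin at the surviving NameNode 
explicitly with the generic -fs option; a sketch, with the hostname/port below 
being a placeholder for the active NameNode's RPC address:
{code}
# Target the NameNode that is actually up (address is an example)
hdfs dfsadmin -fs hdfs://nn2.example.com:8020 -safemode enter
hdfs dfsadmin -fs hdfs://nn2.example.com:8020 -safemode get

# Note: each NameNode tracks safemode independently in HA, so repeat against
# the other NameNode once it is back up.
{code}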

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8277) Safemode enter fails when Standby NameNode is down

2015-04-28 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-8277:
--
Priority: Minor  (was: Major)

 Safemode enter fails when Standby NameNode is down
 --

 Key: HDFS-8277
 URL: https://issues.apache.org/jira/browse/HDFS-8277
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, HDFS, namenode
Affects Versions: 2.6.0
 Environment: HDP 2.2.0
Reporter: Hari Sekhon
Priority: Minor

 HDFS fails to enter safemode when the Standby NameNode is down (eg. due to 
 AMBARI-10536).
 {code}hdfs dfsadmin -safemode enter
 safemode: Call From nn2/x.x.x.x to nn1:8020 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused{code}
 This appears to be a bug in that it's not trying both NameNodes like the 
 standard hdfs client code does, and is instead stopping after getting a 
 connection refused from nn1 which is down. I verified normal hadoop fs writes 
 and reads via cli did work at this time, using nn2. I happened to run this 
 command as the hdfs user on nn2 which was the surviving Active NameNode.
 After I re-bootstrapped the Standby NN to fix it the command worked as 
 expected again.
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8277) Safemode enter fails when Standby NameNode is down

2015-04-28 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517333#comment-14517333
 ] 

Hari Sekhon commented on HDFS-8277:
---

Ah, I have both NameNodes back up now, so this command works regardless; it 
won't be a great test.

Perhaps this should be labelled an improvement instead of a bug, since other 
hdfs commands do auto-failover in HA setups.

 Safemode enter fails when Standby NameNode is down
 --

 Key: HDFS-8277
 URL: https://issues.apache.org/jira/browse/HDFS-8277
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha, HDFS, namenode
Affects Versions: 2.6.0
 Environment: HDP 2.2.0
Reporter: Hari Sekhon

 HDFS fails to enter safemode when the Standby NameNode is down (eg. due to 
 AMBARI-10536).
 {code}hdfs dfsadmin -safemode enter
 safemode: Call From nn2/x.x.x.x to nn1:8020 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused{code}
 This appears to be a bug in that it's not trying both NameNodes like the 
 standard hdfs client code does, and is instead stopping after getting a 
 connection refused from nn1 which is down. I verified normal hadoop fs writes 
 and reads via cli did work at this time, using nn2. I happened to run this 
 command as the hdfs user on nn2 which was the surviving Active NameNode.
 After I re-bootstrapped the Standby NN to fix it the command worked as 
 expected again.
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8244) HDFS Custom Storage Tier Policies

2015-04-24 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-8244:
-

 Summary: HDFS Custom Storage Tier Policies
 Key: HDFS-8244
 URL: https://issues.apache.org/jira/browse/HDFS-8244
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: balancer  mover, datanode, HDFS, hdfs-client, namenode
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor


Feature request to be able to define custom HDFS storage policies.

For example, being able to define a policy such as DISK:2, ARCHIVE:(n - 2), 
where n is the replication factor.

The motivation: when integrating the archive tier with another, cheaper storage 
system such as Hedvig, which we do not control and which is new and unproven, 
we want to hedge our bets. If something goes wrong with that archive storage 
system, we don't want to be left with just one copy of the data on our cluster, 
in case we also lose a node.
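
For context, this is what is possible today with the built-in policies; a 
sketch only - the subcommand syntax below is the newer (2.7+) form, the 
equivalent set/get commands on 2.6 live under hdfs dfsadmin, and the path is an 
example:
{code}
# List the built-in policies (HOT, WARM, COLD, ONE_SSD, ALL_SSD, ...);
# there is currently no way to add a custom DISK:2,ARCHIVE:(n-2) entry
hdfs storagepolicies -listPolicies

# Pin a directory to one of the existing policies
hdfs storagepolicies -setStoragePolicy -path /apps/hive/warehouse/archive_db -policy COLD
hdfs storagepolicies -getStoragePolicy -path /apps/hive/warehouse/archive_db

# Migrate existing blocks to match the policy
hdfs mover -p /apps/hive/warehouse/archive_db
{code}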

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-2115) Transparent compression in HDFS

2015-03-10 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14355349#comment-14355349
 ] 

Hari Sekhon commented on HDFS-2115:
---

MapR-FS provides transparent compression at the filesystem level - it's a very 
good idea.

It could be done on a directory basis (like MapR) with specific subdirectory 
and file / file extension exclusions, such as a .ignore_compress file in the 
directory.

Keeping files in plain text format makes it easier to use different tools on 
them without worrying about codec or container format support etc, but 
currently one can pay an 8x storage penalty for keeping uncompressed text.

This would solve some real problems for us right now. It's also frustrating 
that so many tools show examples of reading plain text files, yet without 
transparent compression keeping uncompressed text is very costly on storage. 
We are currently stuck with a large historical archive of compressed files we 
can't work with (there is no zip InputFormat), and we can't leave them 
uncompressed either because the storage waste would exceed our cluster 
capacity. Reprocessing them all into a different compression format and then 
hoping every future tool can handle that format is far less attractive than 
simply having transparent compression.

The increasing proliferation of tools and products on Hadoop exacerbates this 
issue as we can never be sure that the next tool will support format X. 
Everything supports text. Please add transparent compression to make working 
with text better.
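
For the record, the workaround this leaves us with is compressing on ingest and 
leaning on tools that understand the codec; a small sketch with illustrative 
paths (hadoop fs -text picks the codec from the file extension on read, but 
anything needing a splittable format still has to cope with the codec itself):
{code}
# Compress on ingest (paths are examples)
gzip -c weblogs-2015-03-10.log | hadoop fs -put - /data/raw/weblogs-2015-03-10.log.gz

# Read it back; -text decompresses gzip/bzip2 etc. transparently on the client side
hadoop fs -text /data/raw/weblogs-2015-03-10.log.gz | head
{code}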

Regards,

Hari Sekhon
http://www.linkedin.com/in/harisekhon

 Transparent compression in HDFS
 ---

 Key: HDFS-2115
 URL: https://issues.apache.org/jira/browse/HDFS-2115
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode, hdfs-client
Reporter: Todd Lipcon

 In practice, we find that a lot of users store text data in HDFS without 
 using any compression codec. Improving usability of compressible formats like 
 Avro/RCFile helps with this, but we could also help many users by providing 
 an option to transparently compress data as it is stored.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7467) Provide storage tier information for a directory via fsck

2014-12-24 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258172#comment-14258172
 ] 

Hari Sekhon commented on HDFS-7467:
---

There does need to be a way to figure out whether a given file, or a directory 
of files, is using fallback storage.

There should also be a global way of seeing whether any files are using 
fallback storage, as an indicator that, for example, there isn't enough SSD.

Adding this information to fsck seems like a sensible way to go - the main 
question is how to represent that information concisely.

Is a file placed on fallback storage always equivalent to some other storage 
policy, such that this output can always be fully described by the percentages 
that Tsz has suggested?

There should also be warning messages in fsck for all files that are unable to 
meet the requested ideal for their storage policy and are using fallback 
storage, perhaps behind a switch since that could become overly voluminous 
output.
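
In the meantime, the closest thing available is the per-file block location 
output, which on newer releases also prints each replica's storage type; a 
sketch with example paths:
{code}
# Per-file view: block locations and, on newer releases, each replica's storage
# type, e.g. DatanodeInfoWithStorage[ip:port,DS-...,ARCHIVE]
hdfs fsck /apps/hive/warehouse/some_db -files -blocks -locations | less

# A crude cluster-wide check (newer releases only): count replicas per storage type
hdfs fsck / -files -blocks -locations | grep -o 'DISK\|ARCHIVE\|SSD' | sort | uniq -c
{code}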

Regards,

Hari Sekhon
http://www.linkedin.com/in/harisekhon

 Provide storage tier information for a directory via fsck
 -

 Key: HDFS-7467
 URL: https://issues.apache.org/jira/browse/HDFS-7467
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: balancer  mover
Affects Versions: 2.6.0
Reporter: Benoy Antony
Assignee: Benoy Antony
 Attachments: HDFS-7467.patch


 Currently _fsck_  provides information regarding blocks for a directory.
 It should be augmented to provide storage tier information (optionally). 
 The sample report could be as follows :
 {code}
 Storage Tier Combination   # of blocks   % of blocks
 DISK:1,ARCHIVE:2                 340730      97.7393%
 ARCHIVE:3                          3928       1.1268%
 DISK:2,ARCHIVE:2                   3122       0.8956%
 DISK:2,ARCHIVE:1                    748       0.2146%
 DISK:1,ARCHIVE:3                     44       0.0126%
 DISK:3,ARCHIVE:2                     30       0.0086%
 DISK:3,ARCHIVE:1                      9       0.0026%
 {code}
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7563) NFS gateway parseStaticMap NumberFormatException

2014-12-23 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256787#comment-14256787
 ] 

Hari Sekhon commented on HDFS-7563:
---

I believe it's similar but not the same: that fix was in different code 
(updateMapInternal) in 2.4.0, and if I recall correctly the static mapping 
functionality for the NFS gateway was added by ATM in version 2.5.0. I'm 
running 2.6.0, whereas HDFS-6361 was fixed in 2.4.1 earlier this year.

I also followed the Java stack trace to the code itself, as detailed above, 
which is why I believe it's an int vs long issue in that static mapping 
functionality added in 2.5.0.

Reproducing it should be as simple as adding a UID of around 4 billion to 
/etc/nfs.map and restarting the HDFS NFS gateway.
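
A minimal, self-contained illustration of the parsing problem (this is not the 
attached patch, just a sketch; the UID value is the 4294967294 mentioned over 
in HDFS-7565): Integer.parseInt rejects anything above Integer.MAX_VALUE, 
while parsing the same field as a long succeeds.
{code}
public class StaticMapParseDemo {
    public static void main(String[] args) {
        String remoteId = "4294967294";  // UID of roughly 4 billion from the client

        try {
            // mirrors the Integer.parseInt call in parseStaticMap
            int asInt = Integer.parseInt(remoteId);
            System.out.println("parsed as int: " + asInt);
        } catch (NumberFormatException e) {
            System.out.println("int parse fails: " + e.getMessage());
        }

        // a long comfortably holds the full unsigned 32-bit UID range
        long asLong = Long.parseLong(remoteId);
        System.out.println("parsed as long: " + asLong);
    }
}
{code}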

 NFS gateway parseStaticMap NumberFormatException
 

 Key: HDFS-7563
 URL: https://issues.apache.org/jira/browse/HDFS-7563
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Assignee: Aaron T. Myers
 Attachments: UID_GID_Long_HashMaps.patch


 When using the new NFS UID mapping for the HDFS NFS gateway I've discovered 
 that my Windows 7 workstation at this bank is passing UID number 4294xx 
 but entering this in the /etc/nfs.map in order to remap that to a Hadoop UID 
 prevents the NFS gateway from restarting with the error message:
 {code}Exception in thread main java.lang.NumberFormatException: For input 
 string: 4294xx
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:495)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.parseStaticMap(ShellBasedIdMapping.java:318)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.updateMaps(ShellBasedIdMapping.java:229)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.init(ShellBasedIdMapping.java:91)
 at 
 org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3.init(RpcProgramNfs3.java:176)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.init(Nfs3.java:45)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:66)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:72)
 {code}
 The /etc/nfs.map file simply contains
 {code}
 uid 4294xx 1
 {code}
 It seems that the code at 
 {code}hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ShellBasedIdMapping.java{code}
 is expecting an integer at line 318 of the parseStaticMap method: {code}int 
 remoteId = Integer.parseInt(lineMatcher.group(2));
 int localId = Integer.parseInt(lineMatcher.group(3));{code}
 This UID does seem very high to me but it has worked successfully on a 
 MapR-FS NFS share and stores files created with that UID over NFS.
 The UID / GID mappings for the HDFS NFS gateway will need to be switched to 
 using Long to accommodate this, I've attached a patch for the parsing and 
 UID/GID HashMaps.
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7565) NFS gateway UID overflow

2014-12-23 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-7565:
--
Environment: HDP 2.2 (Apache Hadoop 2.6.0)  (was: HDP 2.2)

 NFS gateway UID overflow
 

 Key: HDFS-7565
 URL: https://issues.apache.org/jira/browse/HDFS-7565
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2 (Apache Hadoop 2.6.0)
Reporter: Hari Sekhon

 It appears that my Windows 7 workstation is passing a UID around 4 billion to 
 the NFS gateway and the getUserName() method is being passed -2, so it 
 looks like the UID is an int and is overflowing:
 {code}security.ShellBasedIdMapping 
 (ShellBasedIdMapping.java:getUserName(358)) - Can't find user name for uid 
 -2. Use default user name nobody{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7565) NFS gateway UID overflow

2014-12-23 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-7565:
--
Description: 
It appears that my Windows 7 workstation is passing a UID around 4 billion to 
the NFS gateway and the getUserName() method is being passed -2, so it looks 
like the UID is an int and is overflowing:
{code}security.ShellBasedIdMapping (ShellBasedIdMapping.java:getUserName(358)) 
- Can't find user name for uid -2. Use default user name nobody{code}

Regards,

Hari Sekhon
http://www.linkedin.com/in/harisekon

  was:
It appears that my Windows 7 workstation is passing a UID around 4 billion to 
the NFS gateway and the getUserName() method is being passed -2, so it looks 
like the UID is an int and is overflowing:
{code}security.ShellBasedIdMapping (ShellBasedIdMapping.java:getUserName(358)) 
- Can't find user name for uid -2. Use default user name nobody{code}


 NFS gateway UID overflow
 

 Key: HDFS-7565
 URL: https://issues.apache.org/jira/browse/HDFS-7565
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2 (Apache Hadoop 2.6.0)
Reporter: Hari Sekhon

 It appears that my Windows 7 workstation is passing a UID around 4 billion to 
 the NFS gateway and the getUserName() method is being passed -2, so it 
 looks like the UID is an int and is overflowing:
 {code}security.ShellBasedIdMapping 
 (ShellBasedIdMapping.java:getUserName(358)) - Can't find user name for uid 
 -2. Use default user name nobody{code}
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7565) NFS gateway UID overflow

2014-12-23 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256794#comment-14256794
 ] 

Hari Sekhon commented on HDFS-7565:
---

I'm running 2.6.0 and it looks like HDFS-6361 was supposed to be fixed in 2.4.1.

Whereas HDFS-7563 is a startup failure relating to the NFS gateway static 
UID/GID map parsing function, this error occurs whenever I access the NFS mount 
point from my Windows 7 workstation, which makes me think it's a different code 
path.

 NFS gateway UID overflow
 

 Key: HDFS-7565
 URL: https://issues.apache.org/jira/browse/HDFS-7565
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2 (Apache Hadoop 2.6.0)
Reporter: Hari Sekhon

 It appears that my Windows 7 workstation is passing a UID around 4 billion to 
 the NFS gateway and the getUserName() method is being passed -2, so it 
 looks like the UID is an int and is overflowing:
 {code}security.ShellBasedIdMapping 
 (ShellBasedIdMapping.java:getUserName(358)) - Can't find user name for uid 
 -2. Use default user name nobody{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7565) NFS gateway UID overflow

2014-12-23 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257160#comment-14257160
 ] 

Hari Sekhon commented on HDFS-7565:
---

That 4 billion UID is being sent by my Windows 7 workstation when accessing 
the NFS mount point.

It was difficult to find this out (I only discovered it because my other 
MapR+NFS cluster just handled it and I saw the UID on file creation). That is 
why I went looking for the reason I couldn't see that 4 billion UID in the 
HDFS NFS gateway logs; once I read the code I realized it is being shown as -2 
instead of 4 billion, i.e. it is wrapping around in a Java int.

That static mapping issue is a separate one, where trying to map the 4 billion 
UID prevents gateway startup, as documented in HDFS-7563.

This jira is really about answering the question: why can't I find the 
4 billion UID that my client was passing to the HDFS NFS gateway? I believe 
the answer is that the number is being lost to integer overflow.
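
For illustration only (plain Java, not gateway code): this is the wrap-around 
I am describing, with the 4294967294 UID truncated to a signed 32-bit int and 
then recovered by masking back to an unsigned value.
{code}
public class UidOverflowDemo {
    public static void main(String[] args) {
        long clientUid = 4294967294L;             // UID sent over NFS by the Windows client
        int truncated = (int) clientUid;          // what a signed Java int sees
        System.out.println(truncated);            // prints -2

        long recovered = truncated & 0xFFFFFFFFL; // reinterpret as unsigned 32-bit
        System.out.println(recovered);            // prints 4294967294
    }
}
{code}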

 NFS gateway UID overflow
 

 Key: HDFS-7565
 URL: https://issues.apache.org/jira/browse/HDFS-7565
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2 (Apache Hadoop 2.6.0)
Reporter: Hari Sekhon
Assignee: Yongjun Zhang

 It appears that my Windows 7 workstation is passing a UID around 4 billion to 
 the NFS gateway and the getUserName() method is being passed -2, so it 
 looks like the UID is an int and is overflowing:
 {code}security.ShellBasedIdMapping 
 (ShellBasedIdMapping.java:getUserName(358)) - Can't find user name for uid 
 -2. Use default user name nobody{code}
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7565) NFS gateway UID overflow

2014-12-23 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257252#comment-14257252
 ] 

Hari Sekhon commented on HDFS-7565:
---

1. It's simply converting the 4 billion entry to -2, so yes, that isn't going 
to match anything, which is the crux of this ticket. That's the bit that needs 
fixing in this jira.

2. I did previously enable SSSD enumeration (I had already seen that ticket).

A1. UID 4294967294 (from company Active Directory), username on the cluster is 
hari (this is a local IPA realm just for this PoC)
A2. cluster is running on Redhat Enterprise Linux 6, client is Windows 7 
enterprise
A3. Linux getent passwd | grep hari returns 
hari:*:10002:10003:hari:/home/user/hari:/bin/bash

 NFS gateway UID overflow
 

 Key: HDFS-7565
 URL: https://issues.apache.org/jira/browse/HDFS-7565
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2 (Apache Hadoop 2.6.0)
Reporter: Hari Sekhon
Assignee: Yongjun Zhang

 It appears that my Windows 7 workstation is passing a UID around 4 billion to 
 the NFS gateway and the getUserName() method is being passed -2, so it 
 looks like the UID is an int and is overflowing:
 {code}security.ShellBasedIdMapping 
 (ShellBasedIdMapping.java:getUserName(358)) - Can't find user name for uid 
 -2. Use default user name nobody{code}
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7565) NFS gateway UID overflow

2014-12-23 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257551#comment-14257551
 ] 

Hari Sekhon commented on HDFS-7565:
---

To clarify:

The Windows 7 workstation is part of the enterprise Active Directory and is 
what is sending that 4 billion UID. My Windows user is some random 
alphanumeric id, not relevant here.

The hari user was created by me in a standalone Kerberos realm with a backing 
LDAP directory that manages the Hadoop cluster; the cluster authenticates 
against this standalone realm.

So the hari user and its 10002 UID were set up by me and appear as a local 
native user on the cluster nodes.

This is why you see the two UIDs. What I was trying to do was map the 
4 billion UID to the 10002 UID using the recently added static map from the 
other jira, HDFS-7563.

The problem here is that when the NFS gateway receives the 4 billion UID from 
the Windows workstation it overflows and results in the -2 UID.

According to HDFS-6361 that -2 is set because it's outside the int range, but 
it's not clear what the fix is in this instance since it was marked resolved in 
version 2.4.1 and I'm running 2.6.0.
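
As a hedged sketch of what I am after (illustrative only, not the actual 
ShellBasedIdMapping code; the map contents and method name are hypothetical): 
treat the UID from the RPC credentials as unsigned and look it up in a 
long-keyed static map before falling back to nobody.
{code}
import java.util.HashMap;
import java.util.Map;

public class UnsignedUidMappingDemo {
    // remote AD UID -> local cluster UID (here 10002 for the hari user)
    private static final Map<Long, Integer> STATIC_UID_MAP = new HashMap<>();
    static {
        STATIC_UID_MAP.put(4294967294L, 10002);
    }

    static int mapRemoteUid(int rawUidFromRpc) {
        long unsignedUid = rawUidFromRpc & 0xFFFFFFFFL;        // undo the signed-int truncation
        return STATIC_UID_MAP.getOrDefault(unsignedUid, -2);   // -2 stands in for the nobody fallback
    }

    public static void main(String[] args) {
        // the gateway currently sees -2; with an unsigned lookup it maps to 10002
        System.out.println(mapRemoteUid(-2));
    }
}
{code}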

 NFS gateway UID overflow
 

 Key: HDFS-7565
 URL: https://issues.apache.org/jira/browse/HDFS-7565
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2 (Apache Hadoop 2.6.0)
Reporter: Hari Sekhon
Assignee: Yongjun Zhang

 It appears that my Windows 7 workstation is passing a UID around 4 billion to 
 the NFS gateway and the getUserName() method is being passed -2, so it 
 looks like the UID is an int and is overflowing:
 {code}security.ShellBasedIdMapping 
 (ShellBasedIdMapping.java:getUserName(358)) - Can't find user name for uid 
 -2. Use default user name nobody{code}
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7565) NFS gateway UID overflow

2014-12-23 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257557#comment-14257557
 ] 

Hari Sekhon commented on HDFS-7565:
---

Also, note that the 4 billion UID does not result in user hari - that's the 
user I'm trying to make it resolve to via the static mapping file in the other 
ticket.

The 4 billion UID results in user nobody with UID -2. This is the real 
problem: the nobody user cannot access any of the data in the cluster, nor can 
I grant granular access to different users and groups, since they will all 
come out as user nobody and there is no way to tell users apart.

 NFS gateway UID overflow
 

 Key: HDFS-7565
 URL: https://issues.apache.org/jira/browse/HDFS-7565
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2 (Apache Hadoop 2.6.0)
Reporter: Hari Sekhon
Assignee: Yongjun Zhang

 It appears that my Windows 7 workstation is passing a UID around 4 billion to 
 the NFS gateway and the getUserName() method is being passed -2, so it 
 looks like the UID is an int and is overflowing:
 {code}security.ShellBasedIdMapping 
 (ShellBasedIdMapping.java:getUserName(358)) - Can't find user name for uid 
 -2. Use default user name nobody{code}
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7563) NFS gateway parseStaticMap NumberFormatException

2014-12-22 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-7563:
-

 Summary: NFS gateway parseStaticMap NumberFormatException
 Key: HDFS-7563
 URL: https://issues.apache.org/jira/browse/HDFS-7563
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon


When using the new NFS UID mapping for the HDFS NFS gateway I've discovered 
that my Windows 7 workstation at this bank is passing UID number 4294xx but 
entering this in the /etc/nfs.map in order to remap that to a Hadoop UID 
prevents the NFS gateway from restarting with the error message:

{code}Exception in thread main java.lang.NumberFormatException: For input 
string: 4294xx
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:495)
at java.lang.Integer.parseInt(Integer.java:527)
at 
org.apache.hadoop.security.ShellBasedIdMapping.parseStaticMap(ShellBasedIdMapping.java:318)
at 
org.apache.hadoop.security.ShellBasedIdMapping.updateMaps(ShellBasedIdMapping.java:229)
at 
org.apache.hadoop.security.ShellBasedIdMapping.init(ShellBasedIdMapping.java:91)
at 
org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3.init(RpcProgramNfs3.java:176)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.init(Nfs3.java:45)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:66)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:72)
{code}
The /etc/nfs.map file simply contains
{code}
uid 4294xx 1
{code}
It seems that the code at 
{code}hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ShellBasedIdMapping.java{code}
is expecting an integer at line 318 of the parseStaticMap method: {code}int 
remoteId = Integer.parseInt(lineMatcher.group(2));
int localId = Integer.parseInt(lineMatcher.group(3));{code}

This UID does seem very high to me but it has worked successfully on a MapR-FS 
NFS share and stores files created with that UID over NFS.

The UID / GID mappings for the HDFS NFS gateway will need to be switched to 
using Long to accommodate this; I've attached a patch for the parsing and 
UID/GID HashMaps.

Regards,

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7563) NFS gateway parseStaticMap NumberFormatException

2014-12-22 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-7563:
--
Attachment: UID_GID_Long_HashMaps.patch

Patch for Int -> Long UID/GID mapping HashMaps

 NFS gateway parseStaticMap NumberFormatException
 

 Key: HDFS-7563
 URL: https://issues.apache.org/jira/browse/HDFS-7563
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
 Attachments: UID_GID_Long_HashMaps.patch


 When using the new NFS UID mapping for the HDFS NFS gateway I've discovered 
 that my Windows 7 workstation at this bank is passing UID number 4294xx 
 but entering this in the /etc/nfs.map in order to remap that to a Hadoop UID 
 prevents the NFS gateway from restarting with the error message:
 {code}Exception in thread main java.lang.NumberFormatException: For input 
 string: 4294xx
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:495)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.parseStaticMap(ShellBasedIdMapping.java:318)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.updateMaps(ShellBasedIdMapping.java:229)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.init(ShellBasedIdMapping.java:91)
 at 
 org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3.init(RpcProgramNfs3.java:176)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.init(Nfs3.java:45)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:66)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:72)
 {code}
 The /etc/nfs.map file simply contains
 {code}
 uid 4294xx 1
 {code}
 It seems that the code at 
 {code}hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ShellBasedIdMapping.java{code}
 is expecting an integer at line 318 of the parseStaticMap method: {code}int 
 remoteId = Integer.parseInt(lineMatcher.group(2));
 int localId = Integer.parseInt(lineMatcher.group(3));{code}
 This UID does seem very high to me but it has worked successfully on a 
 MapR-FS NFS share and stores files created with that UID over NFS.
 The UID / GID mappings for the HDFS NFS gateway will need to be switched to 
 using Long to accommodate this, I've attached a patch for the parsing and 
 UID/GID HashMaps.
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7563) NFS gateway parseStaticMap NumberFormatException

2014-12-22 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-7563:
--
Assignee: Aaron T. Myers
  Status: Patch Available  (was: Open)

Quick patch, not tested since I don't have the build infrastructure

 NFS gateway parseStaticMap NumberFormatException
 

 Key: HDFS-7563
 URL: https://issues.apache.org/jira/browse/HDFS-7563
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Assignee: Aaron T. Myers
 Attachments: UID_GID_Long_HashMaps.patch


 When using the new NFS UID mapping for the HDFS NFS gateway I've discovered 
 that my Windows 7 workstation at this bank is passing UID number 4294xx 
 but entering this in the /etc/nfs.map in order to remap that to a Hadoop UID 
 prevents the NFS gateway from restarting with the error message:
 {code}Exception in thread main java.lang.NumberFormatException: For input 
 string: 4294xx
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:495)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.parseStaticMap(ShellBasedIdMapping.java:318)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.updateMaps(ShellBasedIdMapping.java:229)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.init(ShellBasedIdMapping.java:91)
 at 
 org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3.init(RpcProgramNfs3.java:176)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.init(Nfs3.java:45)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:66)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:72)
 {code}
 The /etc/nfs.map file simply contains
 {code}
 uid 4294xx 1
 {code}
 It seems that the code at 
 {code}hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ShellBasedIdMapping.java{code}
 is expecting an integer at line 318 of the parseStaticMap method: {code}int 
 remoteId = Integer.parseInt(lineMatcher.group(2));
 int localId = Integer.parseInt(lineMatcher.group(3));{code}
 This UID does seem very high to me but it has worked successfully on a 
 MapR-FS NFS share and stores files created with that UID over NFS.
 The UID / GID mappings for the HDFS NFS gateway will need to be switched to 
 using Long to accommodate this, I've attached a patch for the parsing and 
 UID/GID HashMaps.
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7564) NFS gateway dynamically reload UID/GID mapping file /etc/nfs.map

2014-12-22 Thread Hari Sekhon (JIRA)
Hari Sekhon created HDFS-7564:
-

 Summary: NFS gateway dynamically reload UID/GID mapping file 
/etc/nfs.map
 Key: HDFS-7564
 URL: https://issues.apache.org/jira/browse/HDFS-7564
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor


Add dynamic reload of the NFS gateway UID/GID mappings file /etc/nfs.map 
(default for static.id.mapping.file).

It seems that this is currently only loaded upon restart of the NFS gateway, 
which would cause active clients to hang or fail.
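
A rough sketch of one possible approach (hypothetical, not existing gateway 
code; the path and scheduling are assumptions): re-read the mapping file only 
when its modification time changes, so clients never see a gateway restart.
{code}
import java.io.File;

public class StaticMapReloader {
    private final File mapFile = new File("/etc/nfs.map");
    private long lastLoadedMtime = -1;

    // intended to be called periodically, e.g. from a scheduled executor
    public synchronized void reloadIfChanged() {
        long mtime = mapFile.lastModified();
        if (mtime != 0 && mtime != lastLoadedMtime) {
            lastLoadedMtime = mtime;
            parseStaticMap();   // re-read and swap in the new UID/GID mappings
        }
    }

    private void parseStaticMap() {
        // placeholder: the real parsing lives in ShellBasedIdMapping.parseStaticMap
        System.out.println("reloading " + mapFile);
    }

    public static void main(String[] args) {
        new StaticMapReloader().reloadIfChanged();
    }
}
{code}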

Regards,

Hari Sekhon
http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7563) NFS gateway parseStaticMap NumberFormatException

2014-12-22 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-7563:
--
Attachment: (was: UID_GID_Long_HashMaps.patch)

 NFS gateway parseStaticMap NumberFormatException
 

 Key: HDFS-7563
 URL: https://issues.apache.org/jira/browse/HDFS-7563
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Assignee: Aaron T. Myers

 When using the new NFS UID mapping for the HDFS NFS gateway I've discovered 
 that my Windows 7 workstation at this bank is passing UID number 4294xx 
 but entering this in the /etc/nfs.map in order to remap that to a Hadoop UID 
 prevents the NFS gateway from restarting with the error message:
 {code}Exception in thread main java.lang.NumberFormatException: For input 
 string: 4294xx
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:495)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.parseStaticMap(ShellBasedIdMapping.java:318)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.updateMaps(ShellBasedIdMapping.java:229)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.init(ShellBasedIdMapping.java:91)
 at 
 org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3.init(RpcProgramNfs3.java:176)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.init(Nfs3.java:45)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:66)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:72)
 {code}
 The /etc/nfs.map file simply contains
 {code}
 uid 4294xx 1
 {code}
 It seems that the code at 
 {code}hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ShellBasedIdMapping.java{code}
 is expecting an integer at line 318 of the parseStaticMap method: {code}int 
 remoteId = Integer.parseInt(lineMatcher.group(2));
 int localId = Integer.parseInt(lineMatcher.group(3));{code}
 This UID does seem very high to me but it has worked successfully on a 
 MapR-FS NFS share and stores files created with that UID over NFS.
 The UID / GID mappings for the HDFS NFS gateway will need to be switched to 
 using Long to accommodate this, I've attached a patch for the parsing and 
 UID/GID HashMaps.
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7563) NFS gateway parseStaticMap NumberFormatException

2014-12-22 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated HDFS-7563:
--
Attachment: UID_GID_Long_HashMaps.patch

 NFS gateway parseStaticMap NumberFormatException
 

 Key: HDFS-7563
 URL: https://issues.apache.org/jira/browse/HDFS-7563
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: nfs
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Assignee: Aaron T. Myers
 Attachments: UID_GID_Long_HashMaps.patch


 When using the new NFS UID mapping for the HDFS NFS gateway I've discovered 
 that my Windows 7 workstation at this bank is passing UID number 4294xx 
 but entering this in the /etc/nfs.map in order to remap that to a Hadoop UID 
 prevents the NFS gateway from restarting with the error message:
 {code}Exception in thread main java.lang.NumberFormatException: For input 
 string: 4294xx
 at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:495)
 at java.lang.Integer.parseInt(Integer.java:527)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.parseStaticMap(ShellBasedIdMapping.java:318)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.updateMaps(ShellBasedIdMapping.java:229)
 at 
 org.apache.hadoop.security.ShellBasedIdMapping.init(ShellBasedIdMapping.java:91)
 at 
 org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3.init(RpcProgramNfs3.java:176)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.init(Nfs3.java:45)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:66)
 at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:72)
 {code}
 The /etc/nfs.map file simply contains
 {code}
 uid 4294xx 1
 {code}
 It seems that the code at 
 {code}hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/ShellBasedIdMapping.java{code}
 is expecting an integer at line 318 of the parseStaticMap method: {code}int 
 remoteId = Integer.parseInt(lineMatcher.group(2));
 int localId = Integer.parseInt(lineMatcher.group(3));{code}
 This UID does seem very high to me but it has worked successfully on a 
 MapR-FS NFS share and stores files created with that UID over NFS.
 The UID / GID mappings for the HDFS NFS gateway will need to be switched to 
 using Long to accommodate this, I've attached a patch for the parsing and 
 UID/GID HashMaps.
 Regards,
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

