[ 
https://issues.apache.org/jira/browse/HDDS-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779157#comment-16779157
 ] 

Elek, Marton commented on HDDS-1084:
------------------------------------

Thank you very much, [~swagle], for uploading the design docs. They look very 
promising. I think this will be a very important (if not the most important) and 
useful part of the Ozone stack.

I have a few comments (not proposals, just brainstorming ideas):

1. I am a big fan of 
[copysets|https://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf] and 
[tiered 
replication|https://www.usenix.org/system/files/conference/atc15/atc15-paper-cidon.pdf].
At a basic level, they make it possible to estimate the probability of data 
loss from the distinct datanode sets in use (e.g. container 1 is replicated to 
the datanode set d1,d2,d3 and container 2 to d3,d4,d5) and from the number of 
containers/data sets.

We already discussed with [~anu] and [~nandakumar131] how these findings can be 
used to replicate closed containers in a safer way. I think the Recon server 
could also analyze these questions (long-term).

E.g.: "3 independent node failures will cause data loss with 90% probability 
on this cluster"

Or: "any of the 3 racks can be turned off without any data loss"
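
To make the idea concrete, here is a minimal sketch (plain Python, not Ozone or 
Recon code) of how such statements could be derived from container placement 
data. The function names, the placement map and the rack map are all made up 
for illustration, and the brute-force enumeration only scales to small 
examples:

```python
from itertools import combinations
from math import comb


def distinct_copysets(placement):
    """Return the distinct datanode sets that hold all replicas of a container."""
    return {frozenset(nodes) for nodes in placement.values()}


def loss_probability(placement, all_nodes, failures):
    """Probability that `failures` simultaneous, uniformly random node failures
    take out every replica of at least one container (brute force)."""
    copysets = distinct_copysets(placement)
    nodes = sorted(all_nodes)
    fatal = 0
    for failed in combinations(nodes, failures):
        failed = set(failed)
        # Data is lost if some copyset lies entirely inside the failed set.
        if any(cs <= failed for cs in copysets):
            fatal += 1
    return fatal / comb(len(nodes), failures)


def racks_safe_to_lose(placement, rack_of):
    """Racks whose complete outage still leaves each container one replica."""
    racks = set(rack_of.values())
    return {
        rack
        for rack in racks
        if all(
            any(rack_of[node] != rack for node in nodes)
            for nodes in placement.values()
        )
    }


# Hypothetical cluster: two containers placed as in the example above.
placement = {
    "container1": {"d1", "d2", "d3"},
    "container2": {"d3", "d4", "d5"},
}
all_nodes = {"d1", "d2", "d3", "d4", "d5", "d6"}
rack_of = {"d1": "r1", "d2": "r2", "d3": "r3",
           "d4": "r1", "d5": "r2", "d6": "r3"}

p = loss_probability(placement, all_nodes, failures=3)
print(f"3 random node failures lose data with probability {p:.0%}")
print("racks that can be turned off:",
      sorted(racks_safe_to_lose(placement, rack_of)))
```

With a replication factor of 3 and exactly 3 failures, the probability reduces 
to (number of distinct copysets) / C(N, 3), which is why fewer distinct 
datanode sets mean a lower chance of data loss.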

2. I saw a demo of the Ceph UI. It worked very well with Grafana dashboards 
embedded in the HTML UI. We already have some Grafana dashboard definitions 
in hadoop-ozone/dist/src/main/compose/common/grafana which display the metrics 
from Prometheus.

  a.) Until a full-featured Ozone Console is implemented, this seems to be an 
easy way to display any data from the Recon DB.  
  b.) Later it could be easy to adopt the existing dashboards in an Ozone UI. 
The easiest way to provide powerful statistics is to embed Grafana (IMHO).

3. Similar to Grafana, I would expect at least one Prometheus instance to be 
deployed together with Ozone (in production). We have native Prometheus support 
(we have Prometheus metric endpoints and all the Hadoop metrics can be saved to 
Prometheus). We can use it as an (optional) source of additional data (e.g. to 
detect unreliable datanodes and propose changes). This is not required for the 
existing queries in the design doc but it can be considered in the future.
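
As a rough illustration of using Prometheus as a data source, the sketch below 
(plain Python, not Recon code) filters the JSON body returned by the Prometheus 
HTTP API endpoint /api/v1/query for an instant-vector query. The metric name in 
QUERY, the threshold and the sample instances are made up; only the response 
layout follows the Prometheus API:

```python
import json

# Hypothetical PromQL expression; this metric name is invented for the sketch.
QUERY = "rate(datanode_failed_volume_checks_total[1h])"


def unreliable_datanodes(response, threshold, label="instance"):
    """Return {datanode: value} for every series whose value exceeds threshold.

    `response` is the decoded JSON body of an instant query sent to
    Prometheus' /api/v1/query endpoint.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector result")
    flagged = {}
    for series in data["result"]:
        # Each sample is [unix_timestamp, "value as string"].
        _timestamp, raw_value = series["value"]
        value = float(raw_value)
        if value > threshold:
            flagged[series["metric"].get(label, "<unknown>")] = value
    return flagged


# Made-up response in the shape the Prometheus API returns for QUERY.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"instance": "dn1:9882"}, "value": [1551200000.0, "0.02"]},
      {"metric": {"instance": "dn2:9882"}, "value": [1551200000.0, "0.9"]}
    ]
  }
}
""")

print(unreliable_datanodes(sample, threshold=0.5))
```

In a real deployment the response would come from an HTTP GET against the 
Prometheus server with the expression passed as the `query` parameter, and the 
flagged datanodes could feed the "propose changes" step.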


But these are just ideas; nothing needs to be done as of now. Thanks again for 
the work on this great feature.

> Ozone Recon Service
> -------------------
>
>                 Key: HDDS-1084
>                 URL: https://issues.apache.org/jira/browse/HDDS-1084
>             Project: Hadoop Distributed Data Store
>          Issue Type: New Feature
>          Components: fsck
>    Affects Versions: 0.4.0
>            Reporter: Siddharth Wagle
>            Assignee: Siddharth Wagle
>            Priority: Major
>         Attachments: Ozone_Recon_Design_V1_Draft.pdf
>
>
> Recon Server at a high level will maintain a global view of Ozone that is not 
> available from SCM or OM. Things like how many volumes exist; and how many 
> buckets exist per volume; which volume has maximum buckets; which are buckets 
> that have not been accessed for a year, which are the corrupt blocks, which 
> are blocks on data nodes which are not used; and answer similar queries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
