[
https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325389#comment-15325389
]
Anu Engineer edited comment on HDFS-7240 at 6/10/16 10:22 PM:
--------------------------------------------------------------
*Ozone meeting notes – June 9th, 2016*
Attendees: ??Thomas Demoor, Arpit Agarwal, JV Jujjuri, Jing Zhao, Andrew Wang,
Lei Xu, Aaron Myers, Colin McCabe, Aaron Fabbri, Lars Francke, Sijie Guo,
Stiwari, Anu Engineer??
We started the discussion with how Erasure coding will be supported in ozone.
This was quite a lengthy discussion taking over half the meeting time. Jing
Zhao explained the high-level architecture and pointed to similar work done by
Dropbox.
We then dove into the details of this problem, since we wanted to make sure that
supporting erasure coding will be easy and efficient in ozone.
Here are the major points:
SCM currently supports a simple replicated container. To support erasure
coding, SCM will have to return more than 3 machines; for example, with a 6 + 3
erasure coding model a container is spread across nine machines. Once we modify
SCM to support this model, the container client will have to write data to
those locations and update the RAFT state with the metadata of this block.
When a file read happens in ozone, the container client will go to KSM/SCM to
find out which container to read the metadata from. The metadata will tell the
client where the actual data resides, and the client will reconstruct the data
from the EC-coded blocks.
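A similarly hypothetical sketch of the read path described above; the lookup and reconstruction methods are stand-ins, not the real KSM/SCM or container client APIs:
{code:java}
import java.util.Arrays;
import java.util.List;

public class EcReadPathSketch {

  /** Step 1: KSM/SCM maps a key to the container holding its metadata. */
  static String lookupContainer(String volume, String bucket, String key) {
    return "container-42";  // stand-in for the KSM/SCM lookup
  }

  /** Step 2: the container metadata lists where the EC stripes live. */
  static List<String> blockLocationsFromMetadata(String containerId) {
    return Arrays.asList("dn-0", "dn-1", "dn-2", "dn-3", "dn-4",
                         "dn-5", "dn-6", "dn-7", "dn-8");
  }

  /** Step 3: read enough stripes (any 6 of 9 for a 6 + 3 code) and decode. */
  static byte[] reconstruct(List<String> locations) {
    // A real client would fetch data/parity chunks and run the EC decoder;
    // this placeholder just returns an empty payload.
    return new byte[0];
  }

  public static void main(String[] args) {
    String container = lookupContainer("vol1", "bucket1", "key1");
    List<String> locations = blockLocationsFromMetadata(container);
    byte[] data = reconstruct(locations);
    System.out.println("Reconstructed " + data.length + " bytes from "
        + locations.size() + " locations in " + container);
  }
}
{code}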
We all agreed that getting EC done for ozone is an important goal, and to get
to that objective, we will need to get the SCM and KSM done first.
We also discussed how small files will cause an issue with EC, especially since
a container would pack lots of them together, and how this would lead to
requiring compaction after deletes.
Eddy brought up the issue of making sure that data is spread evenly across the
cluster. Currently our plan is to maintain a list of machines based on
container reports. The container reports would contain the number of keys, bytes
stored, and number of accesses to that container. Based on this, SCM would be
able to maintain a list that allows it to pick under-utilized machines from the
cluster, thus ensuring a good data spread. Andrew Wang pointed out that counting
I/O requests is not good enough and that we actually need the number of bytes
read/written. That is an excellent suggestion; we will modify container reports
to carry this information and use it in SCM's allocation decisions.
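As a rough illustration, a container report shaped along these lines could let SCM rank datanodes by load; the class and field names here are assumptions, not the real report format:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ContainerReportSketch {

  /** One datanode's report entry; fields mirror the discussion above. */
  static final class ContainerReport {
    final String datanode;
    final long keyCount;
    final long bytesStored;
    final long bytesRead;     // per Andrew's point: request counts alone
    final long bytesWritten;  // are not enough, track bytes moved too

    ContainerReport(String datanode, long keyCount, long bytesStored,
                    long bytesRead, long bytesWritten) {
      this.datanode = datanode;
      this.keyCount = keyCount;
      this.bytesStored = bytesStored;
      this.bytesRead = bytesRead;
      this.bytesWritten = bytesWritten;
    }

    long ioLoad() { return bytesRead + bytesWritten; }
  }

  /** SCM keeps the least-utilized datanodes at the front of its list. */
  static List<ContainerReport> rankForAllocation(List<ContainerReport> reports) {
    List<ContainerReport> ranked = new ArrayList<>(reports);
    ranked.sort(Comparator.comparingLong(ContainerReport::ioLoad)
        .thenComparingLong(r -> r.bytesStored));
    return ranked;
  }

  public static void main(String[] args) {
    List<ContainerReport> reports = Arrays.asList(
        new ContainerReport("dn-1", 1000, 50L << 30, 10L << 30, 5L << 30),
        new ContainerReport("dn-2", 200, 10L << 30, 1L << 30, 1L << 30));
    // dn-2 has the lightest load, so SCM would prefer it for new containers.
    System.out.println("Allocate on: " + rankForAllocation(reports).get(0).datanode);
  }
}
{code}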
Eddy followed up with a question about how something like Hive would behave over
ozone. Say Hive creates a bucket, creates lots of tables, and after the work is
done deletes all the tables. Ozone would have allocated containers to
accommodate the overflowing bucket, so it is possible to have many empty
containers on an ozone cluster.
SCM is free to delete any container that does not have a key. This is because
in the ozone world, metadata exists inside a container. Therefore, if a
container is empty, we know that no objects (ozone volume, bucket, or key)
exist in that container. This gives SCM the freedom to delete any empty
container, and this is how containers would be removed in the ozone world.
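A small sketch of that cleanup rule, again with hypothetical names rather than the real SCM code:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyContainerCleanupSketch {

  static final class ContainerInfo {
    final String id;
    final long keyCount;
    ContainerInfo(String id, long keyCount) {
      this.id = id;
      this.keyCount = keyCount;
    }
  }

  /** Returns the containers SCM may reclaim: exactly those with no keys. */
  static List<ContainerInfo> findDeletable(List<ContainerInfo> containers) {
    List<ContainerInfo> deletable = new ArrayList<>();
    for (ContainerInfo c : containers) {
      if (c.keyCount == 0) {  // empty => no volume/bucket/key metadata inside
        deletable.add(c);
      }
    }
    return deletable;
  }

  public static void main(String[] args) {
    List<ContainerInfo> containers = Arrays.asList(
        new ContainerInfo("c-1", 120), new ContainerInfo("c-2", 0));
    System.out.println(findDeletable(containers).size()
        + " container(s) eligible for deletion");
  }
}
{code}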
Andrew Wang pointed out that it is possible to create thousands of volumes and
map them to a similar number of containers. He was worried that this would
become a scalability bottleneck. While this is possible, in reality if you have
a cluster with only volumes, KSM is free to map many ozone volumes to a single
container. We agreed that if this indeed becomes a problem, we can write a
simple compaction tool for KSM which will move all these volumes to a few
containers. Then SCM's container deletion would kick in and clean up the cluster.
We iterated through all the scenarios for merge and concluded that for v1,
ozone can live without needing to support merges of containers.
Then Eddy pointed out that by switching from hash partitions to range
partitions we have introduced variability in the list operations for a
container. Since the reason for switching to range partitions is not documented
on JIRA, we discussed the issue that caused the switch.
The original design called for hash partitioning, with operations like list
relying on a secondary index. This would create an eventual consistency model
where you might create a key, but it is visible in the namespace only after the
secondary index is updated. Colin argued that it is easier for our users to see
consistent namespace operations. This is the core reason why we moved to range
partitions.
However, range partitions do pose the issue that a bucket might be split across
a large number of containers, so the list operation does not have fixed-time
guarantees. The worst-case scenario is a bucket with thousands of 5 GB objects,
which internally causes the bucket to be mapped over a set of containers. This
implies that a list operation might have to read sequentially from many
containers to build the list.
We discussed many solutions to this problem:
• In the original design, we had proposed a separate metadata container
and data container. We can follow the same model, with the assumption that the
data container and metadata container are on the same machine. Both Andrew and
Thomas seemed to think that is a good idea.
• Anu argued that this may not be an issue, since the datanode (front
end) would be able to cache much of this information and pre-fetch lists, given
that listing is a forward iteration.
• Arpit pointed out that while this is an issue we need to tackle, we
should build the system, measure, and choose the appropriate solution based on
data.
• In an offline conversation after the call, Jitendra pointed out that
this will not have any performance impact: since each split point is well known
in KSM, it is trivial to add hints / caching in the KSM layer itself to address
this issue. In other words, if the client wants 1000 keys and KSM's hint tells
us we need to reach out to 3 containers to get that many keys, we can issue
parallel reads to all of those containers (see the sketch after this list).
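Here is a minimal sketch of that parallel-list idea, assuming KSM can return the set of containers covering a bucket's key range; all of the interfaces shown are hypothetical:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelListSketch {

  /** Hypothetical KSM hint: containers whose key ranges cover this listing. */
  static List<String> containersForRange(String bucket, String startKey, int maxKeys) {
    return Arrays.asList("container-7", "container-8", "container-9");
  }

  /** Stand-in for a per-container list call served by a datanode. */
  static List<String> listKeysInContainer(String containerId) {
    return Arrays.asList(containerId + "/key-a", containerId + "/key-b");
  }

  /** Issue the per-container reads in parallel instead of sequentially. */
  static List<String> listBucket(String bucket, String startKey, int maxKeys)
      throws InterruptedException, ExecutionException {
    List<String> containers = containersForRange(bucket, startKey, maxKeys);
    ExecutorService pool = Executors.newFixedThreadPool(containers.size());
    try {
      List<Future<List<String>>> futures = new ArrayList<>();
      for (String container : containers) {
        Callable<List<String>> task = () -> listKeysInContainer(container);
        futures.add(pool.submit(task));
      }
      List<String> keys = new ArrayList<>();
      for (Future<List<String>> f : futures) {
        // Containers are range-ordered, so concatenating keeps keys sorted.
        keys.addAll(f.get());
      }
      return keys.size() > maxKeys ? keys.subList(0, maxKeys) : keys;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(listBucket("bucket1", "", 1000));
  }
}
{code}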
While we agree that this is an issue we might have to tackle eventually in the
ozone world, we were not able to converge on an exact solution since we ran out
of time at this point.
ATM mentioned that we would benefit from getting together and doing some
whiteboarding of ozone's design, and we intend to do that soon.
This was a very productive discussion and I want to thank all participants. It
was a pleasure talking to all of you.
Please feel free to add/edit these notes for completeness or corrections.
> Object store in HDFS
> --------------------
>
> Key: HDFS-7240
> URL: https://issues.apache.org/jira/browse/HDFS-7240
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Jitendra Nath Pandey
> Assignee: Jitendra Nath Pandey
> Attachments: Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf,
> ozone_user_v0.pdf
>
>
> This jira proposes to add object store capabilities into HDFS.
> As part of the federation work (HDFS-1052) we separated block storage as a
> generic storage layer. Using the Block Pool abstraction, new kinds of
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.