[
https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325389#comment-15325389
]
Anu Engineer edited comment on HDFS-7240 at 6/10/16 10:22 PM:
--------------------------------------------------------------
*Ozone meeting notes – June 9th, 2016*
Attendees: ??Thomas Demoor, Arpit Agarwal, JV Jujjuri, Jing Zhao, Andrew Wang,
Lei Xu, Aaron Myers, Colin McCabe, Aaron Fabbri, Lars Francke, Sijie Guo,
Stiwari, Anu Engineer??
We started the discussion with how Erasure coding will be supported in ozone.
This was quite a lengthy discussion taking over half the meeting time. Jing
Zhao explained the high-level architecture and pointed to similar work done by
Dropbox.
We then dove into the details of this problem, since we wanted to make sure that
supporting erasure coding will be easy and efficient in ozone.
Here are the major points:
SCM currently supports a simple replicated container. To support erasure
coding, SCM will have to return more than 3 machines; for example, with a 6 + 3
erasure coding model a container is spread across nine machines. Once we modify
SCM to support this model, the container client will have to write data to
those locations and update the RAFT state with the metadata of this block.
When a file read happens in ozone, the container client will go to KSM/SCM to
find out which container to read the metadata from. The metadata will tell the
client where the actual data resides, and the client will reconstruct the data
from the EC-coded blocks.
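A similarly hypothetical sketch of the read path described above; the lookup and reconstruction methods are stand-ins, not the real KSM/SCM or container client APIs:
{code:java}
import java.util.Arrays;
import java.util.List;

public class EcReadPathSketch {

  /** Step 1: KSM/SCM maps a key to the container holding its metadata. */
  static String lookupContainer(String volume, String bucket, String key) {
    return "container-42";  // stand-in for the KSM/SCM lookup
  }

  /** Step 2: the container metadata lists where the EC stripes live. */
  static List<String> blockLocationsFromMetadata(String containerId) {
    return Arrays.asList("dn-0", "dn-1", "dn-2", "dn-3", "dn-4",
                         "dn-5", "dn-6", "dn-7", "dn-8");
  }

  /** Step 3: read enough stripes (any 6 of 9 for a 6 + 3 code) and decode. */
  static byte[] reconstruct(List<String> locations) {
    // A real client would fetch data/parity chunks and run the EC decoder;
    // this placeholder just returns an empty payload.
    return new byte[0];
  }

  public static void main(String[] args) {
    String container = lookupContainer("vol1", "bucket1", "key1");
    List<String> locations = blockLocationsFromMetadata(container);
    byte[] data = reconstruct(locations);
    System.out.println("Reconstructed " + data.length + " bytes from "
        + locations.size() + " locations in " + container);
  }
}
{code}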
We all agreed that getting EC done for ozone is an important goal, and to get
to that objective, we will need to get the SCM and KSM done first.
We also discussed how small files will cause an issue with EC, especially since
a container would pack lots of them together, and how this would lead to
requiring compaction after deletes.
Eddy brought up the issue of making sure that data is spread evenly across the
cluster. Currently our plan is to maintain a list of machines based on
container reports. The container reports would contain the number of keys, bytes
stored, and number of accesses to that container. Based on this, SCM would be
able to maintain a list that allows it to pick under-utilized machines from the
cluster, thus ensuring a good data spread. Andrew Wang pointed out that counting
I/O requests is not good enough and that we actually need the number of bytes
read/written. That is an excellent suggestion; we will modify container reports
to carry this information and use it in SCM's allocation decisions.
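As a rough illustration, a container report shaped along these lines could let SCM rank datanodes by load; the class and field names here are assumptions, not the real report format:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ContainerReportSketch {

  /** One datanode's report entry; fields mirror the discussion above. */
  static final class ContainerReport {
    final String datanode;
    final long keyCount;
    final long bytesStored;
    final long bytesRead;     // per Andrew's point: request counts alone
    final long bytesWritten;  // are not enough, track bytes moved too

    ContainerReport(String datanode, long keyCount, long bytesStored,
                    long bytesRead, long bytesWritten) {
      this.datanode = datanode;
      this.keyCount = keyCount;
      this.bytesStored = bytesStored;
      this.bytesRead = bytesRead;
      this.bytesWritten = bytesWritten;
    }

    long ioLoad() { return bytesRead + bytesWritten; }
  }

  /** SCM keeps the least-utilized datanodes at the front of its list. */
  static List<ContainerReport> rankForAllocation(List<ContainerReport> reports) {
    List<ContainerReport> ranked = new ArrayList<>(reports);
    ranked.sort(Comparator.comparingLong(ContainerReport::ioLoad)
        .thenComparingLong(r -> r.bytesStored));
    return ranked;
  }

  public static void main(String[] args) {
    List<ContainerReport> reports = Arrays.asList(
        new ContainerReport("dn-1", 1000, 50L << 30, 10L << 30, 5L << 30),
        new ContainerReport("dn-2", 200, 10L << 30, 1L << 30, 1L << 30));
    // dn-2 has the lightest load, so SCM would prefer it for new containers.
    System.out.println("Allocate on: " + rankForAllocation(reports).get(0).datanode);
  }
}
{code}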
Eddy followed up with a question about how something like Hive would behave over
ozone. Say Hive creates a bucket, creates lots of tables, and after the work is
done deletes all the tables. Ozone would have allocated containers to
accommodate the overflowing bucket, so it is possible to have many empty
containers on an ozone cluster.
SCM is free to delete any container that does not have a key. This is because
in the ozone world, metadata exists inside a container. Therefore, if a
container is empty, we know that no objects (ozone volume, bucket, or key)
exist in that container. This gives SCM the freedom to delete any empty
container, and this is how containers would be removed in the ozone world.
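A small sketch of that cleanup rule, again with hypothetical names rather than the real SCM code:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyContainerCleanupSketch {

  static final class ContainerInfo {
    final String id;
    final long keyCount;
    ContainerInfo(String id, long keyCount) {
      this.id = id;
      this.keyCount = keyCount;
    }
  }

  /** Returns the containers SCM may reclaim: exactly those with no keys. */
  static List<ContainerInfo> findDeletable(List<ContainerInfo> containers) {
    List<ContainerInfo> deletable = new ArrayList<>();
    for (ContainerInfo c : containers) {
      if (c.keyCount == 0) {  // empty => no volume/bucket/key metadata inside
        deletable.add(c);
      }
    }
    return deletable;
  }

  public static void main(String[] args) {
    List<ContainerInfo> containers = Arrays.asList(
        new ContainerInfo("c-1", 120), new ContainerInfo("c-2", 0));
    System.out.println(findDeletable(containers).size()
        + " container(s) eligible for deletion");
  }
}
{code}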
Andrew Wang pointed out that it is possible to create thousands of volumes and
map them to a similar number of containers. He was worried that this would
become a scalability bottleneck. While this is possible, in reality if you have
a cluster with only volumes, KSM is free to map many ozone volumes to a single
container. We agreed that if this indeed becomes a problem, we can write a
simple compaction tool for KSM which will move all these volumes to a few
containers. Then SCM's container deletion would kick in and clean up the cluster.
We iterated through all the scenarios for merge and concluded that for v1,
ozone can live without needing to support merges of containers.
Then Eddy pointed out that by switching from hash partitions to range
partitions we have introduced variability in the list operations for a
container. Since the reason for switching to range partitions is not documented
on JIRA, we discussed the issue that caused the switch.
The original design called for hash partitioning, with operations like list
relying on a secondary index. This would create an eventual consistency model
where you might create a key, but it is visible in the namespace only after the
secondary index is updated. Colin argued that it is easier for our users to see
consistent namespace operations. This is the core reason why we moved to range
partitions.
However, range partitions do pose the issue that a bucket might be split across
a large number of containers, so the list operation does not have fixed-time
guarantees. The worst-case scenario is a bucket with thousands of 5 GB objects,
which internally causes the bucket to be mapped over a set of containers. This
implies that a list operation might have to read sequentially from many
containers to build the list.
We discussed many solutions to this problem:
• In the original design, we had proposed a separate metadata container
and data container. We can follow the same model, with the assumption that the
data container and metadata container are on the same machine. Both Andrew and
Thomas seemed to think that is a good idea.
• Anu argued that this may not be an issue, since the datanode (front
end) would be able to cache much of this information and pre-fetch lists, given
that listing is a forward iteration.
• Arpit pointed out that while this is an issue we need to tackle, we
should build the system, measure, and choose the appropriate solution based on
data.
• In an offline conversation after the call, Jitendra pointed out that
this will not have any performance impact: since each split point is well known
in KSM, it is trivial to add hints / caching in the KSM layer itself to address
this issue. In other words, if the client wants 1000 keys and KSM's hint tells
us we need to reach out to 3 containers to get that many keys, we can issue
parallel reads to all of those containers (see the sketch after this list).
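Here is a minimal sketch of that parallel-list idea, assuming KSM can return the set of containers covering a bucket's key range; all of the interfaces shown are hypothetical:
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelListSketch {

  /** Hypothetical KSM hint: containers whose key ranges cover this listing. */
  static List<String> containersForRange(String bucket, String startKey, int maxKeys) {
    return Arrays.asList("container-7", "container-8", "container-9");
  }

  /** Stand-in for a per-container list call served by a datanode. */
  static List<String> listKeysInContainer(String containerId) {
    return Arrays.asList(containerId + "/key-a", containerId + "/key-b");
  }

  /** Issue the per-container reads in parallel instead of sequentially. */
  static List<String> listBucket(String bucket, String startKey, int maxKeys)
      throws InterruptedException, ExecutionException {
    List<String> containers = containersForRange(bucket, startKey, maxKeys);
    ExecutorService pool = Executors.newFixedThreadPool(containers.size());
    try {
      List<Future<List<String>>> futures = new ArrayList<>();
      for (String container : containers) {
        Callable<List<String>> task = () -> listKeysInContainer(container);
        futures.add(pool.submit(task));
      }
      List<String> keys = new ArrayList<>();
      for (Future<List<String>> f : futures) {
        // Containers are range-ordered, so concatenating keeps keys sorted.
        keys.addAll(f.get());
      }
      return keys.size() > maxKeys ? keys.subList(0, maxKeys) : keys;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(listBucket("bucket1", "", 1000));
  }
}
{code}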
While we agree that this is an issue we might have to tackle eventually in the
ozone world, we were not able to converge on an exact solution since we ran out
of time at this point.
ATM mentioned that we would benefit from getting together and doing some
whiteboarding of ozone's design, and we intend to do that soon.
This was a very productive discussion and I want to thank all participants. It
was a pleasure talking to all of you.
Please feel free to add/edit these notes for completeness or corrections.
> Object store in HDFS
> --------------------
>
> Key: HDFS-7240
> URL: https://issues.apache.org/jira/browse/HDFS-7240
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Jitendra Nath Pandey
> Assignee: Jitendra Nath Pandey
> Attachments: Ozone-architecture-v1.pdf, Ozonedesignupdate.pdf,
> ozone_user_v0.pdf
>
>
> This jira proposes to add object store capabilities into HDFS.
> As part of the federation work (HDFS-1052) we separated block storage as a
> generic storage layer. Using the Block Pool abstraction, new kinds of
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.