[
https://issues.apache.org/jira/browse/HDFS-11888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137172#comment-16137172
]
Xiaoyu Yao edited comment on HDFS-11888 at 8/22/17 6:37 PM:
------------------------------------------------------------
Thanks [~anu] for the detailed review. I've fixed all the comments except the
second one which may not be an error.
bq. BlockContainerInfo#addUsed – Should we rename this ? Since it is possible
to delete blocks? And may be changeUsed – so you can pass a negative arg via
size or just add one more function.
Add subtractUsed() as suggested. We will use it when the delete key/block work
is integrated.
bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Container {} allocated
by block service can't be found in SCM", containerName); Should we log this as
an Error? I like the fact that we can continue, but may it is a more
significant issue?
Not fixed: This may not be an error. Added TODO to be fixed in the next patch.
This could happen when the allocated container failed to be created in a timely
manner and got cleaned up by SCM due to timeout. In that case, we actually
should remove those from the allocated container DB of block manager. Plan to
do the clean up work when working on the next patch to clean up the containers
in CREATING state that times out.
bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Failed loading open
container, continue next...");This error message seems misleading, same with
messages below where it talks about open containers. We are opening all
Allocated containers now, right?
Fixed.
bq. BlockManagerImpl#updateContainer nit: There is a commented out if
statement, which I think you can remove.
Fixed.
bq. BlockManagerImpl#updateContainer nit: can we please a comment here, that we
are relying on the lock in the allocateBlock function to make sure that
contianers map remains consistent.
Comments added.
bq. BlockManagerImpl#allocateBlock Line 325: Can we please add a logging
statement here so that we know why the IOException happened ? or return the ex
to the user via making SCMException a wrapper to the IOException we got? I just
want some way to see the original exception.
Fixed.
bq. containerManager.getContainer – Can this function ever return null ? More
of a question. We seem to use this with possible return of null and assuming
the return will be valid.
Good catch. Fixed by adding null check and trace.
bq. BlockManagerImpl#deleteBlock – Do we need to update the ContainerInfo
usedSize ?
This will be handled with the key/block delete integration. the usedSize
subtraction will be called when the async thread actually finish reclaiming the
space on the datanodes.
bq. ContainerInfo#setState Instead of Time.now() can we please use
this.stateEnterTime = Time.monotonicNow(). I am just worried that in places wit
Daylight Saving you will see the clock wind backwards.
Fixed.
bq. ContainerMapping- ctor while it is very clear for me when I read code
(probably because I did the code review for the state machine) it might be a
good idea to write some comments about the state machine. Especially some
comments on final states and transitions. Understanding this state machine is
going to be vital in understanding container management in SCM. Even over
commenting this would not hurt us.
Good idea. Comments added.
bq. ContianerMapping#allocateContainer Line 231 Replace Time.now
==>.setStateEnterTime(Time.monotonicNow())
Fixed.
bq. ContianerMapping#deleteContainer – Is it possible we need any changes to
state during this call ? Does the state move to deleted ? Should this call fire
an event ? More of question than an assertion.
This API is mainly used by client drive container deletion. Like the
implementation in ContainerOperationClient#deleteContainer(), the client will
always delete the container on DN first and then delete container on SCM with
this API. So we don't need the two phase DELETING process. In the case of SCM
driven container deletion due to creation timeout, we will need the DELETING
state and I plan to handle it in a separate JIRA.
was (Author: xyao):
Thanks [~anu] for the detailed review.
bq. BlockContainerInfo#addUsed – Should we rename this ? Since it is possible
to delete blocks? And may be changeUsed – so you can pass a negative arg via
size or just add one more function.
Add subtractUsed() as suggested. We will use it when the delete key/block work
is integrated.
bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Container {} allocated
by block service can't be found in SCM", containerName); Should we log this as
an Error? I like the fact that we can continue, but may it is a more
significant issue?
This may not be an error. Added TODO to be fixed in the next patch. This could
happen when the allocated container failed to be created in a timely manner and
got cleaned up by SCM due to timeout. In that case, we actually should remove
those from the allocated container DB of block manager. Plan to do the clean up
work when working on the next patch to clean up the containers in CREATING
state that times out.
bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Failed loading open
container, continue next...");This error message seems misleading, same with
messages below where it talks about open containers. We are opening all
Allocated containers now, right?
Fixed.
bq. BlockManagerImpl#updateContainer nit: There is a commented out if
statement, which I think you can remove.
Fixed.
bq. BlockManagerImpl#updateContainer nit: can we please a comment here, that we
are relying on the lock in the allocateBlock function to make sure that
contianers map remains consistent.
Comments added.
bq. BlockManagerImpl#allocateBlock Line 325: Can we please add a logging
statement here so that we know why the IOException happened ? or return the ex
to the user via making SCMException a wrapper to the IOException we got? I just
want some way to see the original exception.
Fixed.
bq. containerManager.getContainer – Can this function ever return null ? More
of a question. We seem to use this with possible return of null and assuming
the return will be valid.
Good catch. Fixed by adding null check and trace.
bq. BlockManagerImpl#deleteBlock – Do we need to update the ContainerInfo
usedSize ?
This will be handled with the key/block delete integration. the usedSize
subtraction will be called when the async thread actually finish reclaiming the
space on the datanodes.
bq. ContainerInfo#setState Instead of Time.now() can we please use
this.stateEnterTime = Time.monotonicNow(). I am just worried that in places wit
Daylight Saving you will see the clock wind backwards.
Fixed.
bq. ContainerMapping- ctor while it is very clear for me when I read code
(probably because I did the code review for the state machine) it might be a
good idea to write some comments about the state machine. Especially some
comments on final states and transitions. Understanding this state machine is
going to be vital in understanding container management in SCM. Even over
commenting this would not hurt us.
Good idea. Comments added.
bq. ContianerMapping#allocateContainer Line 231 Replace Time.now
==>.setStateEnterTime(Time.monotonicNow())
Fixed.
bq. ContianerMapping#deleteContainer – Is it possible we need any changes to
state during this call ? Does the state move to deleted ? Should this call fire
an event ? More of question than an assertion.
This API is mainly used by client drive container deletion. Like the
implementation in ContainerOperationClient#deleteContainer(), the client will
always delete the container on DN first and then delete container on SCM with
this API. So we don't need the two phase DELETING process. In the case of SCM
driven container deletion due to creation timeout, we will need the DELETING
state and I plan to handle it in a separate JIRA.
> Ozone: SCM: use container state machine for open containers allocated for
> key/blcoks
> -------------------------------------------------------------------------------------
>
> Key: HDFS-11888
> URL: https://issues.apache.org/jira/browse/HDFS-11888
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ozone
> Affects Versions: HDFS-7240
> Reporter: Xiaoyu Yao
> Assignee: Xiaoyu Yao
> Attachments: HDFS-11888-HDFS-7240.001.patch
>
>
> SCM BlockManager provision a pool of containers upon block creation request
> that can't be satisfied with current open containers in the pool. However,
> only one container is returned with the creationFlag to the client. The other
> container provisioned in the same batch will not have this flag. Client can't
> assume to use these containers that has not been created on SCM datanodes,
> This ticket is opened to fix the issue by persist the createContainerNeeded
> flag for the provisioned containers. The flag will be eventually cleared by
> processing container report from datanode when the container report handler
> is fully implemented on SCM.
> For now, we will use a default batch size of 1 for
> ozone.scm.container.provision_batch_size so that the container will be
> created on demand upon the first block allocation into the container.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]