[jira] [Comment Edited] (HDFS-11888) Ozone: SCM: use container state machine for open containers allocated for key/blcoks

Xiaoyu Yao (JIRA) Tue, 22 Aug 2017 11:38:26 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-11888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137172#comment-16137172
 ]


Xiaoyu Yao edited comment on HDFS-11888 at 8/22/17 6:37 PM:
------------------------------------------------------------

Thanks [~anu] for the detailed review. I've fixed all the comments except the 
second one which may not be an error.

bq. BlockContainerInfo#addUsed – Should we rename this ? Since it is possible 
to delete blocks? And may be changeUsed – so you can pass a negative arg via 
size or just add one more function.

Add subtractUsed() as suggested. We will use it when the delete key/block work 
is integrated. 

bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Container {} allocated 
by block service can't be found in SCM", containerName); Should we log this as 
an Error? I like the fact that we can continue, but may it is a more 
significant issue?
Not fixed: This may not be an error. Added TODO to be fixed in the next patch. 
This could happen when the allocated container failed to be created in a timely 
manner and got cleaned up by SCM due to timeout. In that case, we actually 
should remove those from the allocated container DB of block manager. Plan to 
do the clean up work when working on the next patch to clean up the containers 
in CREATING state that times out. 

bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Failed loading open 
container, continue next...");This error message seems misleading, same with 
messages below where it talks about open containers. We are opening all 
Allocated containers now, right?
Fixed.

bq. BlockManagerImpl#updateContainer nit: There is a commented out if 
statement, which I think you can remove.
Fixed.

bq. BlockManagerImpl#updateContainer nit: can we please a comment here, that we 
are relying on the lock in the allocateBlock function to make sure that 
contianers map remains consistent.
Comments added.

bq. BlockManagerImpl#allocateBlock Line 325: Can we please add a logging 
statement here so that we know why the IOException happened ? or return the ex 
to the user via making SCMException a wrapper to the IOException we got? I just 
want some way to see the original exception.
Fixed.

bq. containerManager.getContainer – Can this function ever return null ? More 
of a question. We seem to use this with possible return of null and assuming 
the return will be valid.
Good catch. Fixed by adding null check and trace.

bq. BlockManagerImpl#deleteBlock – Do we need to update the ContainerInfo 
usedSize ?
This will be handled with the key/block delete integration. the usedSize 
subtraction will be called when the async thread actually finish reclaiming the 
space on the datanodes.

bq. ContainerInfo#setState Instead of Time.now() can we please use 
this.stateEnterTime = Time.monotonicNow(). I am just worried that in places wit 
Daylight Saving you will see the clock wind backwards.
Fixed.

bq. ContainerMapping- ctor while it is very clear for me when I read code 
(probably because I did the code review for the state machine) it might be a 
good idea to write some comments about the state machine. Especially some 
comments on final states and transitions. Understanding this state machine is 
going to be vital in understanding container management in SCM. Even over 
commenting this would not hurt us.
Good idea. Comments added.

bq. ContianerMapping#allocateContainer Line 231 Replace Time.now 
==>.setStateEnterTime(Time.monotonicNow())
Fixed.

bq. ContianerMapping#deleteContainer – Is it possible we need any changes to 
state during this call ? Does the state move to deleted ? Should this call fire 
an event ? More of question than an assertion.
This API is mainly used by client drive container deletion. Like the 
implementation in ContainerOperationClient#deleteContainer(), the client will 
always delete the container on DN first and then delete container on SCM with 
this API. So we don't need the two phase DELETING process. In the case of SCM 
driven container deletion due to creation timeout, we will need the DELETING 
state and I plan to handle it in a separate JIRA. 




was (Author: xyao):
Thanks [~anu] for the detailed review. 

bq. BlockContainerInfo#addUsed – Should we rename this ? Since it is possible 
to delete blocks? And may be changeUsed – so you can pass a negative arg via 
size or just add one more function.

Add subtractUsed() as suggested. We will use it when the delete key/block work 
is integrated. 

bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Container {} allocated 
by block service can't be found in SCM", containerName); Should we log this as 
an Error? I like the fact that we can continue, but may it is a more 
significant issue?
This may not be an error. Added TODO to be fixed in the next patch. This could 
happen when the allocated container failed to be created in a timely manner and 
got cleaned up by SCM due to timeout. In that case, we actually should remove 
those from the allocated container DB of block manager. Plan to do the clean up 
work when working on the next patch to clean up the containers in CREATING 
state that times out. 

bq. BlockManagerImpl#loadAllocatedContainers LOG.warn("Failed loading open 
container, continue next...");This error message seems misleading, same with 
messages below where it talks about open containers. We are opening all 
Allocated containers now, right?
Fixed.

bq. BlockManagerImpl#updateContainer nit: There is a commented out if 
statement, which I think you can remove.
Fixed.

bq. BlockManagerImpl#updateContainer nit: can we please a comment here, that we 
are relying on the lock in the allocateBlock function to make sure that 
contianers map remains consistent.
Comments added.

bq. BlockManagerImpl#allocateBlock Line 325: Can we please add a logging 
statement here so that we know why the IOException happened ? or return the ex 
to the user via making SCMException a wrapper to the IOException we got? I just 
want some way to see the original exception.
Fixed.

bq. containerManager.getContainer – Can this function ever return null ? More 
of a question. We seem to use this with possible return of null and assuming 
the return will be valid.
Good catch. Fixed by adding null check and trace.

bq. BlockManagerImpl#deleteBlock – Do we need to update the ContainerInfo 
usedSize ?
This will be handled with the key/block delete integration. the usedSize 
subtraction will be called when the async thread actually finish reclaiming the 
space on the datanodes.

bq. ContainerInfo#setState Instead of Time.now() can we please use 
this.stateEnterTime = Time.monotonicNow(). I am just worried that in places wit 
Daylight Saving you will see the clock wind backwards.
Fixed.

bq. ContainerMapping- ctor while it is very clear for me when I read code 
(probably because I did the code review for the state machine) it might be a 
good idea to write some comments about the state machine. Especially some 
comments on final states and transitions. Understanding this state machine is 
going to be vital in understanding container management in SCM. Even over 
commenting this would not hurt us.
Good idea. Comments added.

bq. ContianerMapping#allocateContainer Line 231 Replace Time.now 
==>.setStateEnterTime(Time.monotonicNow())
Fixed.

bq. ContianerMapping#deleteContainer – Is it possible we need any changes to 
state during this call ? Does the state move to deleted ? Should this call fire 
an event ? More of question than an assertion.
This API is mainly used by client drive container deletion. Like the 
implementation in ContainerOperationClient#deleteContainer(), the client will 
always delete the container on DN first and then delete container on SCM with 
this API. So we don't need the two phase DELETING process. In the case of SCM 
driven container deletion due to creation timeout, we will need the DELETING 
state and I plan to handle it in a separate JIRA. 



> Ozone: SCM: use container state machine for open containers allocated for 
> key/blcoks 
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-11888
>                 URL: https://issues.apache.org/jira/browse/HDFS-11888
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>    Affects Versions: HDFS-7240
>            Reporter: Xiaoyu Yao
>            Assignee: Xiaoyu Yao
>         Attachments: HDFS-11888-HDFS-7240.001.patch
>
>
> SCM BlockManager provision a pool of containers upon block creation request 
> that can't be satisfied with current open containers in the pool. However, 
> only one container is returned with the creationFlag to the client. The other 
> container provisioned in the same batch will not have this flag. Client can't 
> assume to use these containers that has not been created on SCM datanodes, 
> This ticket is opened to fix the issue by persist the createContainerNeeded 
> flag for the provisioned containers. The flag will be eventually cleared by 
> processing container report from datanode when the container report handler 
> is fully implemented on SCM. 
> For now, we will use a default batch size of 1 for 
> ozone.scm.container.provision_batch_size so that the container will be 
> created on demand upon the first block allocation into the container. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (HDFS-11888) Ozone: SCM: use container state machine for open containers allocated for key/blcoks

Reply via email to