[
https://issues.apache.org/jira/browse/HDFS-6114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061895#comment-14061895
]
Vinayakumar B commented on HDFS-6114:
-------------------------------------
bq. I don't really see a good reason to separate delBlockInfo and
delNewBlockInfo. It seems like this could just lead to scenarios where we think
we're deleting a block but it pops back up (because we deleted, but did not
delete new)
Here, both operate on different sets. {{delBlockInfo}} is used in some other
places as well, e.g. while updating the scan time and re-sorting the
{{blockInfoSet}}. {{delNewBlockInfo}} only needs to be called while deleting
the block itself, as intermediate updates never happen on this set's data.
So {{delBlockInfo}} and {{delNewBlockInfo}} serve separate purposes and both
are required.
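For illustration, a minimal sketch of the two delete paths acting on the two
different sets, as they would sit inside {{BlockPoolSliceScanner}} (this is my
simplification, not the actual patch code; field shapes and bodies are
assumptions):
{code:java}
// Sketch only -- simplified, not the actual patch code.
// blockInfoSet: blocks participating in scan scheduling, sorted by lastScanTime.
// newBlockInfoSet: temporary holding area for blocks added since the last merge.
private final TreeSet<BlockScanInfo> blockInfoSet =
    new TreeSet<BlockScanInfo>(BlockScanInfo.LAST_SCAN_TIME_COMPARATOR);
private final Set<BlockScanInfo> newBlockInfoSet = new HashSet<BlockScanInfo>();

// Also called on scan-time updates: the entry is removed and later re-inserted
// with the new lastScanTime so blockInfoSet stays correctly sorted.
private void delBlockInfo(BlockScanInfo info) {
  blockInfoSet.remove(info);
}

// Called only when the block itself is deleted; scan-time updates never
// touch this set, so a block deleted here cannot "pop back up".
private void delNewBlockInfo(BlockScanInfo info) {
  newBlockInfoSet.remove(info);
}
{code}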
bq. I guess maybe it makes sense to separate addBlockInfo from addNewBlockInfo,
just because there are places in the setup code where we're willing to add
stuff directly to blockInfoSet. Even in that case, I would argue it might be
easier to call addNewBlockInfo and then later roll all the newBlockInfoSet
items into blockInfoSet. The problem is that having both functions creates
confusion and increases the chance that someone will add an incorrect call to
the wrong one later on in another change.
As I see it, both these methods are private and act on different sets. Since
the method name itself suggests that {{addNewBlockInfo}} is only for new
blocks, I don't see any confusion here.
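Again just a sketch of the corresponding add pair, plus the roll-up you
describe that merges new entries into the main set ({{rollNewBlocksInfo}} is a
hypothetical name here, not necessarily what the patch uses):
{code:java}
// Sketch of the add pair (simplified). New blocks go only into
// newBlockInfoSet; the scanner thread later merges them in one place.
private void addBlockInfo(BlockScanInfo info) {
  blockInfoSet.add(info);
}

private void addNewBlockInfo(BlockScanInfo info) {
  newBlockInfoSet.add(info);
}

// Hypothetical roll-up, run once per scan iteration: moves everything
// from the holding area into the sorted set used for scheduling.
private void rollNewBlocksInfo() {
  for (BlockScanInfo info : newBlockInfoSet) {
    blockInfoSet.add(info);
  }
  newBlockInfoSet.clear();
}
{code}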
bq. It seems like a bad idea to use BlockScanInfo.LAST_SCAN_TIME_COMPARATOR for
blockInfoSet, but BlockScanInfo#hashCode (i.e. the HashSet strategy) for
newBlockInfoSet. Let's just use a SortedSet for both so we don't have to ponder
any possible discrepancies between the comparator and the hash function.
{{blockInfoSet}} is required to be sorted on the lastScanTime, as the
oldest-scanned block will be picked for scanning, and that block will always
be the first element of this set. {{BlockScanInfo.LAST_SCAN_TIME_COMPARATOR}}
is used because the default {{BlockScanInfo#hashCode()}} and ordering
(inherited from {{Block}}) are based on the blockId rather than the scan time.
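To illustrate why that ordering matters (assumes the Hadoop classes are on
the classpath; not standalone code):
{code:java}
// A TreeSet ordered by lastScanTime keeps the longest-unscanned block first.
TreeSet<BlockScanInfo> blockInfoSet =
    new TreeSet<BlockScanInfo>(BlockScanInfo.LAST_SCAN_TIME_COMPARATOR);
// ... entries are re-inserted whenever their lastScanTime changes ...
BlockScanInfo oldestScanned = blockInfoSet.first(); // next block to scan
{code}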
Do you suggest that I update this {{hashCode()}} itself?
bq. Another problem with HashSet (compared with TreeSet) is that it never
shrinks down after enlarging... a bad property for a temporary holding area
Yes, I agree with this; I will update it in the next patch.
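For reference, a small self-contained demo of that property (it peeks at
OpenJDK internals via reflection, so it is a diagnostic sketch only; on JDK 9+
run with {{--add-opens java.base/java.util=ALL-UNNAMED}}):
{code:java}
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.HashSet;

public class HashSetShrinkDemo {
  // Reads the length of the HashMap bucket array backing a HashSet.
  static int tableLength(HashSet<?> set) throws Exception {
    Field mapField = HashSet.class.getDeclaredField("map");
    mapField.setAccessible(true);
    HashMap<?, ?> map = (HashMap<?, ?>) mapField.get(set);
    Field tableField = HashMap.class.getDeclaredField("table");
    tableField.setAccessible(true);
    Object[] table = (Object[]) tableField.get(map);
    return table == null ? 0 : table.length;
  }

  public static void main(String[] args) throws Exception {
    HashSet<Integer> set = new HashSet<Integer>();
    for (int i = 0; i < 1000000; i++) {
      set.add(i);
    }
    System.out.println("capacity after fill : " + tableLength(set));
    set.clear(); // removes all entries...
    System.out.println("capacity after clear: " + tableLength(set)); // ...capacity unchanged
  }
}
{code}
The backing array only grows on insert; {{clear()}} nulls out the buckets but
keeps the capacity, whereas a TreeSet frees its per-entry nodes as they are
removed, which is friendlier for a temporary holding area.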
> Block Scan log rolling will never happen if blocks written continuously
> leading to huge size of dncp_block_verification.log.curr
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-6114
> URL: https://issues.apache.org/jira/browse/HDFS-6114
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.3.0, 2.4.0
> Reporter: Vinayakumar B
> Assignee: Vinayakumar B
> Priority: Critical
> Attachments: HDFS-6114.patch, HDFS-6114.patch
>
>
> 1. {{BlockPoolSliceScanner#scan()}} will not return until all the blocks are
> scanned.
> 2. If blocks (several MBs in size) are written to the datanode continuously,
> then one iteration of {{BlockPoolSliceScanner#scan()}} will be continuously
> scanning the blocks.
> 3. These blocks will be deleted after some time (enough for them to get
> scanned).
> 4. As block scanning is throttled, verification of all blocks will take a
> long time.
> 5. Rolling will never happen, so even though the total number of blocks in
> the datanode doesn't increase, the entries (which include stale entries for
> deleted blocks) in *dncp_block_verification.log.curr* continuously increase,
> leading to a huge file size.
> In one of our environments, it grew to more than 1 TB while the total number
> of blocks was only ~45k.
--
This message was sent by Atlassian JIRA
(v6.2#6252)