[
https://issues.apache.org/jira/browse/HDFS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249132#comment-14249132
]
Colin Patrick McCabe commented on HDFS-7430:
--------------------------------------------
bq. getVolumeStats and VolumeScanner#getStatistics can be @VisibleForTesting
ok
bq. IIUC the getUnitTestLong is for the INTERNAL_ config keys, but it's not
being used for MAX_STALENESS_MS. This one is also seemingly not set by any unit
tests, so maybe can be removed.
Yeah, we should use {{getUnitTestLong}} here. Added. I would prefer to keep
this here in case we ever wanted to expose it, or if we added a test for it
later. The unit tests does test different staleness settings, but using the
setter directly at the moment.
bq. Would prefer Precondition checks that abort with a message rather than
Math.max'ing with 0. Users should know when the have an invalid config, maybe
they typoed a minus sign rather than wanting a value of zero. This isn't
necessary for the INTERNAL_ ones though, since they're only set by tests.
So, we already have a convention going on that setting
{{dfs.datanode.scan.period.hours}} to -1 indicates that the block scanner
should be disabled. It's not my favorite convention, but I don't want to break
user configs over a bikeshed issue. I would also prefer to handle all config
keys uniformly, rather than throwing an exception if some were negative, but
letting others go negative. I think the likelihood of accidentally setting a
negative value is pretty small... I would expect most mistakes to be leaving
out a zero, or just not setting the key at all, etc.
bq. Again a comment about locking would be nice. I see direct usage of
FSVolumes which I think is safe, but dunno really.
[~eddyxu] and I have been discussing this in HDFS-7496. Once that JIRA is
complete, FsVolume instances will be reference-counted, ensuring that we never
perform an operation on a volume that has already been removed. That fits in
nicely with the current code, because the volume removal code instructs the
volume scanner to shut down.
bq. Various races that lead to FNF too, which are already described in
ScanResultHandler.
Yeah, these already exist in the current block scanner code. Eventually maybe
we can do a spin-wait loop to make sure that we can distinguish real block FNF
from race-condition-induced FNF. But I think that's something we should do in
a follow-up... this patch is big enough :)
bq. ScanResultHandler can be package-private
ok
bq. getStatistics and getVolume can be @VisibleForTesting
{{getStatistics}}, sure. {{getVolume}} is called by {{BlockScanner}}
bq. I notice some Time.now() being used to calculate an interval in
findNextUsableBlockIter, can this be monotonicNow instead?
The iterator start time is saved in the cursor file, and it is a wall-clock
time, not a monotonic time. This is necessary because if we shut down the
datanode and reboot, we want to be able to pick up in our scan where we left
off. But monotonic time resets every time we reboot. I added a comment in
{{findNextUsableBlockIter}} to clarify this.
bq. scanBlock, the first series of try/catch, is this going to lead to lots of
log prints? I believe passing in the exception will print a stack trace. I
noticed ScanResultHandler quashes a lot of these.
Yeah, let's not print the stack trace... it's not interesting here. Also, I
will log a message when {{FsVolumeSpi#getStoredBlock}} returns null instead of
throwing {{FileNotFoundException}} (apparently it's allowed to do that too :P )
bq. Seems like a mismatch here, since this is subtracting the average num bytes
scanned per minute, not per second
Yikes. Yeah, we need the average bytes per *second* here, not the average
bytes per *minute*, which is what we'd get from just dividing the bytes in the
last hour by 60. Fixed.
bq. if (neededBytesPerSec <= conf.thresholdBytesPerSec) { ... should be just <
Yeah, let's use < instead. This will also allow setting the threshold to 0 to
get continuous scan.
bq. Comment seems unrelated.
So, actually the comment is related... we're taking one {{ExtendedBlock}}
structure, which always has 0 for its genstamp, and looking up another
{{ExtendedBlock}} structure, which will have a valid genstamp. It's
frustrating that we don't have a more expressive type system that could express
something like "this struct has poolId and blockId, but not genstamp." Instead
we've got Block.java which has id / genstamp, and ExtendedBlock.java, which has
all that plus block pool id.
I will see if I can come up with a better comment in the code.
bq. Doesn't BlockIterator fit better in BlockPoolSlice, since it's tied to a
block pool?
{{BlockPoolSlice}} is internal to {{FsDatasetImpl}}. It's not a generic class,
whereas the volume scanner is intended to be able to work on any volume, not
just ones from {{FsVolumeImpl}}.
bq. Implementing Iterator and Iterable might make a lot of sense here. For
instance, hasNext is more canonical than eof.
I did consider that, but I don't think it's really useful. It's not like you'd
want to put this in a "for" loop or something. {{Iterator}} is kind of
annoying because it has a {{remove}} method (which I certainly don't want to
support here), and because {{hasNext}} is supposed to return something valid
even if you haven't called {{next}} yet. This is annoying. Another annoying
thing about {{hasNext}} is that someone could call {{hasNext}}, wait a long
time, and then call {{next}}. Then I'd have to return "something" even if all
blocks had been removed from the volume. It's a contract I don't really want
to bind myself to here.
bq. Rather than hand-rolled serde, could we use Jackson to do JSON? I think
it'll save some lines of code too.
Hmm. Let me think about that. I think it's pretty readable as-is, both in the
code and in the saved file. The serialization code is VERY short right now and
I like having a greppable file with one key per line. If I saved as JSON, I'd
want to at least pretty-print it so that we had no more than one key per line
as well. And there would be some complexity around ensuring that only
appropriate values got saved. I don't think the current format is really that
problematic, and if we want to add a field later that is a list or map, we can
always make just that line contain JSON. Another nice thing about the current
format is that it fits in with how we do {{VERSION}}.
bq. Finally, have you considered implementing this as an iterator-of-iterators?
The complexity in this code comes from advancing curDir and curSubDir as
appropriate, but we could encapsulate the logic for each in its own iterator,
e.g. BlockIterator -> SubdirIterator -> SubdirIterator -> SubdirBlockIterator.
Serializing would be saving the current iter position of each. Right now this
is coded in a very C-style with checking nulls.
My gut feeling is that splitting the code up would just make it longer and
probably not any less repetitious... Instead of checking whether
{{getNextCurFinalizedSubDir}} returned {{null}}, you'd check if
{{finalizedSubDirIter#next}} returned {{null}}. And all of the "encapsulated"
iterators would need references to the "encapsulating" iterators, in order to
build a path. I think it would get kind of messy.
bq. Class javadoc for BlockIterator is a bit incorrect, it returns entries from
a block pool, not the entire volume.
Fixed
bq. Unused imports
Removed
one last set of comments to address...
> Refactor the BlockScanner to use O(1) memory and use multiple threads
> ---------------------------------------------------------------------
>
> Key: HDFS-7430
> URL: https://issues.apache.org/jira/browse/HDFS-7430
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 2.7.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-7430.002.patch, HDFS-7430.003.patch,
> HDFS-7430.004.patch, HDFS-7430.005.patch, memory.png
>
>
> We should update the BlockScanner to use a constant amount of memory by
> keeping track of what block was scanned last, rather than by tracking the
> scan status of all blocks in memory. Also, instead of having just one
> thread, we should have a verification thread per hard disk (or other volume),
> scanning at a configurable rate of bytes per second.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)