[https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
Sean Mackrory updated HADOOP-14041:
-----------------------------------
Attachment: HADOOP-14041-HADOOP-13345.006.patch
Thanks for the reviews, all - good stuff.
The problems [~fabbri] saw boil down to two things, one of which I've fixed: I
hadn't tested the command with the metastore configuration inferred from an S3
path, and I wasn't parsing and using that path the way the other commands do.
That is now fixed and covered by the tests. The other thing is that generic
options don't appear to be parsed (which does seem wrong - according to the
docs, implementing Tool should give you that for free, and we do implement it),
but the behavior wouldn't be what you expect anyway, because the command sets
the table config based on the -m flag or the S3 path you provide. I think the
CLI behavior is poorly defined here in general, so I've filed HADOOP-14094 to
rethink which options are exposed and how.
I like [[email protected]]'s recommendation to just throw the IOException. My
original thinking was that if deleting one row fails, we can keep retrying the
others. But an exception that affects one row and not subsequent ones is
unlikely, so it's worth bubbling it up so we know about the problem. Removing
that block also highlighted that my batching logic was wrong: instead of
flushing complete batches inside the loop and flushing whatever was left over
afterwards, I was effectively flushing whatever the batch happened to contain
at the end of every iteration. That's been fixed, and I verified the number of
delete events was correct with several hundred objects getting pruned.
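For illustration, the corrected batching pattern looks roughly like the sketch
below. This is not the actual S3Guard code - the class name, batch size, and
counters are all hypothetical - it just shows the shape of the fix: flush only
when a batch fills up inside the loop, then flush the partial remainder once
after the loop ends.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PruneBatchSketch {
    static final int BATCH_SIZE = 25;   // illustrative batch size
    static int batchesFlushed = 0;
    static int itemsDeleted = 0;

    // Flush one batch of keys; a real implementation would issue a
    // batched delete against the metadata store here.
    static void flush(List<String> batch) throws IOException {
        if (batch.isEmpty()) {
            return;
        }
        itemsDeleted += batch.size();
        batchesFlushed++;
        batch.clear();
    }

    public static void main(String[] args) throws IOException {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 103; i++) {     // pretend 103 entries expired
            batch.add("key-" + i);
            if (batch.size() == BATCH_SIZE) {
                flush(batch);               // complete batch: flush inside the loop
            }
        }
        flush(batch);                       // leftover partial batch: flush once after
        System.out.println(itemsDeleted + " items deleted in "
            + batchesFlushed + " batches");
    }
}
```

The buggy version flushed at the bottom of every iteration regardless of batch
size, which is why the event counts looked off until a few hundred objects were
pruned in testing.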
On a related note, I also changed the log message to INFO and had it count
items and report the batch size rather than just the number of batches. Without
that, the last message you get out of the box on the CLI is that the metastore
has been initialized, which is misleading. It will now log when the metadata
store connection has been initialized and then finish by logging how many items
were deleted and what the batch size was. I think that's friendlier, and
probably something we want to do more of for the other commands if/when we
rethink the interface.
> CLI command to prune old metadata
> ---------------------------------
>
> Key: HADOOP-14041
> URL: https://issues.apache.org/jira/browse/HADOOP-14041
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Sean Mackrory
> Assignee: Sean Mackrory
> Attachments: HADOOP-14041-HADOOP-13345.001.patch,
> HADOOP-14041-HADOOP-13345.002.patch, HADOOP-14041-HADOOP-13345.003.patch,
> HADOOP-14041-HADOOP-13345.004.patch, HADOOP-14041-HADOOP-13345.005.patch,
> HADOOP-14041-HADOOP-13345.006.patch
>
>
> Add a CLI command that allows users to specify an age at which to prune
> metadata that hasn't been modified for an extended period of time. Since the
> primary use-case targeted at the moment is list consistency, it would make
> sense (especially when authoritative=false) to prune metadata that is
> expected to have become consistent a long time ago.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)