[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Mackrory updated HADOOP-14041:
-----------------------------------
Attachment: HADOOP-14041-HADOOP-13345.006.patch

Thanks for the reviews, all - good stuff. The problems [~fabbri] saw boil down to 2 things, one of which I fixed: I had not tested this with anything being inferred from an S3 path, and I wasn't trying to parse and use that the way the other commands do. That is now fixed and added to the tests. The other thing is that generic options appear not to be parsed (which does indeed seem wrong - according to the docs, if you implement Tool you should get that for free, and we do), but the behavior wouldn't be what you expect anyway, because the table config is set based on the -m flag or the S3 path you provide. I think the CLI behavior is badly defined here in general, so I've filed HADOOP-14094 to really rethink what options are exposed and how.

I like [~ste...@apache.org]'s recommendation to just throw the IOException. My thinking was that if there's an issue deleting one row, we can keep retrying the others. But an exception that affects one row and not subsequent ones is probably unlikely, so it's worth bubbling that up so we know about the problem.

Removing that block also highlighted that my batching logic was bad: instead of processing complete batches inside the loop and processing whatever was left over afterwards, I was effectively processing whatever contents the batch had at the end of each iteration. That's been fixed, and I verified the number of events was correct with several hundred objects getting pruned.

On a related note, I also changed the log message to INFO and had it count items and report the batch size rather than just the number of batches. Without that, the last message you get out of the box on the CLI is that the metastore has been initialized, which is misleading.
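For anyone following along, the batching bug above is a common shape. A minimal sketch of the corrected pattern (hypothetical names, not the actual patch code): flush only when a batch fills inside the loop, then flush the remainder exactly once afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchPruner {

    // Collects keys into batches of at most batchSize. Full batches are
    // flushed inside the loop; the partial remainder is flushed once, after
    // the loop - rather than flushing whatever happens to be in the batch
    // at the end of every iteration.
    static List<List<String>> splitIntoBatches(List<String> keys, int batchSize) {
        List<List<String>> flushed = new ArrayList<>();
        List<String> batch = new ArrayList<>(batchSize);
        for (String key : keys) {
            batch.add(key);
            if (batch.size() == batchSize) {
                flushed.add(new ArrayList<>(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            flushed.add(new ArrayList<>(batch));
        }
        return flushed;
    }

    public static void main(String[] args) {
        List<List<String>> batches = splitIntoBatches(
            Arrays.asList("a", "b", "c", "d", "e", "f", "g"), 3);
        // 7 keys with batch size 3 -> batches of 3, 3, 1
        System.out.println(batches.size());
    }
}
```

With 7 keys and a batch size of 3 this produces three batches (3, 3, 1), which is also the count you'd want an INFO log line to report.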
It will now log when the metadata store connection has been initialized and then finish by logging how many items were deleted and what the batch size was. I think that's friendlier, and probably something we want to do more of for the other commands if/when we rethink the interface.

> CLI command to prune old metadata
> ---------------------------------
>
> Key: HADOOP-14041
> URL: https://issues.apache.org/jira/browse/HADOOP-14041
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Sean Mackrory
> Assignee: Sean Mackrory
> Attachments: HADOOP-14041-HADOOP-13345.001.patch,
> HADOOP-14041-HADOOP-13345.002.patch, HADOOP-14041-HADOOP-13345.003.patch,
> HADOOP-14041-HADOOP-13345.004.patch, HADOOP-14041-HADOOP-13345.005.patch,
> HADOOP-14041-HADOOP-13345.006.patch
>
> Add a CLI command that allows users to specify an age at which to prune
> metadata that hasn't been modified for an extended period of time. Since the
> primary use-case targeted at the moment is list consistency, it would make
> sense (especially when authoritative=false) to prune metadata that is
> expected to have become consistent a long time ago.
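The prune-by-age semantics described in the issue boil down to a single cutoff check. A minimal sketch, with hypothetical names (this is not the actual S3Guard code): an entry is prunable when its modification time is older than now minus the requested age.

```java
public class PruneCutoff {

    // Hypothetical predicate: an entry whose modification time falls before
    // (now - maxAge) has not been modified for the requested period and is
    // eligible for pruning. All times are epoch milliseconds.
    static boolean isPrunable(long modTimeMs, long nowMs, long maxAgeMs) {
        return modTimeMs < nowMs - maxAgeMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneDayMs = 24L * 60 * 60 * 1000;
        // An entry modified two days ago is prunable with a one-day max age.
        System.out.println(isPrunable(now - 2 * oneDayMs, now, oneDayMs));
    }
}
```

A prune command would then scan the metadata store and batch-delete every entry for which this predicate holds.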