[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Fabbri updated HADOOP-14041: -- Resolution: Fixed Status: Resolved (was: Patch Available) Committed to HADOOP-13345 feature branch. Thank you for the great work on this [~mackrorysd] and thanks [~ste...@apache.org] for the feedback on exceptions. > CLI command to prune old metadata > - > > Key: HADOOP-14041 > URL: https://issues.apache.org/jira/browse/HADOOP-14041 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3 >Reporter: Sean Mackrory >Assignee: Sean Mackrory > Attachments: HADOOP-14041-HADOOP-13345.001.patch, > HADOOP-14041-HADOOP-13345.002.patch, HADOOP-14041-HADOOP-13345.003.patch, > HADOOP-14041-HADOOP-13345.004.patch, HADOOP-14041-HADOOP-13345.005.patch, > HADOOP-14041-HADOOP-13345.006.patch, HADOOP-14041-HADOOP-13345.007.patch, > HADOOP-14041-HADOOP-13345.008.patch, HADOOP-14041-HADOOP-13345.009.patch > > > Add a CLI command that allows users to specify an age at which to prune > metadata that hasn't been modified for an extended period of time. Since the > primary use-case targeted at the moment is list consistency, it would make > sense (especially when authoritative=false) to prune metadata that is > expected to have become consistent a long time ago. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Fabbri updated HADOOP-14041: -- Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Fabbri updated HADOOP-14041: -- Status: Open (was: Patch Available)
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Fabbri updated HADOOP-14041: -- Attachment: HADOOP-14041-HADOOP-13345.009.patch

v9 patch:
- Rebase on the latest feature branch after the DDB throttling and listStatus fixes were committed.
- Change IOException to InterruptedIOException in DDB prune().
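The message does not say where in prune() the exception type changed. One plausible place, and this is purely an assumption, is a pause between batches that gets interrupted. A minimal sketch of that translation follows; the class name `InterruptibleSleep` and the message text are hypothetical and are not taken from the patch.

```java
import java.io.InterruptedIOException;

/** Hypothetical helper, not the code from the patch. */
final class InterruptibleSleep {
  private InterruptibleSleep() {}

  /** Sleeps, translating an interrupt into InterruptedIOException (a subclass of IOException). */
  static void sleep(long millis) throws InterruptedIOException {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      // Restore the interrupt flag so callers further up can still observe it.
      Thread.currentThread().interrupt();
      InterruptedIOException ioe = new InterruptedIOException("Interrupted while pruning");
      ioe.initCause(e);
      throw ioe;
    }
  }
}
```

Because InterruptedIOException extends IOException, callers that already declare `throws IOException` need no signature change, while callers that care can catch the narrower type.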
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Fabbri updated HADOOP-14041: -- Attachment: HADOOP-14041-HADOOP-13345.008.patch
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Attachment: HADOOP-14041-HADOOP-13345.007.patch
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Attachment: HADOOP-14041-HADOOP-13345.006.patch

Thanks for the reviews, all - good stuff.

The problems [~fabbri] saw boil down to 2 things, one of which I fixed: I had not tested this with anything being inferred from an S3 path, and I wasn't trying to parse and use that like the other commands. That is now fixed and added to the tests. The other thing is that it appears not to be parsing generic options (which does indeed seem wrong - according to the docs, if you implement Tool you should get that for free - and we do), but the behavior wouldn't be what you expect anyway because it will set the table config based on the -m flag or the S3 path you provide. I think the CLI behavior is badly defined here in general, so I've filed HADOOP-14094 to really rethink what options are exposed and how.

I like [~ste...@apache.org]'s recommendation to just throw the IOException. My original thinking was that if there's an issue deleting one row, we can keep retrying the others. But an exception that affects one row and not the subsequent ones is probably unlikely, so it's worth bubbling that up so we know about the problem.

Also, removing that block highlighted that my batching logic was bad: instead of processing complete batches inside the loop and processing whatever is left over afterwards, I was effectively always processing whatever contents the batch had at the end of each iteration. That's been fixed, and I tested that the number of events was correct with several hundred objects getting pruned.

On a related note, I also changed the log message to INFO and had it count items and report batch size rather than just the number of batches. Without that, the last message you get out-of-the-box on the CLI is that the metastore has been initialized, which is misleading. It will now log when the metadatastore connection has been initialized and then finish off by logging how many items were deleted and what the batch size was. I think that's more friendly, and probably something we want to do more of for the other commands if / when we rethink the interface.
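The corrected batching pattern described in this message (flush only complete batches inside the loop, then flush the leftover once afterwards) can be sketched roughly as below. This is an illustration under stated assumptions, not the patch itself: `BatchingSketch`, `processInBatches`, and the `flush` callback are hypothetical names, and the real code deletes DynamoDB rows rather than calling a generic consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical illustration of the batching fix, not the code from the patch. */
final class BatchingSketch {
  private BatchingSketch() {}

  /** Hands items to flush in groups of batchSize and returns the number of items flushed. */
  static <T> int processInBatches(Iterable<T> items, int batchSize, Consumer<List<T>> flush) {
    if (batchSize < 1) {
      throw new IllegalArgumentException("batchSize must be at least 1");
    }
    List<T> batch = new ArrayList<>(batchSize);
    int count = 0;
    for (T item : items) {
      batch.add(item);
      if (batch.size() == batchSize) {
        // Flush inside the loop only once the batch is complete.
        flush.accept(new ArrayList<>(batch));
        count += batch.size();
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      // Whatever is left over forms one final, partial batch.
      flush.accept(new ArrayList<>(batch));
      count += batch.size();
    }
    return count;
  }
}
```

Returning the item count alongside a known batch size is enough to produce the kind of closing log line the message describes (items deleted plus batch size).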
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Attachment: HADOOP-14041-HADOOP-13345.005.patch
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Status: Patch Available (was: Open)
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Attachment: HADOOP-14041-HADOOP-13345.004.patch

{quote}Minor nit: I would make sleep happen between batches (not before the first).{quote}

I went with sleeping before the first batch as well because the first batch immediately follows a large request, so it would be appropriate to pause for a breath anyway. The 25ms delay is entirely arbitrary - if other folks have opinions on a better default, I'd love to have some reasoning for what we go with. If anything I suspect it should probably be longer.

Attaching a patch that addresses all other feedback from [~fabbri].
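To make the ordering under discussion concrete, here is a small sketch in which the pause precedes every batch, the first included. Everything here is hypothetical scaffolding: `ThrottledBatches`, the injected `pause` callback, and the returned trace exist only so the ordering is easy to inspect. Only the 25 ms figure comes from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

/** Hypothetical illustration of "sleep before every batch", not the code from the patch. */
final class ThrottledBatches {
  /** The arbitrary default delay mentioned in the thread, in milliseconds. */
  static final long DELAY_MS = 25;

  private ThrottledBatches() {}

  /**
   * Pauses and then handles each batch in turn, returning a trace of the steps taken.
   * The pause is injected so callers can pass a real sleep or a no-op.
   */
  static List<String> run(List<List<String>> batches, LongConsumer pause) {
    List<String> trace = new ArrayList<>();
    for (List<String> batch : batches) {
      // The pause comes first, so it also separates the first batch from the preceding query.
      pause.accept(DELAY_MS);
      trace.add("sleep");
      trace.add("delete " + batch.size());
    }
    return trace;
  }
}
```

The reviewer's alternative would skip the pause on the first iteration only; the trade-off is one extra delay per prune in exchange for breathing room after the initial large request.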
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Attachment: HADOOP-14041-HADOOP-13345.003.patch

Okay - let's punt on directory pruning for now. Adding mod_time to directories or addressing the other concerns you've raised would be a big enough deal to be a separate issue, I think. So I'll file JIRAs to address that if we go with the approach I just attached. In this patch I remove the directory logic, but add back a test to confirm that directories are not getting removed yet (with a comment about why we check for that behavior).

I also got a request to make the default age configurable (e.g. in .xml files) so individual commands don't need to specify it. I think that's a good idea, but I'm not sure if there's a precedent / established naming convention for that in CLI commands. It makes a lot of sense to do, especially since the command already falls back on what's configured if you don't specify things like the metastore table / endpoint. I still have it so you can override with "-H 12", for instance, on the command-line.
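The precedence described in this message (a command-line value such as "-H 12" overrides a configured default) might look something like the sketch below. The configuration key `example.prune.age.hours` is invented for illustration, since the thread explicitly leaves the naming open, and a plain `Map` stands in for the real Hadoop configuration object.

```java
import java.util.Map;

/** Hypothetical illustration of CLI-over-config precedence, not the code from the patch. */
final class PruneAgeResolver {
  /** Invented key name; the thread does not settle on one. */
  static final String AGE_HOURS_KEY = "example.prune.age.hours";

  private PruneAgeResolver() {}

  /** Returns the age in hours: "-H <n>" on the command line wins, else the configured default. */
  static long resolveAgeHours(String[] args, Map<String, String> conf) {
    for (int i = 0; i + 1 < args.length; i++) {
      if (args[i].equals("-H")) {
        return Long.parseLong(args[i + 1]);
      }
    }
    String configured = conf.get(AGE_HOURS_KEY);
    if (configured == null) {
      throw new IllegalArgumentException(
          "No age given: pass -H <hours> or set " + AGE_HOURS_KEY);
    }
    return Long.parseLong(configured);
  }
}
```

Failing loudly when neither source supplies an age is a design choice of this sketch; the thread does not say what the tool does in that case.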
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Attachment: HADOOP-14041-HADOOP-13345.002.patch

Attaching a patch that adds documentation and better handling of directories. The behavior is that if a directory's contents get pruned and it is now empty, we will also prune that directory (and so on, for its parents, as long as they're empty). Added tests to cover those considerations. Tested against us-east-1.

Not sure if I've mentioned these already, but I'm seeing failures in the tests of the table versioning. Pretty certain it's not related, so just calling it out. I'll file a JIRA about it.

{code}
TestDynamoDBMetadataStore.testTableVersionMismatch:382->verifyTableInitialized:487 » ResourceNotFound
TestDynamoDBMetadataStore.testTableVersionRequired:367->verifyTableInitialized:487 » ResourceNotFound
{code}
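The directory behavior described for this patch (prune a directory once its contents are gone, then repeat for each empty parent) can be sketched over a simple in-memory set of paths. This is only an illustration: `EmptyDirPruner` and its methods are hypothetical names, a `Set<String>` stands in for the metadata store, and the choice never to remove the root is an assumption of the sketch.

```java
import java.util.Set;

/** Hypothetical illustration of pruning empty parent directories, not the code from the patch. */
final class EmptyDirPruner {
  private EmptyDirPruner() {}

  /** Parent of "/a/b/c" is "/a/b"; parent of "/a" is "/"; the root has no parent. */
  static String parent(String path) {
    if (path.equals("/")) {
      return null;
    }
    int i = path.lastIndexOf('/');
    return i == 0 ? "/" : path.substring(0, i);
  }

  /** True if any tracked entry lives underneath dir. */
  static boolean hasChildren(Set<String> entries, String dir) {
    String prefix = dir.equals("/") ? "/" : dir + "/";
    for (String e : entries) {
      if (!e.equals(dir) && e.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }

  /** Removes path, then every ancestor directory left empty (the root is never removed). */
  static void pruneWithEmptyParents(Set<String> entries, String path) {
    entries.remove(path);
    String dir = parent(path);
    while (dir != null && !dir.equals("/") && !hasChildren(entries, dir)) {
      entries.remove(dir);
      dir = parent(dir);
    }
  }
}
```

The walk stops at the first ancestor that still has children, so sibling subtrees are left untouched.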
[jira] [Updated] (HADOOP-14041) CLI command to prune old metadata
[ https://issues.apache.org/jira/browse/HADOOP-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Mackrory updated HADOOP-14041: --- Attachment: HADOOP-14041-HADOOP-13345.001.patch

Attaching a patch that adds prune(timestamp) to the MetadataStore interface and existing implementations, a CLI tool, and tests for all of that. prune() takes a UTC timestamp as returned by System.currentTimeMillis() and should trim everything with a modification time older than that. The CLI tool determines the timestamp by taking the current time and subtracting various lengths of time. One tricky thing is you can specify minutes with -M, and all the time ranges are in caps so that doesn't clash with -m for specifying the metastore URL.

One thing that probably needs more work is what to do about directories. The local implementation will delete its record of a directory if all the files it tracks in that directory get pruned. I should at least do the equivalent for the DynamoDB implementation, but since there's been some special consideration for handling empty directories, that may warrant some more thought. I know [~fabbri]'s been thinking about the nuances of empty directories - any thoughts on that?

All tests pass except as currently documented in other JIRAs. For a time I did have a lot of tests fail at the assertion of type S3AFileStatus in PathMetadataDynamoDBTranslation.pathMetadataToItem. Indeed, we do have a lot of instances of FileStatus (S3AFileStatus' parent class) flying around S3Guard, so I'm surprised I don't get it consistently, but today all the tests are passing. I can't see how anything I've changed while working on this patch would impact it. So just throwing this out there in case others have seen it or have any insight.
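The cutoff calculation described in this message (current time minus the requested age, expressed as epoch milliseconds) can be sketched as follows. The helper name `PruneCutoff` and the particular set of units are assumptions made for illustration; the message itself only confirms that the age is built from "various lengths of time" and that minutes are one of them.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of deriving the prune cutoff, not the code from the patch. */
final class PruneCutoff {
  private PruneCutoff() {}

  /**
   * Subtracts the requested age from nowMillis (e.g. System.currentTimeMillis()).
   * Entries whose modification time is older than the result are candidates for pruning.
   */
  static long cutoffMillis(long nowMillis, long days, long hours, long minutes, long seconds) {
    long age = TimeUnit.DAYS.toMillis(days)
        + TimeUnit.HOURS.toMillis(hours)
        + TimeUnit.MINUTES.toMillis(minutes)
        + TimeUnit.SECONDS.toMillis(seconds);
    return nowMillis - age;
  }

  public static void main(String[] args) {
    // An age of one day and thirty minutes, measured back from now.
    long cutoff = cutoffMillis(System.currentTimeMillis(), 1, 0, 30, 0);
    System.out.println("Prune entries last modified before " + cutoff);
  }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, keeps the arithmetic easy to test.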