steveloughran commented on code in PR #7834:
URL: https://github.com/apache/hadoop/pull/7834#discussion_r2242160657
##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3aTagging.md:
##########
@@ -0,0 +1,298 @@
+# S3 Object Tagging Support in Hadoop S3A Filesystem
+
+## Overview
+
+The Hadoop S3A filesystem connector now supports S3 object tagging, allowing users to automatically assign metadata tags to S3 objects during creation and soft deletion operations. This feature enables better data organization, cost allocation, access control, and lifecycle management for S3-stored data.
+
+**JIRA Issue**: [HADOOP-19536](https://issues.apache.org/jira/browse/HADOOP-19536#s3-tags)
+
+## Table of Contents
+
+- [Motivation](#motivation)
+- [S3 Object Tagging Capabilities](#s3-object-tagging-capabilities)
+- [Use Cases](#use-cases)
+- [Configuration](#configuration)
+- [Usage Examples](#usage-examples)
+- [Soft Delete Feature](#soft-delete-feature)
+- [Best Practices](#best-practices)
+- [Limitations](#limitations)
+
+## Motivation
+
+Amazon S3 supports tagging objects with key-value pairs, providing several critical benefits:
+
+1. **Cost Allocation**: Track and allocate S3 storage costs across departments, projects, or cost centers
+2. **Access Control**: Use tags in IAM policies to control object access permissions
+3. **Lifecycle Management**: Trigger automated lifecycle policies for object transitions and expiration
+4. **Data Classification**: Organize and classify data for compliance, security, and business requirements
+5. **Analytics and Reporting**: Enable detailed analytics and reporting based on object metadata
+
+Previously, the Hadoop S3A connector lacked native support for object tagging, requiring users to implement custom solutions or use separate tools to tag objects post-creation.
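(Editor's illustration, not part of the diff under review.) The tag limits the draft lists below — at most 10 tags per object, keys up to 128 characters, values up to 256, and a restricted character set — can be checked client-side before an upload is attempted. A minimal Python sketch of such a validator; `validate_tags` is a hypothetical helper, not S3A code, and it restricts input to the ASCII characters named in the draft:

```python
import re

# S3 object-tag constraints as listed in the draft documentation:
# letters, digits, spaces, and the symbols . : + - = _ / @
_ALLOWED = re.compile(r"^[A-Za-z0-9 .:+\-=_/@]*$")

def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of constraint violations; an empty list means valid.

    Key uniqueness is enforced by the dict type itself, so it is not
    re-checked here.
    """
    errors = []
    if len(tags) > 10:
        errors.append(f"too many tags: {len(tags)} > 10")
    for key, value in tags.items():
        if not 1 <= len(key) <= 128:
            errors.append(f"key length out of range: {key!r}")
        if len(value) > 256:
            errors.append(f"value too long for key {key!r}")
        if not _ALLOWED.match(key) or not _ALLOWED.match(value):
            errors.append(f"illegal character in tag {key!r}={value!r}")
    return errors
```

Running such a check before submitting the request avoids a round trip that S3 would reject anyway.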
+
+## S3 Object Tagging Capabilities
+
+### Tag Specifications
+
+- **Maximum Tags**: Up to 10 tags per object
+- **Structure**: Key-value pairs
+- **Key Length**: Up to 128 Unicode characters
+- **Value Length**: Up to 256 Unicode characters
+- **Case Sensitivity**: Keys and values are case-sensitive
+- **Uniqueness**: Tag keys must be unique per object (no duplicate keys)
+
+### Allowed Characters
+
+Tag keys and values can contain:
+
+- Letters (a-z, A-Z)
+- Numbers (0-9)
+- Spaces
+- Special symbols: `. : + - = _ / @`
+
+## Use Cases
+
+### 1. Access Control with IAM Policies
+
+Control object access based on tags:
+
+```json
+{
+  "Effect": "Allow",
+  "Action": "s3:GetObject",
+  "Resource": "*",
+  "Condition": {
+    "StringEquals": {
+      "s3:ExistingObjectTag/department": "finance"
+    }
+  }
+}
+```
+
+### 2. Lifecycle Management
+
+Trigger lifecycle rules based on tags:
+
+```json
+{
+  "Rules": [
+    {
+      "Status": "Enabled",
+      "Filter": {
+        "Tag": {
+          "Key": "retention",
+          "Value": "temporary"
+        }
+      },
+      "Expiration": {
+        "Days": 30
+      }
+    }
+  ]
+}
+```
+
+### 3. Cost Allocation and Tracking
+
+- Use tags for cost tracking in AWS Cost Explorer
+- Allocate costs across different business units or projects
+- Generate detailed billing reports by tag dimensions
+
+### 4. Data Analytics and Filtering
+
+- Use S3 Analytics to filter and analyze data by tags
+- Create custom reports based on tagged object metadata
+- Enable data governance and compliance reporting
+
+## Configuration
+
+### Object Creation Tags
+
+#### Method 1: Comma-Separated List
+
+```properties
+fs.s3a.object.tags=department=finance,project=alpha,owner=data-team
+```
+
+#### Method 2: Individual Tag Properties
+
+```properties
+fs.s3a.object.tag.department=finance
+fs.s3a.object.tag.project=alpha
+fs.s3a.object.tag.owner=data-team
+fs.s3a.object.tag.environment=production
+```
+
+### Soft Delete Tags
+
+```properties
+fs.s3a.soft.delete.enabled=true
+fs.s3a.soft.delete.tag.key=archive
+fs.s3a.soft.delete.tag.value=true
+```
+
+## Usage Examples
+
+### Spark Applications
+
+#### Using Comma-Separated Tags
+
+```bash
+spark-submit \
+  --conf spark.hadoop.fs.s3a.object.tags=department=finance,project=alpha,environment=prod \
+  --class MySparkApp \
+  my-app.jar
+```
+
+#### Using Individual Tag Configurations
+
+```bash
+spark-submit \
+  --conf spark.hadoop.fs.s3a.object.tag.department=finance \
+  --conf spark.hadoop.fs.s3a.object.tag.project=alpha \
+  --conf spark.hadoop.fs.s3a.object.tag.owner=data-team \
+  --conf spark.hadoop.fs.s3a.object.tag.cost-center=engineering \
+  --class MySparkApp \
+  my-app.jar
+```
+
+### Hadoop Commands
+
+#### File Upload with Tags
+
+```bash
+hadoop fs \
+  -Dfs.s3a.object.tag.department=finance \
+  -Dfs.s3a.object.tag.project=quarterly-report \
+  -put local-file.txt s3a://my-bucket/reports/
+```
+
+#### Directory Operations with Tags
+
+```bash
+hadoop fs \
+  -Dfs.s3a.object.tags=team=analytics,retention=long-term \
+  -put /local/data/ s3a://my-bucket/analytics/
+```
+
+### MapReduce Jobs
+
+```bash
+hadoop jar my-job.jar \
+  -Dfs.s3a.object.tag.job-type=etl \
+  -Dfs.s3a.object.tag.priority=high \
+  input s3a://my-bucket/output/
+```
+
+## Soft Delete Feature
+
+The soft delete feature allows you to tag objects instead of permanently deleting
them, enabling data retention policies and recovery options.

Review Comment:
   or people use versioned buckets, obviously

##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3aTagging.md:
##########
+### Soft Delete Tags
+
+```properties
+fs.s3a.soft.delete.enabled=true

Review Comment:
   is the idea that when enabled, each delete(path) call is remapped to tagging the object for deletion? is this for recovery or for a performance benefit?

##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3aTagging.md:
##########
+## Soft Delete Feature
+
+The soft delete feature allows you to tag objects instead of permanently deleting
them, enabling data retention policies and recovery options.
+
+### Important Behavior Notes
+
+- **Default Tags**: If no tag key and value are specified, default tags are used as defined in the configuration
+- **Tag Replacement**: When soft delete is performed, **all existing tags on the object are removed** and replaced with only the soft delete tag specified by the user
+
+### Current Implementation
+
+```bash
+# Using custom soft delete tags
+hadoop fs \
+  -Dfs.s3a.soft.delete.enabled=true \
+  -Dfs.s3a.soft.delete.tag.key=archive \
+  -Dfs.s3a.soft.delete.tag.value=true \
+  -rm s3a://my-bucket/file-to-archive.txt
+
+# Using default soft delete tags (if configured)
+hadoop fs \
+  -Dfs.s3a.soft.delete.enabled=true \
+  -rm s3a://my-bucket/file-to-archive.txt
+```
+
+### Future Capabilities (Planned)

Review Comment:
   don't make these commitments as they get complex fast, e.g.
   * what if a path has nothing but soft deleted files underneath? Does it exist? Can I do a non-recursive rm of a directory with nothing but soft delete entries underneath? We'd reject now, as a LIST call would say stuff is there, and we don't do a HEAD on each file looking for a soft-delete marker.
   * what if I rename a soft-deleted file? Does it come back into existence?
   * what if I create a file, the header is set to not create if a file is there, but there's a soft deleted entry?

   Better to say

   > While tagged as soft delete, the files are still visible to filesystem operations such as list and create.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
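(Editor's illustration.) The draft under review documents two ways to supply creation tags: a single comma-separated `fs.s3a.object.tags` property and one `fs.s3a.object.tag.*` property per tag. How the two styles could merge into one tag map can be sketched in Python; the property names come from the draft, but the function itself and the precedence rule (per-tag properties win over the comma list) are assumptions, not the S3A implementation:

```python
def collect_object_tags(conf: dict[str, str]) -> dict[str, str]:
    """Merge the two tag-configuration styles from the draft into one map."""
    tags: dict[str, str] = {}
    # Style 1: comma-separated key=value pairs in a single property.
    for pair in conf.get("fs.s3a.object.tags", "").split(","):
        if "=" in pair:
            key, _, value = pair.partition("=")
            tags[key.strip()] = value.strip()
    # Style 2: one property per tag; the suffix after the prefix is the tag key.
    # Assumed precedence: these override entries from the comma list.
    prefix = "fs.s3a.object.tag."
    for name, value in conf.items():
        if name.startswith(prefix) and len(name) > len(prefix):
            tags[name[len(prefix):]] = value
    return tags
```

Note that `fs.s3a.object.tags` (plural) does not match the `fs.s3a.object.tag.` prefix, so the two styles never collide on the property name itself.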