ayush1300 commented on code in PR #7834: URL: https://github.com/apache/hadoop/pull/7834#discussion_r2247806685
##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3aTagging.md:
##########

@@ -0,0 +1,298 @@

# S3 Object Tagging Support in Hadoop S3A Filesystem

## Overview

The Hadoop S3A filesystem connector now supports S3 object tagging, allowing users to automatically assign metadata tags to S3 objects during creation and soft deletion operations. This feature enables better data organization, cost allocation, access control, and lifecycle management for S3-stored data.

**JIRA Issue**: [HADOOP-19536](https://issues.apache.org/jira/browse/HADOOP-19536#s3-tags)

## Table of Contents

- [Motivation](#motivation)
- [S3 Object Tagging Capabilities](#s3-object-tagging-capabilities)
- [Use Cases](#use-cases)
- [Configuration](#configuration)
- [Usage Examples](#usage-examples)
- [Soft Delete Feature](#soft-delete-feature)
- [Best Practices](#best-practices)
- [Limitations](#limitations)

## Motivation

Amazon S3 supports tagging objects with key-value pairs, providing several critical benefits:

1. **Cost Allocation**: Track and allocate S3 storage costs across departments, projects, or cost centers
2. **Access Control**: Use tags in IAM policies to control object access permissions
3. **Lifecycle Management**: Trigger automated lifecycle policies for object transitions and expiration
4. **Data Classification**: Organize and classify data for compliance, security, and business requirements
5. **Analytics and Reporting**: Enable detailed analytics and reporting based on object metadata

Previously, the Hadoop S3A connector lacked native support for object tagging, requiring users to implement custom solutions or use separate tools to tag objects post-creation.

## S3 Object Tagging Capabilities

### Tag Specifications

- **Maximum Tags**: Up to 10 tags per object
- **Structure**: Key-value pairs
- **Key Length**: Up to 128 Unicode characters
- **Value Length**: Up to 256 Unicode characters
- **Case Sensitivity**: Keys and values are case-sensitive
- **Uniqueness**: Tag keys must be unique per object (no duplicate keys)

### Allowed Characters

Tag keys and values can contain:

- Letters (a-z, A-Z)
- Numbers (0-9)
- Spaces
- Special symbols: `. : + - = _ / @`

## Use Cases

### 1. Access Control with IAM Policies

Control object access based on tags:

```json
{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "s3:ExistingObjectTag/department": "finance"
    }
  }
}
```

### 2. Lifecycle Management

Trigger lifecycle rules based on tags:

```json
{
  "Rules": [
    {
      "Status": "Enabled",
      "Filter": {
        "Tag": {
          "Key": "retention",
          "Value": "temporary"
        }
      },
      "Expiration": {
        "Days": 30
      }
    }
  ]
}
```

### 3. Cost Allocation and Tracking

- Use tags for cost tracking in AWS Cost Explorer (tags on written objects can be verified with the CLI check shown below)
- Allocate costs across different business units or projects
- Generate detailed billing reports by tag dimensions
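Whichever of these use cases applies, it is useful to confirm which tags actually ended up on an object written through S3A. A minimal sketch with the AWS CLI, assuming a placeholder bucket `my-bucket` and key `reports/part-0000` (not names from this PR):

```bash
# Inspect the tags on an object written through S3A.
# Bucket and key are placeholders; substitute your own paths.
aws s3api get-object-tagging \
  --bucket my-bucket \
  --key reports/part-0000
```

The returned `TagSet` is exactly what IAM tag conditions, lifecycle filters, and cost-allocation reports will see.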
### 4. Data Analytics and Filtering

- Use S3 Analytics to filter and analyze data by tags
- Create custom reports based on tagged object metadata
- Enable data governance and compliance reporting

## Configuration

### Object Creation Tags

#### Method 1: Comma-Separated List

```properties
fs.s3a.object.tags=department=finance,project=alpha,owner=data-team
```

#### Method 2: Individual Tag Properties

```properties
fs.s3a.object.tag.department=finance
fs.s3a.object.tag.project=alpha
fs.s3a.object.tag.owner=data-team
fs.s3a.object.tag.environment=production
```

### Soft Delete Tags

```properties
fs.s3a.soft.delete.enabled=true
fs.s3a.soft.delete.tag.key=archive
fs.s3a.soft.delete.tag.value=true
```

## Usage Examples

### Spark Applications

#### Using Comma-Separated Tags

```bash
spark-submit \
  --conf spark.hadoop.fs.s3a.object.tags=department=finance,project=alpha,environment=prod \
  --class MySparkApp \
  my-app.jar
```

#### Using Individual Tag Configurations

```bash
spark-submit \
  --conf spark.hadoop.fs.s3a.object.tag.department=finance \
  --conf spark.hadoop.fs.s3a.object.tag.project=alpha \
  --conf spark.hadoop.fs.s3a.object.tag.owner=data-team \
  --conf spark.hadoop.fs.s3a.object.tag.cost-center=engineering \
  --class MySparkApp \
  my-app.jar
```

### Hadoop Commands

#### File Upload with Tags

```bash
hadoop fs \
  -Dfs.s3a.object.tag.department=finance \
  -Dfs.s3a.object.tag.project=quarterly-report \
  -put local-file.txt s3a://my-bucket/reports/
```

#### Directory Operations with Tags

```bash
hadoop fs \
  -Dfs.s3a.object.tags=team=analytics,retention=long-term \
  -put /local/data/ s3a://my-bucket/analytics/
```

### MapReduce Jobs

```bash
hadoop jar my-job.jar \
  -Dfs.s3a.object.tag.job-type=etl \
  -Dfs.s3a.object.tag.priority=high \
  input s3a://my-bucket/output/
```

## Soft Delete Feature

The soft delete feature allows you to tag objects instead of permanently deleting them, enabling data retention policies and recovery options.

### Important Behavior Notes

- **Default Tags**: If no soft delete tag key and value are specified, the defaults defined in the configuration are used
- **Tag Replacement**: When a soft delete is performed, **all existing tags on the object are removed** and replaced with only the soft delete tag specified by the user

### Current Implementation

```bash
# Using custom soft delete tags
hadoop fs \
  -Dfs.s3a.soft.delete.enabled=true \
  -Dfs.s3a.soft.delete.tag.key=archive \
  -Dfs.s3a.soft.delete.tag.value=true \
  -rm s3a://my-bucket/file-to-archive.txt

# Using default soft delete tags (if configured)
hadoop fs \
  -Dfs.s3a.soft.delete.enabled=true \
  -rm s3a://my-bucket/file-to-archive.txt
```

### Future Capabilities (Planned)

Review Comment:
   Got it, I will remove this section/rename this accordingly.