Barbara Eckman created ATLAS-2708:
-------------------------------------

             Summary: AWS S3 data lake typedefs for Atlas
                 Key: ATLAS-2708
                 URL: https://issues.apache.org/jira/browse/ATLAS-2708
             Project: Atlas
          Issue Type: New Feature
          Components:  atlas-core
            Reporter: Barbara Eckman


Currently the base types in Atlas do not include AWS data lake objects. It 
would be nice to add typedefs for AWS data lake objects (buckets and 
pseudo-directories) and lineage processes that move the data from another 
source (e.g., a Kafka topic) to the data lake.  For example:
 * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
an S3 bucket.  For example, in the case of an object with key 
“myWork/Development/Projects1.xls”, “myWork/Development” is the 
pseudo-directory.  It supports:
 ** Array of Avro schemas that are associated with the data in the 
pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
 ** what type of data it contains, e.g., Avro, JSON, unstructured
 ** time of creation
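
As an illustration of the prefix rule above, here is a minimal sketch that derives the pseudo-directory from an object key. The function name pseudo_dir is hypothetical, and it assumes "/" is the delimiter, as in the example key.

```python
def pseudo_dir(key: str) -> str:
    """Return the pseudo-directory prefix of an S3 object key.

    Assumes "/" is the delimiter; returns "" for a key with no prefix.
    """
    prefix, sep, _ = key.rpartition("/")
    return prefix if sep else ""


print(pseudo_dir("myWork/Development/Projects1.xls"))
```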

 * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
the data in a bucket to a storageClass after a specific time interval, or 
expiration.  For example, transition to GLACIER after 60 days, or expire (i.e. 
be deleted) after 90 days:
 ** ruleType (e.g., transition or expiration)
 ** time interval in days before rule is executed  
 ** storageClass to which the data is transitioned (null if ruleType is 
expiration)
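
The constraints on a lifecycle rule could be sketched as follows. This is only an illustration under assumed names (BucketLifeCycleRule, rule_type, days, storage_class are hypothetical), not the proposed Atlas typedef itself; it encodes that storageClass is null for expiration rules and required for transition rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BucketLifeCycleRule:
    rule_type: str                       # "transition" or "expiration"
    days: int                            # interval in days before the rule runs
    storage_class: Optional[str] = None  # None when rule_type is "expiration"

    def __post_init__(self):
        if self.rule_type not in ("transition", "expiration"):
            raise ValueError(f"unknown ruleType: {self.rule_type!r}")
        if self.days < 0:
            raise ValueError("days must be non-negative")
        if self.rule_type == "expiration" and self.storage_class is not None:
            raise ValueError("expiration rules have no storageClass")
        if self.rule_type == "transition" and not self.storage_class:
            raise ValueError("transition rules need a storageClass")


# The two rules from the example above.
to_glacier = BucketLifeCycleRule("transition", 60, "GLACIER")
expire = BucketLifeCycleRule("expiration", 90)
```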

 * AWSTag type represents a tag-value pair created by the user and associated 
with an AWS object.
 ** tag
 ** value

 * AWSCloudWatchMetric type represents a storage or request metric that is 
monitored by AWS CloudWatch and can be configured for a bucket
 ** metricName, for example, “AllRequests”, “GetRequests”, 
“TotalRequestLatency”, “BucketSizeBytes”
 ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
limit the monitoring of the metric.

 * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
 ** Array of AWSS3PseudoDirs that are associated with objects stored in 
the bucket
 ** AWS region
 ** IsEncrypted (boolean) 
 ** encryptionType, e.g., AES-256
 ** S3AccessPolicy, a JSON object expressing access policies, e.g., GetObject, 
PutObject
 ** time of creation
 ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
 ** Array of AWSCloudWatchMetrics that are associated with the bucket or its 
tags or prefixes
 ** Array of AWSTags that are associated with the bucket
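
To make the bucket type more concrete, below is a rough sketch of how it might be written as an Atlas entity typedef, built as a Python dict and dumped to JSON. The envelope (entityDefs, attributeDefs, typeName strings such as array<...>) follows the Atlas v2 typedef layout as I understand it; the attribute names, the cardinality choices, and the DataSet supertype are assumptions made for illustration, and the referenced types would need their own definitions.

```python
import json


def attr(name, type_name, cardinality="SINGLE"):
    """Build one attribute definition; every attribute is optional in this sketch."""
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": cardinality,
        "isOptional": True,
        "isUnique": False,
        "isIndexable": False,
    }


bucket_def = {
    "name": "AWSS3Bucket",
    "superTypes": ["DataSet"],  # assumption: a bucket is modeled as a dataset
    "typeVersion": "1.0",
    "attributeDefs": [
        # Attribute names below are hypothetical.
        attr("pseudoDirectories", "array<AWSS3PseudoDir>", "SET"),
        attr("region", "string"),
        attr("isEncrypted", "boolean"),
        attr("encryptionType", "string"),
        attr("s3AccessPolicy", "string"),  # JSON policy kept as text
        attr("createTime", "date"),
        attr("lifeCycleRules", "array<AWSS3BucketLifeCycleRule>", "SET"),
        attr("cloudWatchMetrics", "array<AWSCloudWatchMetric>", "SET"),
        attr("awsTags", "array<AWSTag>", "SET"),
    ],
}

typedefs_json = json.dumps({"entityDefs": [bucket_def]}, indent=2)
print(typedefs_json)
```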

 * Generic dataset2Dataset process to represent movement of data from one 
dataset to another.  It supports:
 ** array of transforms performed by the process 
 ** map of tag/value pairs representing configurationParameters of the process
 ** inputs and outputs are arrays of dataset objects, e.g., a Kafka topic and 
an S3 pseudo-directory.
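
An instance of such a process might look roughly like the entity below, which links a Kafka topic to an S3 pseudo-directory. This is a sketch only: the qualifiedName values, the transform name, and the configuration keys are made up for illustration, and the references use typeName plus uniqueAttributes on the assumption that both endpoints are already registered in Atlas.

```python
import json


def ref(type_name, qualified_name):
    """Reference an existing entity by type and unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


process = {
    "typeName": "dataset2Dataset",
    "attributes": {
        # All values below are hypothetical examples.
        "qualifiedName": "kafka2s3.events@example",
        "name": "kafka2s3.events",
        "transforms": ["json-to-avro"],
        "configurationParameters": {"batchSize": "500"},
        "inputs": [ref("kafka_topic", "events@example")],
        "outputs": [
            ref("AWSS3PseudoDir", "s3://example-bucket/myWork/Development")
        ],
    },
}

print(json.dumps({"entity": process}, indent=2))
```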

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
