Barbara Eckman created ATLAS-2708:
-------------------------------------
Summary: AWS S3 data lake typedefs for Atlas
Key: ATLAS-2708
URL: https://issues.apache.org/jira/browse/ATLAS-2708
Project: Atlas
Issue Type: New Feature
Components: atlas-core
Reporter: Barbara Eckman
Currently the base types in Atlas do not include AWS data lake objects. It
would be useful to add typedefs for AWS data lake objects (buckets and
pseudo-directories) and for lineage processes that move data from another
source (e.g., a Kafka topic) into the data lake. For example:
* AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in
an S3 bucket. For example, in the case of an object with key
“myWork/Development/Projects1.xls”, “myWork/Development” is the
pseudo-directory. It supports:
** Array of Avro schemas that are associated with the data in the
pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
** type of data it contains, e.g., Avro, JSON, unstructured
** time of creation
* AWSS3BucketLifeCycleRule type represents a rule specifying a transition of
the data in a bucket to a storageClass after a specific time interval, or
expiration. For example, transition to GLACIER after 60 days, or expire
(i.e., be deleted) after 90 days. It supports:
** ruleType (e.g., transition or expiration)
** time interval in days before the rule is executed
** storageClass to which the data is transitioned (null if ruleType is
expiration)
* AWSTag type represents a tag-value pair created by the user and associated
with an AWS object. It supports:
** tag
** value
* AWSCloudWatchMetric type represents a storage or request metric that is
monitored by AWS CloudWatch and can be configured for a bucket. It supports:
** metricName, for example, “AllRequests”, “GetRequests”,
“TotalRequestLatency”, “BucketSizeBytes”
** scope: null if entire bucket; otherwise, the prefixes/tags that filter or
limit the monitoring of the metric.
* AWSS3Bucket type represents a bucket in an S3 instance. It supports:
** Array of AWSS3PseudoDir objects that are associated with objects stored in
the bucket
** AWS region
** IsEncrypted (boolean)
** encryptionType, e.g., AES-256
** S3AccessPolicy, a JSON object expressing access policies, e.g., GetObject,
PutObject
** time of creation
** Array of AWSS3BucketLifeCycleRules that are associated with the bucket
** Array of AWSCloudWatchMetric objects that are associated with the bucket or
its tags or prefixes
** Array of AWSTags that are associated with the bucket
* Generic dataset2Dataset process to represent movement of data from one
dataset to another. It supports:
** array of transforms performed by the process
** map of tag/value pairs representing configurationParameters of the process
** inputs and outputs, which are arrays of dataset objects, e.g., a Kafka
topic and an S3 pseudo-directory.
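
The types listed above could be sketched roughly as follows. This is only an
illustration under assumptions, not a proposed final model: the dict layout
loosely mirrors the shape of Atlas typedef JSON, and the helper names
(pseudo_dir, attr, entity_def), the chosen supertypes, the avro_schema type
name, and any attribute name not spelled out in the description (e.g.,
dataType, days, lifeCycleRules) are hypothetical.

```python
import json


def pseudo_dir(key):
    """Return the pseudo-directory prefix of an S3 object key ('' if none)."""
    return key.rpartition("/")[0]


def attr(name, type_name, optional=True):
    """One attribute definition; the field layout is an assumption."""
    return {"name": name, "typeName": type_name, "isOptional": optional}


def entity_def(name, super_types, attrs):
    """One entity type definition with its attribute list."""
    return {"name": name, "superTypes": super_types, "attributeDefs": attrs}


# Attribute names are illustrative; the issue lists only what each type
# should support, not the final names. Supertypes are assumptions too.
TYPEDEFS = {"entityDefs": [
    entity_def("AWSS3PseudoDir", ["DataSet"], [
        attr("avroSchemas", "array<avro_schema>"),   # hypothetical type name
        attr("dataType", "string"),                  # e.g. avro, json
        attr("createTime", "date"),
    ]),
    entity_def("AWSS3BucketLifeCycleRule", ["Referenceable"], [
        attr("ruleType", "string", optional=False),  # transition | expiration
        attr("days", "int", optional=False),
        attr("storageClass", "string"),              # null for expiration
    ]),
    entity_def("AWSTag", ["Referenceable"], [
        attr("tag", "string", optional=False),
        attr("value", "string"),
    ]),
    entity_def("AWSCloudWatchMetric", ["Referenceable"], [
        attr("metricName", "string", optional=False),
        attr("scope", "array<string>"),              # null = entire bucket
    ]),
    entity_def("AWSS3Bucket", ["DataSet"], [
        attr("pseudoDirectories", "array<AWSS3PseudoDir>"),
        attr("region", "string"),
        attr("isEncrypted", "boolean"),
        attr("encryptionType", "string"),            # e.g. AES-256
        attr("s3AccessPolicy", "string"),            # JSON text
        attr("createTime", "date"),
        attr("lifeCycleRules", "array<AWSS3BucketLifeCycleRule>"),
        attr("cloudWatchMetrics", "array<AWSCloudWatchMetric>"),
        attr("awsTags", "array<AWSTag>"),
    ]),
    # inputs/outputs are assumed to be inherited from the Process supertype.
    entity_def("dataset2Dataset", ["Process"], [
        attr("transforms", "array<string>"),
        attr("configurationParameters", "map<string,string>"),
    ]),
]}

if __name__ == "__main__":
    print(pseudo_dir("myWork/Development/Projects1.xls"))
    print(json.dumps(TYPEDEFS, indent=2))
```

Run as a script, this prints the pseudo-directory for the example key from the
description (myWork/Development) followed by the sketched typedefs as JSON.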
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)