[
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511216#comment-16511216
]
Barbara Eckman commented on ATLAS-2708:
---------------------------------------
[~bosco], Great, I'm glad they will be useful.
1. We actually didn't model each object (file), because we don't plan to store
metadata on each object. But I can whip up an object typedef for you. I'd
imagine attributes like: pseudodir it's in, compression format, creation time.
For us they'd be .avro files, so I'd include an avro schema. For you and
others: would you have a .csv "schema" associated with the object? Or a JSON
schema? I could include an attribute of type Schema rather than avro_schema, so
any schema can be associated with it. Any other attributes you can think of?
2. Do you mean there'd be a separate Jira ticket for common AWS entities like
Tags, Permissions, CloudWatchMetrics, etc., instead of lumping them together
with the S3 entities? Then AWSS3Bucket and AWSDynamoDB could each have
attributes of type AWSTag. In terms of CloudWatchMetrics, for Dynamo, are
metric name and scope sufficient? I'd have to check.
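
As a rough illustration of the object typedef floated in point 1, here is a
minimal sketch that builds an Atlas-style typedef as JSON. The type name
AWSS3Object, the attribute names, and the choice of "string" for the schema
attribute are all assumptions made for illustration; none of them come from
all_datalake_typedefs.json, and a real model would probably point the schema
attribute at a schema type rather than a plain string.

```python
import json


def attribute(name, type_name, optional=True):
    """Build one Atlas-style attributeDef with conservative defaults."""
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": "SINGLE",
        "isOptional": optional,
        "isUnique": False,
        "isIndexable": False,
    }


# Hypothetical typedef: the type and attribute names are placeholders.
aws_s3_object_def = {
    "name": "AWSS3Object",
    "superTypes": ["DataSet"],
    "typeVersion": "1.0",
    "attributeDefs": [
        attribute("pseudoDirectory", "string", optional=False),
        attribute("compressionFormat", "string"),
        attribute("createTime", "date"),
        # Assumption: schema kept as a string here; could be any schema type.
        attribute("schema", "string"),
    ],
}

typedefs = {"entityDefs": [aws_s3_object_def]}
print(json.dumps(typedefs, indent=2))
```

Keeping the schema attribute generic (rather than avro-specific) is what would
let .csv or JSON objects reuse the same typedef.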
> AWS S3 data lake typedefs for Atlas
> -----------------------------------
>
> Key: ATLAS-2708
> URL: https://issues.apache.org/jira/browse/ATLAS-2708
> Project: Atlas
> Issue Type: New Feature
> Components: atlas-core
> Reporter: Barbara Eckman
> Assignee: Barbara Eckman
> Priority: Critical
> Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It
> would be nice to add typedefs for AWS data lake objects (buckets and
> pseudo-directories) and lineage processes that move the data from another
> source (e.g., kafka topic) to the data lake. For example:
> * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in
> an S3 bucket. For example, in the case of an object with key
> “myWork/Development/Projects1.xls”, “myWork/Development” is the
> pseudo-directory. It supports:
> ** Array of avro schemas that are associated with the data in the
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
> ** what type of data it contains, e.g., avro, json, unstructured
> ** time of creation
> * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of
> the data in a bucket to a storageClass after a specific time interval, or
> expiration. For example, transition to GLACIER after 60 days, or expire
> (i.e. be deleted) after 90 days:
> ** ruleType (e.g., transition or expiration)
> ** time interval in days before rule is executed
> ** storageClass to which the data is transitioned (null if ruleType is
> expiration)
> * AWSTag type represents a tag-value pair created by the user and associated
> with an AWS object.
> ** tag
> ** value
> * AWSCloudWatchMetric type represents a storage or request metric that is
> monitored by AWS CloudWatch and can be configured for a bucket
> ** metricName, for example, “AllRequests”, “GetRequests”,
> “TotalRequestLatency”, “BucketSizeBytes”
> ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or
> limit the monitoring of the metric.
> * AWSS3Bucket type represents a bucket in an S3 instance. It supports:
> ** Array of AWSS3PseudoDirs that are associated with objects stored
> in the bucket
> ** AWS region
> ** IsEncrypted (boolean)
> ** encryptionType, e.g., AES-256
> ** S3AccessPolicy, a JSON object expressing access policies, e.g., GetObject,
> PutObject
> ** time of creation
> ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket
> ** Array of AWSCloudWatchMetrics that are associated with the bucket or
> its tags or prefixes
> ** Array of AWSTags that are associated with the bucket
> * Generic dataset2Dataset process to represent movement of data from one
> dataset to another. It supports:
> ** array of transforms performed by the process
> ** map of tag/value pairs representing configurationParameters of the process
> ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and
> S3 pseudo-directory.
>
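
Two of the rules in the description above can be sketched in a few lines of
Python: deriving a pseudo-directory from an object key, and the constraint that
storageClass is null when ruleType is expiration. The function and class names
below (pseudo_dir, LifecycleRule) are hypothetical helpers for illustration,
not part of Atlas or of the attached typedefs. Requiring a storageClass for
transition rules is an added assumption that the description only implies.

```python
from dataclasses import dataclass
from typing import Optional


def pseudo_dir(key: str) -> str:
    """Return the part of an S3 object key before the last '/' ('' if none)."""
    return key.rpartition("/")[0]


@dataclass(frozen=True)
class LifecycleRule:
    """Hypothetical in-memory mirror of an AWSS3BucketLifeCycleRule."""

    rule_type: str                       # "transition" or "expiration"
    days: int                            # interval before the rule executes
    storage_class: Optional[str] = None  # None when rule_type is "expiration"

    def __post_init__(self):
        if self.rule_type not in ("transition", "expiration"):
            raise ValueError(f"unknown rule_type: {self.rule_type!r}")
        if self.days < 0:
            raise ValueError("days must be non-negative")
        if self.rule_type == "expiration" and self.storage_class is not None:
            raise ValueError("expiration rules must not set storage_class")
        # Assumption: a transition without a target class is treated as invalid.
        if self.rule_type == "transition" and self.storage_class is None:
            raise ValueError("transition rules need a storage_class")


print(pseudo_dir("myWork/Development/Projects1.xls"))  # myWork/Development
to_glacier = LifecycleRule("transition", 60, "GLACIER")
expire = LifecycleRule("expiration", 90)
```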