[ 
https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512845#comment-16512845
 ] 

Don Bosco Durai commented on ATLAS-2708:
----------------------------------------

[~barbara], my comments are inline
{quote}bq. 1.  We actually didn't model each object (file), because we don't 
plan to store metadata on each object.  But I can whip up an object typedef for 
you.  
{quote}
Thanks for the clarification. After giving it some more thought, I feel the S3 
hierarchy could be modeled like the database model, e.g. hive_db -> hive_table 
-> hive_column. Here, it would be S3Bucket -> Pseudodir -> S3Object. Each level 
can have different attributes and properties, so we should be able to capture 
them. In your current model, it seems AWSS3Bucket holds a list of 
AWSS3PseudoDir, whereas in hive_model.json, hive_table has a reference to 
hive_db (although hive_table also has an array of hive_columns).

In your case, the S3Object could be optional.
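
To make the comparison concrete, here is a minimal sketch in Python of the parent-reference layout, where each level points at its parent the way hive_table points at hive_db. The dict shape only loosely imitates a typedef. The AWSS3Object type, the attribute names (bucket, pseudoDirectory, region, createTime, compressionType) and both helper functions are assumptions for illustration, not taken from the attached JSON.

```python
# Hypothetical sketch: child types carry a mandatory reference to their parent.
def entity_def(name, parent_attr=None, parent_type=None, extra_attrs=()):
    """Build a typedef-like dict; the field layout is illustrative only."""
    attrs = [{"name": a, "typeName": t, "isOptional": True} for a, t in extra_attrs]
    if parent_attr is not None:
        # The parent reference is the only mandatory attribute in this sketch.
        attrs.append({"name": parent_attr, "typeName": parent_type, "isOptional": False})
    return {"name": name, "attributeDefs": attrs}


bucket = entity_def("AWSS3Bucket", extra_attrs=[("region", "string")])
pseudo_dir = entity_def("AWSS3PseudoDir", "bucket", "AWSS3Bucket",
                        [("createTime", "date")])
s3_object = entity_def("AWSS3Object", "pseudoDirectory", "AWSS3PseudoDir",
                       [("compressionType", "string")])


def parent_chain(type_name, defs):
    """Follow the mandatory parent references upward, starting at type_name."""
    by_name = {d["name"]: d for d in defs}
    chain = [type_name]
    while True:
        parents = [a["typeName"] for a in by_name[chain[-1]]["attributeDefs"]
                   if not a["isOptional"]]
        if not parents:
            return chain
        chain.append(parents[0])
```

With this layout, the optional S3Object level can be left out without touching the bucket or pseudo-directory definitions.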
{quote}I could include an attribute of type Schema rather than 
avro_schema, associated with it.
{quote}
Thanks for pointing that out. I didn't realize that Avro was a first-class 
attribute. Yes, it makes sense to make it more generic.
{quote}Any other attributes you can think of?
{quote}
I can think of compression type for now. If the data is CSV, it might also need 
a delimiter. But should these be dynamic attributes? I'll let the others 
comment on the best practice for modeling.
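
One possible split, sketched in Python: keep compression type as a fixed attribute because it applies to any format, and put format-specific settings such as the CSV delimiter in a free-form map. The class name DataFormatInfo, its field names, and the "," fallback are all assumptions for illustration, not part of the proposed typedefs.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DataFormatInfo:
    """Hypothetical holder for format metadata on a pseudo-directory or object."""
    data_type: str                           # e.g. "avro", "json", "csv"
    compression_type: Optional[str] = None   # fixed attribute, valid for any format
    format_options: Dict[str, str] = field(default_factory=dict)  # dynamic attributes

    def delimiter(self) -> Optional[str]:
        """Return the delimiter for CSV data; other formats have none."""
        if self.data_type != "csv":
            return None
        # Falling back to "," is an assumption made for this sketch.
        return self.format_options.get("delimiter", ",")


csv_dir = DataFormatInfo("csv", compression_type="gzip",
                         format_options={"delimiter": "|"})
avro_dir = DataFormatInfo("avro", compression_type="snappy")
```

The trade-off is that dynamic attributes avoid a typedef change for every new format, but they are harder to validate and search on than fixed ones.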

 
{quote}bq. 2. Do you mean there'd be a separate Jira ticket for AWS common 
Entities like Tags, Permissions, CloudWatchMetrics, etc, instead of lumping 
them together with the S3 entities?  Then AWSS3Bucket and AWSDynamoDB could 
each have attributes of type AWSTag.  In terms of CloudWatchMetrics, for 
Dynamo, are metric name and scope sufficient?  I'd have to check.  
{quote}
Yes, you are right. In some cases it could be an array of entities (e.g. 
AWSTag), and in other cases it could just extend a base class, the way 
hdfs_path extends fs_path. Either way, it would be good to abstract them out.

For now, it could be in the same JSON and part of this JIRA.
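
The two options can be sketched side by side in Python. AWSTag and its tag/value fields come from the issue description. The AWSResource base class, its name and region fields, and the table_count placeholder are hypothetical and only stand in for whatever common attributes get abstracted out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AWSTag:
    """Tag/value pair that any AWS entity can hold in an array."""
    tag: str
    value: str


@dataclass
class AWSResource:
    """Hypothetical common base, analogous to fs_path as the base of hdfs_path."""
    name: str
    region: str
    tags: List[AWSTag] = field(default_factory=list)  # option 1: array of entities


@dataclass
class AWSS3Bucket(AWSResource):   # option 2: extend the shared base class
    is_encrypted: bool = False


@dataclass
class AWSDynamoDB(AWSResource):
    table_count: int = 0          # placeholder attribute for this sketch only


logs_bucket = AWSS3Bucket("logs", "us-east-1", [AWSTag("team", "data")],
                          is_encrypted=True)
```

Both buckets and DynamoDB entities then share the tag handling without duplicating it in each typedef.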

 

 

> AWS S3 data lake typedefs for Atlas
> -----------------------------------
>
>                 Key: ATLAS-2708
>                 URL: https://issues.apache.org/jira/browse/ATLAS-2708
>             Project: Atlas
>          Issue Type: New Feature
>          Components:  atlas-core
>            Reporter: Barbara Eckman
>            Assignee: Barbara Eckman
>            Priority: Critical
>         Attachments: all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It 
> would be nice to add typedefs for AWS data lake objects (buckets and 
> pseudo-directories) and lineage processes that move the data from another 
> source (e.g., kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in 
> an S3 bucket.  For example, in the case of an object with key 
> “myWork/Development/Projects1.xls”, “myWork/Development” is the 
> pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the 
> pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of 
> the data in a bucket to a storageClass after a specific time interval, or 
> expiration.  For example, transition to GLACIER after 60 days, or expire 
> (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is 
> expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated 
> with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is 
> monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, 
> TotalRequestLatency, BucketSizeBytes
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or 
> limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored 
> in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, e.g., 
> GetObject, PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSS3CloudWatchMetrics that are associated with the bucket or 
> its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one 
> dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and 
> S3 pseudo-directory.
>  
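
For reference, the pseudo-directory example in the description above (key "myWork/Development/Projects1.xls", pseudo-directory "myWork/Development") can be reproduced with a short Python sketch. The function name pseudo_dir_of is hypothetical, and treating everything before the last "/" as the pseudo-directory is an assumption based on that single example.

```python
def pseudo_dir_of(key: str) -> str:
    """Return the part of an S3 object key before its last '/', or '' if there is none."""
    # rpartition yields ('', '', key) when '/' is absent, so the prefix is empty.
    return key.rpartition("/")[0]
```

A key without any "/" sits directly in the bucket, so the sketch returns an empty string for it.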



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
