[ 
https://issues.apache.org/jira/browse/ATLAS-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737589#comment-16737589
 ] 

Madhan Neethiraj commented on ATLAS-3006:
-----------------------------------------

The following configurations have been introduced to specify the list of tables 
to ignore or prune:

Atlas Hive Hook:
{noformat}
atlas.hook.hive.hive_table.ignore.pattern
atlas.hook.hive.hive_table.prune.pattern
{noformat}

Atlas server:

{noformat}
atlas.notification.consumer.preprocess.hive_table.ignore.pattern
atlas.notification.consumer.preprocess.hive_table.prune.pattern
{noformat}

The value of each of these properties should be a comma-separated list of Java 
regular expressions. Tables whose qualifiedName attribute matches any of the 
specified patterns will be ignored/pruned. Note that the qualifiedName of a 
hive_table entity is formed as: dbName.tableName@clusterName. Here are a few 
sample values:

{noformat}
atlas.hook.hive.hive_table.ignore.pattern=temp_db\\..*,temp_db2\\..*
atlas.hook.hive.hive_table.prune.pattern=staging_db\\..*
{noformat}

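Note that "." is a special regular-expression character, so it has to be 
escaped with a backslash; the backslash is doubled in the samples above because 
it is itself an escape character in properties files. For more details on Java 
regular expressions, see 
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.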
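
To see how the sample patterns behave against qualifiedName values, here is a 
minimal standalone sketch. It is not the Atlas implementation: the class and 
method names (TablePatternDemo, compilePatterns, matchesAny) are hypothetical, 
the table and cluster names are made up, and it assumes that a pattern must 
match the entire qualifiedName rather than a substring of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; names here are illustrative, not Atlas APIs.
public class TablePatternDemo {

    // Splits a comma-separated property value into compiled patterns.
    // Naive split: a regex that itself contains a comma (e.g. "a{1,3}")
    // would be broken apart by this approach.
    public static List<Pattern> compilePatterns(String propertyValue) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : propertyValue.split(",")) {
            String trimmed = regex.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // Assumption: the whole qualifiedName must match (Matcher.matches()),
    // so a pattern like "temp_db\..*" does not match "my_temp_db.t1@cl1".
    public static boolean matchesAny(List<Pattern> patterns, String qualifiedName) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(qualifiedName).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The Java literal "temp_db\\..*" is the regex temp_db\..* , i.e. the
        // value after the escaped backslash has been resolved.
        List<Pattern> ignore = compilePatterns("temp_db\\..*,temp_db2\\..*");

        // qualifiedName format: dbName.tableName@clusterName
        System.out.println(matchesAny(ignore, "temp_db.events@cl1"));  // true
        System.out.println(matchesAny(ignore, "temp_db2.events@cl1")); // true
        System.out.println(matchesAny(ignore, "temp_dbx.events@cl1")); // false
        System.out.println(matchesAny(ignore, "sales.orders@cl1"));    // false
    }
}
```

The third case shows why the "." is escaped: without the backslash, "." would 
match any character and temp_dbx.events@cl1 would be ignored as well.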

> Option to ignore/prune metadata for temporary/staging Hive tables
> -----------------------------------------------------------------
>
>                 Key: ATLAS-3006
>                 URL: https://issues.apache.org/jira/browse/ATLAS-3006
>             Project: Atlas
>          Issue Type: Improvement
>          Components:  atlas-core
>            Reporter: Madhan Neethiraj
>            Assignee: Madhan Neethiraj
>            Priority: Major
>             Fix For: 0.8.4, 1.2.0, 2.0.0
>
>         Attachments: ATLAS-3006-branch-0.8.patch, ATLAS-3006.patch
>
>
> It is not uncommon for a Hive deployment to use a large number of 
> staging/temporary tables, which are created periodically to load data into 
> target tables and deleted after completion of data load. A large number of 
> entities are created in Atlas for these staging/temporary tables 
> (tables/columns/column-lineage).
> For staging tables, it is probably not useful to track details like columns 
> and column-lineage in Atlas. Not tracking these details in Atlas can 
> significantly reduce the time it takes to process notifications, and can help 
> in improving the performance overall. Only minimum details of these staging 
> tables can be stored in Atlas, to capture data lineage from source to target 
> table via all intermediate staging tables.
> Also, it would be helpful to ignore tables that are created and deleted 
> during data loading, i.e. temporary tables.
> Configurations should be provided to specify which of the tables are 
> staging/temporary. In addition to supporting this in Hive hook (to avoid 
> generation of large messages for staging/temporary tables), Atlas server 
> should also be updated, to control this further at server side while 
> processing notifications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
