Repository: atlas

Updated Branches:
  refs/heads/master 1fc88ce30 -> 880ea4b69


ATLAS-2647: updated documentation on notification, hooks and basic-search

Project: http://git-wip-us.apache.org/repos/asf/atlas/repo
Commit: http://git-wip-us.apache.org/repos/asf/atlas/commit/880ea4b6
Tree: http://git-wip-us.apache.org/repos/asf/atlas/tree/880ea4b6
Diff: http://git-wip-us.apache.org/repos/asf/atlas/diff/880ea4b6

Branch: refs/heads/master
Commit: 880ea4b693d79be6072da5e64153e4c97f21ea02
Parents: 1fc88ce
Author: Madhan Neethiraj <mad...@apache.org>
Authored: Sat May 5 17:07:02 2018 -0700
Committer: Madhan Neethiraj <mad...@apache.org>
Committed: Mon May 7 10:02:13 2018 -0700

----------------------------------------------------------------------
 docs/src/site/resources/images/add.gif          | Bin 397 -> 0 bytes
 .../resources/images/apache-incubator-logo.png  | Bin 4234 -> 0 bytes
 .../resources/images/apache-maven-project-2.png | Bin 33442 -> 0 bytes
 docs/src/site/resources/images/fix.gif          | Bin 366 -> 0 bytes
 .../site/resources/images/icon_error_sml.gif    | Bin 633 -> 0 bytes
 .../src/site/resources/images/icon_help_sml.gif | Bin 1072 -> 0 bytes
 .../src/site/resources/images/icon_info_sml.gif | Bin 638 -> 0 bytes
 .../site/resources/images/icon_success_sml.gif  | Bin 604 -> 0 bytes
 .../site/resources/images/icon_warning_sml.gif  | Bin 625 -> 0 bytes
 .../images/logos/build-by-maven-black.png       | Bin 2294 -> 0 bytes
 .../images/logos/build-by-maven-white.png       | Bin 2260 -> 0 bytes
 .../resources/images/profiles/pre-release.png   | Bin 32607 -> 0 bytes
 .../site/resources/images/profiles/retired.png  | Bin 22003 -> 0 bytes
 .../site/resources/images/profiles/sandbox.png  | Bin 33010 -> 0 bytes
 docs/src/site/resources/images/remove.gif       | Bin 607 -> 0 bytes
 docs/src/site/resources/images/rss.png          | Bin 474 -> 0 bytes
 .../twiki/search-basic-hive_column-PII.png      | Bin 0 -> 502513 bytes
 ...h-basic-hive_table-customers-or-provider.png | Bin 0 -> 373583 bytes
 ...basic-hive_table-customers-owner_is_hive.png | Bin 0 -> 366589 bytes
 .../twiki/search-basic-hive_table-customers.png | Bin 0 -> 305538 bytes
 docs/src/site/resources/images/update.gif       | Bin 1090 -> 0 bytes
 docs/src/site/twiki/Architecture.twiki          |  10 +-
 docs/src/site/twiki/Bridge-Falcon.twiki         |  52 -------
 docs/src/site/twiki/Bridge-HBase.twiki          |  62 --------
 docs/src/site/twiki/Bridge-Hive.twiki           | 116 ---------------
 docs/src/site/twiki/Bridge-Kafka.twiki          |  49 ++++---
 docs/src/site/twiki/Bridge-Sqoop.twiki          |  42 ------
 docs/src/site/twiki/Hook-Falcon.twiki           |  52 +++++++
 docs/src/site/twiki/Hook-HBase.twiki            |  70 +++++++++
 docs/src/site/twiki/Hook-Hive.twiki             | 132 +++++++++++++++++
 docs/src/site/twiki/Hook-Sqoop.twiki            |  60 ++++++++
 docs/src/site/twiki/Hook-Storm.twiki            | 114 +++++++++++++++
 docs/src/site/twiki/Notification-Entity.twiki   |  33 -----
 docs/src/site/twiki/Notifications.twiki         |  73 ++++++++++
 docs/src/site/twiki/Search-Basic.twiki          | 142 +++++++++----------
 docs/src/site/twiki/StormAtlasHook.twiki        | 114 ---------------
 docs/src/site/twiki/index.twiki                 |  20 +--
 37 files changed, 611 insertions(+), 530 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/add.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/add.gif b/docs/src/site/resources/images/add.gif
deleted file mode 100755
index 1cb3dbf..0000000
Binary files a/docs/src/site/resources/images/add.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/apache-incubator-logo.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/apache-incubator-logo.png b/docs/src/site/resources/images/apache-incubator-logo.png
deleted file mode 100755
index 81fb31e..0000000
Binary files a/docs/src/site/resources/images/apache-incubator-logo.png and /dev/null differ
http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/apache-maven-project-2.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/apache-maven-project-2.png b/docs/src/site/resources/images/apache-maven-project-2.png
deleted file mode 100755
index 6c096ec..0000000
Binary files a/docs/src/site/resources/images/apache-maven-project-2.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/fix.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/fix.gif b/docs/src/site/resources/images/fix.gif
deleted file mode 100755
index b7eb3dc..0000000
Binary files a/docs/src/site/resources/images/fix.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/icon_error_sml.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/icon_error_sml.gif b/docs/src/site/resources/images/icon_error_sml.gif
deleted file mode 100755
index 12e9a01..0000000
Binary files a/docs/src/site/resources/images/icon_error_sml.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/icon_help_sml.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/icon_help_sml.gif b/docs/src/site/resources/images/icon_help_sml.gif
deleted file mode 100755
index aaf20e6..0000000
Binary files a/docs/src/site/resources/images/icon_help_sml.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/icon_info_sml.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/icon_info_sml.gif b/docs/src/site/resources/images/icon_info_sml.gif
deleted file mode 100755
index b776326..0000000
Binary files a/docs/src/site/resources/images/icon_info_sml.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/icon_success_sml.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/icon_success_sml.gif b/docs/src/site/resources/images/icon_success_sml.gif
deleted file mode 100755
index 0a19527..0000000
Binary files a/docs/src/site/resources/images/icon_success_sml.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/icon_warning_sml.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/icon_warning_sml.gif b/docs/src/site/resources/images/icon_warning_sml.gif
deleted file mode 100755
index ac6ad6a..0000000
Binary files a/docs/src/site/resources/images/icon_warning_sml.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/logos/build-by-maven-black.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/logos/build-by-maven-black.png b/docs/src/site/resources/images/logos/build-by-maven-black.png
deleted file mode 100755
index 919fd0f..0000000
Binary files a/docs/src/site/resources/images/logos/build-by-maven-black.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/logos/build-by-maven-white.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/logos/build-by-maven-white.png b/docs/src/site/resources/images/logos/build-by-maven-white.png
deleted file mode 100755
index 7d44c9c..0000000
Binary files a/docs/src/site/resources/images/logos/build-by-maven-white.png and /dev/null differ
http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/profiles/pre-release.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/profiles/pre-release.png b/docs/src/site/resources/images/profiles/pre-release.png
deleted file mode 100755
index d448e85..0000000
Binary files a/docs/src/site/resources/images/profiles/pre-release.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/profiles/retired.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/profiles/retired.png b/docs/src/site/resources/images/profiles/retired.png
deleted file mode 100755
index f89f6a2..0000000
Binary files a/docs/src/site/resources/images/profiles/retired.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/profiles/sandbox.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/profiles/sandbox.png b/docs/src/site/resources/images/profiles/sandbox.png
deleted file mode 100755
index f88b362..0000000
Binary files a/docs/src/site/resources/images/profiles/sandbox.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/remove.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/remove.gif b/docs/src/site/resources/images/remove.gif
deleted file mode 100755
index fc65631..0000000
Binary files a/docs/src/site/resources/images/remove.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/rss.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/rss.png b/docs/src/site/resources/images/rss.png
deleted file mode 100755
index a9850ee..0000000
Binary files a/docs/src/site/resources/images/rss.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/twiki/search-basic-hive_column-PII.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/twiki/search-basic-hive_column-PII.png b/docs/src/site/resources/images/twiki/search-basic-hive_column-PII.png
new file mode 100644
index 0000000..49ff7b6
Binary files /dev/null and b/docs/src/site/resources/images/twiki/search-basic-hive_column-PII.png differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-or-provider.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-or-provider.png b/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-or-provider.png
new file mode 100644
index 0000000..10011cf
Binary files /dev/null and b/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-or-provider.png differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-owner_is_hive.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-owner_is_hive.png b/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-owner_is_hive.png
new file mode 100644
index 0000000..6a9e775
Binary files /dev/null and b/docs/src/site/resources/images/twiki/search-basic-hive_table-customers-owner_is_hive.png differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/twiki/search-basic-hive_table-customers.png
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/twiki/search-basic-hive_table-customers.png b/docs/src/site/resources/images/twiki/search-basic-hive_table-customers.png
new file mode 100644
index 0000000..784e233
Binary files /dev/null and b/docs/src/site/resources/images/twiki/search-basic-hive_table-customers.png differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/resources/images/update.gif
----------------------------------------------------------------------
diff --git a/docs/src/site/resources/images/update.gif b/docs/src/site/resources/images/update.gif
deleted file mode 100755
index b2a6d0b..0000000
Binary files a/docs/src/site/resources/images/update.gif and /dev/null differ

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Architecture.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Architecture.twiki b/docs/src/site/twiki/Architecture.twiki
index d0f1a05..654dbdf 100755
--- a/docs/src/site/twiki/Architecture.twiki
+++ b/docs/src/site/twiki/Architecture.twiki
@@ -48,11 +48,11 @@ notification events. Events are written by the hooks and Atlas to different Kafk
 Atlas supports integration with many sources of metadata out of the box. More integrations will be added in future as well.
 Currently, Atlas supports ingesting and managing metadata from the following sources:
 
- * [[Bridge-Hive][Hive]]
- * [[Bridge-Sqoop][Sqoop]]
- * [[Bridge-Falcon][Falcon]]
- * [[StormAtlasHook][Storm]]
- * HBase - _documentation work-in-progress_
+ * [[Hook-HBase][HBase]]
+ * [[Hook-Hive][Hive]]
+ * [[Hook-Sqoop][Sqoop]]
+ * [[Hook-Storm][Storm]]
+ * [[Bridge-Kafka][Kafka]]
 
 The integration implies two things:
 There are metadata models that Atlas defines natively to represent objects of these components.
http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Bridge-Falcon.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Bridge-Falcon.twiki b/docs/src/site/twiki/Bridge-Falcon.twiki
deleted file mode 100644
index 0cf1645..0000000
--- a/docs/src/site/twiki/Bridge-Falcon.twiki
+++ /dev/null
@@ -1,52 +0,0 @@
----+ Falcon Atlas Bridge
-
----++ Falcon Model
-The default hive model includes the following types:
- * Entity types:
-    * falcon_cluster
-       * super-types: Infrastructure
-       * attributes: timestamp, colo, owner, tags
-    * falcon_feed
-       * super-types: !DataSet
-       * attributes: timestamp, stored-in, owner, groups, tags
-    * falcon_feed_creation
-       * super-types: Process
-       * attributes: timestamp, stored-in, owner
-    * falcon_feed_replication
-       * super-types: Process
-       * attributes: timestamp, owner
-    * falcon_process
-       * super-types: Process
-       * attributes: timestamp, runs-on, owner, tags, pipelines, workflow-properties
-
-One falcon_process entity is created for every cluster that the falcon process is defined for.
-
-The entities are created and de-duped using unique qualifiedName attribute. They provide namespace and can be used for querying/lineage as well. The unique attributes are:
- * falcon_process.qualifiedName - <process name>@<cluster name>
- * falcon_cluster.qualifiedName - <cluster name>
- * falcon_feed.qualifiedName - <feed name>@<cluster name>
- * falcon_feed_creation.qualifiedName - <feed name>
- * falcon_feed_replication.qualifiedName - <feed name>
-
----++ Falcon Hook
-Falcon supports listeners on falcon entity submission. This is used to add entities in Atlas using the model detailed above.
-Follow the instructions below to setup Atlas hook in Falcon:
- * Add 'org.apache.atlas.falcon.service.AtlasService' to application.services in <falcon-conf>/startup.properties
- * Link Atlas hook jars in Falcon classpath - 'ln -s <atlas-home>/hook/falcon/* <falcon-home>/server/webapp/falcon/WEB-INF/lib/'
- * In <falcon_conf>/falcon-env.sh, set an environment variable as follows:
-   <verbatim>
-     export FALCON_SERVER_OPTS="<atlas_home>/hook/falcon/*:$FALCON_SERVER_OPTS"</verbatim>
-
-The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details:
- * atlas.hook.falcon.synchronous - boolean, true to run the hook synchronously. default false
- * atlas.hook.falcon.numRetries - number of retries for notification failure. default 3
- * atlas.hook.falcon.minThreads - core number of threads. default 5
- * atlas.hook.falcon.maxThreads - maximum number of threads. default 5
- * atlas.hook.falcon.keepAliveTime - keep alive time in msecs. default 10
- * atlas.hook.falcon.queueSize - queue size for the threadpool. default 10000
-
-Refer [[Configuration][Configuration]] for notification related configurations
-
-
----++ NOTES
- * In falcon cluster entity, cluster name used should be uniform across components like hive, falcon, sqoop etc. If used with ambari, ambari cluster name should be used for cluster entity

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Bridge-HBase.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Bridge-HBase.twiki b/docs/src/site/twiki/Bridge-HBase.twiki
deleted file mode 100644
index 7a5c908..0000000
--- a/docs/src/site/twiki/Bridge-HBase.twiki
+++ /dev/null
@@ -1,62 +0,0 @@
----+ HBase Atlas Bridge
-
----++ HBase Model
-The default HBase model includes the following types:
- * Entity types:
-    * hbase_namespace
-       * super-types: !Asset
-       * attributes: name, owner, description, type, classifications, term, clustername, parameters, createtime, modifiedtime, qualifiedName
-    * hbase_table
-       * super-types: !DataSet
-       * attributes: name, owner, description, type, classifications, term, uri, column_families, namespace, parameters, createtime, modifiedtime, maxfilesize,
-         isReadOnly, isCompactionEnabled, isNormalizationEnabled, ReplicaPerRegion, Durability, qualifiedName
-    * hbase_column_family
-       * super-types: !DataSet
-       * attributes: name, owner, description, type, classifications, term, columnns, createtime, bloomFilterType, compressionType, CompactionCompressionType, EncryptionType,
-         inMemoryCompactionPolicy, keepDeletedCells, Maxversions, MinVersions, datablockEncoding, storagePolicy, Ttl, blockCachedEnabled, cacheBloomsOnWrite,
-         cacheDataOnWrite, EvictBlocksOnClose, PerfectBlocksOnOpen, NewVersionsBehavior, isMobEnbaled, MobCompactPartitionPolicy, qualifiedName
-
-The entities are created and de-duped using unique qualified name. They provide namespace and can be used for querying as well:
- * hbase_namespace.qualifiedName - <namespace>@<clusterName>
- * hbase_table.qualifiedName - <namespace>:<tableName>@<clusterName>
- * hbase_column_family.qualifiedName - <namespace>:<tableName>.<columnFamily>@<clusterName>
-
-
----++ Importing HBase Metadata
-org.apache.atlas.hbase.bridge.HBaseBridge imports the HBase metadata into Atlas using the model defined above. import-hbase.sh command can be used to facilitate this.
-   <verbatim>
-   Usage 1: <atlas package>/hook-bin/import-hbase.sh
-   Usage 2: <atlas package>/hook-bin/import-hbase.sh [-n <namespace regex> OR --namespace <namespace regex >] [-t <table regex > OR --table <table regex>]
-   Usage 3: <atlas package>/hook-bin/import-hbase.sh [-f <filename>]
-             File Format:
-             namespace1:tbl1
-             namespace1:tbl2
-             namespace2:tbl1
-   </verbatim>
-
-The logs are in <atlas package>/logs/import-hbase.log
-
----++ HBase Hook
-Atlas HBase hook registers with HBase to listen for create/update/delete operations and updates the metadata in Atlas, via Kafka notifications, for the changes in HBase.
-Follow the instructions below to setup Atlas hook in HBase:
- * Set-up Atlas hook in hbase-site.xml by adding the following:
-   <verbatim>
-     <property>
-       <name>hbase.coprocessor.master.classes</name>
-       <value>org.apache.atlas.hbase.hook.HBaseAtlasCoprocessor</value>
-     </property></verbatim>
- * Copy <atlas package>/hook/hbase/<All files and folder> to hbase class path. HBase hook binary files are present in apache-atlas-<release-vesion>-SNAPSHOT-hbase-hook.tar.gz
- * Copy <atlas-conf>/atlas-application.properties to the hbase conf directory.
-
-The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details:
- * atlas.hook.hbase.synchronous - boolean, true to run the hook synchronously. default false. Recommended to be set to false to avoid delays in Hbase operation.
- * atlas.hook.hbase.numRetries - number of retries for notification failure. default 3
- * atlas.hook.hbase.minThreads - core number of threads. default 1
- * atlas.hook.hbase.maxThreads - maximum number of threads. default 5
- * atlas.hook.hbase.keepAliveTime - keep alive time in msecs. default 10
- * atlas.hook.hbase.queueSize - queue size for the threadpool. default 10000
-
-Refer [[Configuration][Configuration]] for notification related configurations
-
----++ NOTES
- * Only the namespace, table and columnfamily create / update / delete operations are caputured by the hook. Columns changes wont be captured and propagated.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Bridge-Hive.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Bridge-Hive.twiki b/docs/src/site/twiki/Bridge-Hive.twiki
deleted file mode 100644
index 7c93ecd..0000000
--- a/docs/src/site/twiki/Bridge-Hive.twiki
+++ /dev/null
@@ -1,116 +0,0 @@
----+ Hive Atlas Bridge
-
----++ Hive Model
-The default hive model includes the following types:
- * Entity types:
-    * hive_db
-       * super-types: Referenceable
-       * attributes: name, clusterName, description, locationUri, parameters, ownerName, ownerType
-    * hive_storagedesc
-       * super-types: Referenceable
-       * attributes: cols, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories
-    * hive_column
-       * super-types: Referenceable
-       * attributes: name, type, comment, table
-    * hive_table
-       * super-types: !DataSet
-       * attributes: name, db, owner, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary
-    * hive_process
-       * super-types: Process
-       * attributes: name, startTime, endTime, userName, operationType, queryText, queryPlan, queryId
-    * hive_column_lineage
-       * super-types: Process
-       * attributes: query, depenendencyType, expression
-
- * Enum types:
-    * hive_principal_type
-       * values: USER, ROLE, GROUP
-
- * Struct types:
-    * hive_order
-       * attributes: col, order
-    * hive_serde
-       * attributes: name, serializationLib, parameters
-
-The entities are created and de-duped using unique qualified name. They provide namespace and can be used for querying/lineage as well. Note that dbName, tableName and columnName should be in lower case. clusterName is explained below.
- * hive_db.qualifiedName - <dbName>@<clusterName>
- * hive_table.qualifiedName - <dbName>.<tableName>@<clusterName>
- * hive_column.qualifiedName - <dbName>.<tableName>.<columnName>@<clusterName>
- * hive_process.queryString - trimmed query string in lower case
-
-
----++ Importing Hive Metadata
-org.apache.atlas.hive.bridge.HiveMetaStoreBridge imports the Hive metadata into Atlas using the model defined above. import-hive.sh command can be used to facilitate this.
-   <verbatim>
-   Usage: <atlas package>/hook-bin/import-hive.sh</verbatim>
-
-The logs are in <atlas package>/logs/import-hive.log
-
-
----++ Hive Hook
-Atlas Hive hook registers with Hive to listen for create/update/delete operations and updates the metadata in Atlas, via Kafka notifications, for the changes in Hive.
-Follow the instructions below to setup Atlas hook in Hive:
- * Set-up Atlas hook in hive-site.xml by adding the following:
-   <verbatim>
-     <property>
-       <name>hive.exec.post.hooks</name>
-       <value>org.apache.atlas.hive.hook.HiveHook</value>
-     </property></verbatim>
- * Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh of your hive configuration
- * Copy <atlas-conf>/atlas-application.properties to the hive conf directory.
-
-The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details:
- * atlas.hook.hive.synchronous - boolean, true to run the hook synchronously. default false. Recommended to be set to false to avoid delays in hive query completion.
- * atlas.hook.hive.numRetries - number of retries for notification failure. default 3
- * atlas.hook.hive.minThreads - core number of threads. default 1
- * atlas.hook.hive.maxThreads - maximum number of threads. default 5
- * atlas.hook.hive.keepAliveTime - keep alive time in msecs. default 10
- * atlas.hook.hive.queueSize - queue size for the threadpool. default 10000
-
-Refer [[Configuration][Configuration]] for notification related configurations
-
----++ Column Level Lineage
-
-Starting from 0.8-incubating version of Atlas, Column level lineage is captured in Atlas. Below are the details
-
----+++ Model
- * !ColumnLineageProcess type is a subtype of Process
-
- * This relates an output Column to a set of input Columns or the Input Table
-
- * The lineage also captures the kind of dependency, as listed below:
-    * SIMPLE: output column has the same value as the input
-    * EXPRESSION: output column is transformed by some expression at runtime (for e.g. a Hive SQL expression) on the Input Columns.
-    * SCRIPT: output column is transformed by a user provided script.
-
- * In case of EXPRESSION dependency the expression attribute contains the expression in string form
-
- * Since Process links input and output !DataSets, Column is a subtype of !DataSet
-
----+++ Examples
-For a simple CTAS below:
-<verbatim>
-create table t2 as select id, name from T1</verbatim>
-
-The lineage is captured as
-
-<img src="images/column_lineage_ex1.png" height="200" width="400" />
-
-
-
----+++ Extracting Lineage from Hive commands
- * The !HiveHook maps the !LineageInfo in the !HookContext to Column lineage instances
-
- * The !LineageInfo in Hive provides column-level lineage for the final !FileSinkOperator, linking them to the input columns in the Hive Query
-
----++ NOTES
- * Column level lineage works with Hive version 1.2.1 after the patch for <a href="https://issues.apache.org/jira/browse/HIVE-13112">HIVE-13112</a> is applied to Hive source
- * Since database name, table name and column names are case insensitive in hive, the corresponding names in entities are lowercase. So, any search APIs should use lowercase while querying on the entity names
- * The following hive operations are captured by hive hook currently
-    * create database
-    * create table/view, create table as select
-    * load, import, export
-    * DMLs (insert)
-    * alter database
-    * alter table (skewed table information, stored as, protection is not supported)
-    * alter view

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Bridge-Kafka.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Bridge-Kafka.twiki b/docs/src/site/twiki/Bridge-Kafka.twiki
index 7cdd548..0a0ed1c 100644
--- a/docs/src/site/twiki/Bridge-Kafka.twiki
+++ b/docs/src/site/twiki/Bridge-Kafka.twiki
@@ -1,34 +1,37 @@
----+ Kafka Atlas Bridge
+---+ Apache Atlas Hook for Apache Kafka
 
 ---++ Kafka Model
-The default Kafka model includes the following types:
+Kafka model includes the following types:
  * Entity types:
     * kafka_topic
        * super-types: !DataSet
-       * attributes: name, owner, description, type, classifications, term, clustername, topic , partitionCount, qualifiedName
+       * attributes: qualifiedName, name, description, owner, topic, uri, partitionCount
 
-The entities are created and de-duped using unique qualified name. They provide namespace and can be used for querying as well:
- * topic.qualifiedName - <topic>@<clusterName>
+Kafka entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below.
+Note that qualifiedName will have topic name in lower case.
+<verbatim>
+    topic.qualifiedName: <topic>@<clusterName>
+</verbatim>
 
 ---++ Setup
-   binary files are present in apache-atlas-<release-vesion>-SNAPSHOT-kafka-hook.tar.gz
-   Copy apache-atlas-kafka-hook-<release-verion>-SNAPSHOT/hook/kafka folder to <atlas package>/hook/ directory
-   Copy apache-atlas-kafka-hook-<release-verion>-SNAPSHOT/hook-bin folder to <atlas package>/hook-bin/ directory
+Binary files are present in apache-atlas-<release-version>-kafka-hook.tar.gz
+
+Copy apache-atlas-kafka-hook-<release-version>/hook/kafka folder to <atlas package>/hook/ directory
+
+Copy apache-atlas-kafka-hook-<release-version>/hook-bin folder to <atlas package>/hook-bin directory
 
- * Copy <atlas-conf>/atlas-application.properties to the Kafka conf directory.
 
 ---++ Importing Kafka Metadata
-org.apache.atlas.Kafka.bridge.KafkaBridge imports the Kafka metadata into Atlas using the model defined above. import-kafka.sh command can be used to facilitate this.
-   <verbatim>
-   Usage 1: <atlas package>/hook-bin/import-kafka.sh
-   Usage 2: <atlas package>/hook-bin/import-kafka.sh [-n <namespace regex> OR --namespace <namespace regex >] [-t <table regex > OR --table <table regex>]
-   Usage 3: <atlas package>/hook-bin/import-kafka.sh [-f <filename>]
-             File Format:
-             topic1
-             topic2
-             topic3
-   </verbatim>
-
-The logs are in <atlas package>/logs/import-kafka.log
-
-Refer [[Configuration][Configuration]] for notification related configurations
+Apache Atlas provides a command-line utility, import-kafka.sh, to import metadata of Apache Kafka topics into Apache Atlas.
+This utility can be used to initialize Apache Atlas with topics present in Apache Kafka.
+This utility supports importing metadata of a specific topic or all topics.
+
+<verbatim>
+Usage 1: <atlas package>/hook-bin/import-kafka.sh
+Usage 2: <atlas package>/hook-bin/import-kafka.sh [-t <topic prefix> OR --topic <topic prefix>]
+Usage 3: <atlas package>/hook-bin/import-kafka.sh [-f <filename>]
+         File Format:
+         topic1
+         topic2
+         topic3
+</verbatim>

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Bridge-Sqoop.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Bridge-Sqoop.twiki b/docs/src/site/twiki/Bridge-Sqoop.twiki
deleted file mode 100644
index 480578b..0000000
--- a/docs/src/site/twiki/Bridge-Sqoop.twiki
+++ /dev/null
@@ -1,42 +0,0 @@
----+ Sqoop Atlas Bridge
-
----++ Sqoop Model
-The default hive model includes the following types:
- * Entity types:
-    * sqoop_process
-       * super-types: Process
-       * attributes: name, operation, dbStore, hiveTable, commandlineOpts, startTime, endTime, userName
-    * sqoop_dbdatastore
-       * super-types: !DataSet
-       * attributes: name, dbStoreType, storeUse, storeUri, source, description, ownerName
-
- * Enum types:
-    * sqoop_operation_type
-       * values: IMPORT, EXPORT, EVAL
-    * sqoop_dbstore_usage
-       * values: TABLE, QUERY, PROCEDURE, OTHER
-
-The entities are created and de-duped using unique qualified name. They provide namespace and can be used for querying as well:
- * sqoop_process.qualifiedName - dbStoreType-storeUri-endTime
- * sqoop_dbdatastore.qualifiedName - dbStoreType-storeUri-source
-
----++ Sqoop Hook
-Sqoop added a !SqoopJobDataPublisher that publishes data to Atlas after completion of import Job. Today, only hiveImport is supported in !SqoopHook.
-This is used to add entities in Atlas using the model detailed above.
-
-Follow the instructions below to setup Atlas hook in Hive:
-
-Add the following properties to to enable Atlas hook in Sqoop:
- * Set-up Atlas hook in <sqoop-conf>/sqoop-site.xml by adding the following:
-   <verbatim>
-     <property>
-       <name>sqoop.job.data.publish.class</name>
-       <value>org.apache.atlas.sqoop.hook.SqoopHook</value>
-     </property></verbatim>
- * Copy <atlas-conf>/atlas-application.properties to to the sqoop conf directory <sqoop-conf>/
- * Link <atlas-home>/hook/sqoop/*.jar in sqoop lib
-
-Refer [[Configuration][Configuration]] for notification related configurations
-
----++ NOTES
- * Only the following sqoop operations are captured by sqoop hook currently - hiveImport

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Hook-Falcon.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Hook-Falcon.twiki b/docs/src/site/twiki/Hook-Falcon.twiki
new file mode 100644
index 0000000..0cf1645
--- /dev/null
+++ b/docs/src/site/twiki/Hook-Falcon.twiki
@@ -0,0 +1,52 @@
+---+ Falcon Atlas Bridge
+
+---++ Falcon Model
+The default hive model includes the following types:
+ * Entity types:
+    * falcon_cluster
+       * super-types: Infrastructure
+       * attributes: timestamp, colo, owner, tags
+    * falcon_feed
+       * super-types: !DataSet
+       * attributes: timestamp, stored-in, owner, groups, tags
+    * falcon_feed_creation
+       * super-types: Process
+       * attributes: timestamp, stored-in, owner
+    * falcon_feed_replication
+       * super-types: Process
+       * attributes: timestamp, owner
+    * falcon_process
+       * super-types: Process
+       * attributes: timestamp, runs-on, owner, tags, pipelines, workflow-properties
+
+One falcon_process entity is created for every cluster that the falcon process is defined for.
+
+The entities are created and de-duped using unique qualifiedName attribute. They provide namespace and can be used for querying/lineage as well. The unique attributes are:
+ * falcon_process.qualifiedName - <process name>@<cluster name>
+ * falcon_cluster.qualifiedName - <cluster name>
+ * falcon_feed.qualifiedName - <feed name>@<cluster name>
+ * falcon_feed_creation.qualifiedName - <feed name>
+ * falcon_feed_replication.qualifiedName - <feed name>
+
+---++ Falcon Hook
+Falcon supports listeners on falcon entity submission. This is used to add entities in Atlas using the model detailed above.
+Follow the instructions below to setup Atlas hook in Falcon:
+ * Add 'org.apache.atlas.falcon.service.AtlasService' to application.services in <falcon-conf>/startup.properties
+ * Link Atlas hook jars in Falcon classpath - 'ln -s <atlas-home>/hook/falcon/* <falcon-home>/server/webapp/falcon/WEB-INF/lib/'
+ * In <falcon_conf>/falcon-env.sh, set an environment variable as follows:
+   <verbatim>
+     export FALCON_SERVER_OPTS="<atlas_home>/hook/falcon/*:$FALCON_SERVER_OPTS"</verbatim>
+
+The following properties in <atlas-conf>/atlas-application.properties control the thread pool and notification details:
+ * atlas.hook.falcon.synchronous - boolean, true to run the hook synchronously. default false
+ * atlas.hook.falcon.numRetries - number of retries for notification failure. default 3
+ * atlas.hook.falcon.minThreads - core number of threads. default 5
+ * atlas.hook.falcon.maxThreads - maximum number of threads. default 5
+ * atlas.hook.falcon.keepAliveTime - keep alive time in msecs. default 10
+ * atlas.hook.falcon.queueSize - queue size for the threadpool. default 10000
+
+Refer [[Configuration][Configuration]] for notification related configurations
+
+
+---++ NOTES
+ * In falcon cluster entity, cluster name used should be uniform across components like hive, falcon, sqoop etc. If used with ambari, ambari cluster name should be used for cluster entity

http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Hook-HBase.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Hook-HBase.twiki b/docs/src/site/twiki/Hook-HBase.twiki
new file mode 100644
index 0000000..4aa9703
--- /dev/null
+++ b/docs/src/site/twiki/Hook-HBase.twiki
@@ -0,0 +1,70 @@
+---+ Apache Atlas Hook & Bridge for Apache HBase
+
+---++ HBase Model
+HBase model includes the following types:
+ * Entity types:
+    * hbase_namespace
+       * super-types: !Asset
+       * attributes: qualifiedName, name, description, owner, clusterName, parameters, createTime, modifiedTime
+    * hbase_table
+       * super-types: !DataSet
+       * attributes: qualifiedName, name, description, owner, namespace, column_families, uri, parameters, createtime, modifiedtime, maxfilesize, isReadOnly, isCompactionEnabled, isNormalizationEnabled, ReplicaPerRegion, Durability
+    * hbase_column_family
+       * super-types: !DataSet
+       * attributes: qualifiedName, name, description, owner, columns, createTime, bloomFilterType, compressionType, compactionCompressionType, encryptionType, inMemoryCompactionPolicy, keepDeletedCells, maxversions, minVersions, datablockEncoding, storagePolicy, ttl, blockCachedEnabled, cacheBloomsOnWrite, cacheDataOnWrite, evictBlocksOnClose, prefetchBlocksOnOpen, newVersionsBehavior, isMobEnabled, mobCompactPartitionPolicy
+
+HBase entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below. Note that namespaceName, tableName and columnFamilyName should be in lower case.
+<verbatim> + hbase_namespace.qualifiedName: <namespaceName>@<clusterName> + hbase_table.qualifiedName: <namespaceName>:<tableName>@<clusterName> + hbase_column_family.qualifiedName: <namespaceName>:<tableName>.<columnFamilyName>@<clusterName> +</verbatim> + + +---++ HBase Hook +Atlas HBase hook registers with HBase master as a co-processor. On detecting changes to HBase namespaces/tables/column-families, Atlas hook updates the metadata in Atlas via Kafka notifications. +Follow the instructions below to setup Atlas hook in HBase: + * Register Atlas hook in hbase-site.xml by adding the following: + <verbatim> + <property> + <name>hbase.coprocessor.master.classes</name> + <value>org.apache.atlas.hbase.hook.HBaseAtlasCoprocessor</value> + </property></verbatim> + * Copy entire contents of folder <atlas package>/hook/hbase to HBase class path. + * Copy <atlas-conf>/atlas-application.properties to the HBase conf directory. + +The following properties in atlas-application.properties control the thread pool and notification details: +<verbatim> +atlas.hook.hbase.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in HBase operations. Default: false +atlas.hook.hbase.numRetries=3 # number of retries for notification failure. Default: 3 +atlas.hook.hbase.queueSize=10000 # queue size for the threadpool. Default: 10000 + +atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary + +atlas.kafka.zookeeper.connect= # Zookeeper connect URL for Kafka. Example: localhost:2181 +atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000 +atlas.kafka.zookeeper.session.timeout.ms=60000 # Zookeeper session timeout. Default: 60000 +atlas.kafka.zookeeper.sync.time.ms=20 # Zookeeper sync time. Default: 20 +</verbatim> + +Other configurations for Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". 
+For the list of configurations supported by the Kafka producer, please refer to [[http://kafka.apache.org/documentation/#producerconfigs][Kafka Producer Configs]] + +---++ NOTES + * Only the namespace, table and column-family create/update/delete operations are captured by Atlas HBase hook. Changes to columns are not captured. + + +---++ Importing HBase Metadata +Apache Atlas provides a command-line utility, import-hbase.sh, to import metadata of Apache HBase namespaces and tables into Apache Atlas. +This utility can be used to initialize Apache Atlas with namespaces/tables present in an Apache HBase cluster. +This utility supports importing metadata of a specific table, tables in a specific namespace or all tables. + +<verbatim> +Usage 1: <atlas package>/hook-bin/import-hbase.sh +Usage 2: <atlas package>/hook-bin/import-hbase.sh [-n <namespace regex> OR --namespace <namespace regex>] [-t <table regex> OR --table <table regex>] +Usage 3: <atlas package>/hook-bin/import-hbase.sh [-f <filename>] + File Format: + namespace1:tbl1 + namespace1:tbl2 + namespace2:tbl1 +</verbatim> http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Hook-Hive.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/Hook-Hive.twiki b/docs/src/site/twiki/Hook-Hive.twiki new file mode 100644 index 0000000..9b2272d --- /dev/null +++ b/docs/src/site/twiki/Hook-Hive.twiki @@ -0,0 +1,132 @@ +---+ Apache Atlas Hook & Bridge for Apache Hive + +---++ Hive Model +Hive model includes the following types: + * Entity types: + * hive_db + * super-types: !Asset + * attributes: qualifiedName, name, description, owner, clusterName, location, parameters, ownerName + * hive_table + * super-types: !DataSet + * attributes: qualifiedName, name, description, owner, db, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary + * hive_column + * 
super-types: !DataSet + * attributes: qualifiedName, name, description, owner, type, comment, table + * hive_storagedesc + * super-types: Referenceable + * attributes: qualifiedName, table, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories + * hive_process + * super-types: Process + * attributes: qualifiedName, name, description, owner, inputs, outputs, startTime, endTime, userName, operationType, queryText, queryPlan, queryId, clusterName + * hive_column_lineage + * super-types: Process + * attributes: qualifiedName, name, description, owner, inputs, outputs, query, depenendencyType, expression + + * Enum types: + * hive_principal_type + * values: USER, ROLE, GROUP + + * Struct types: + * hive_order + * attributes: col, order + * hive_serde + * attributes: name, serializationLib, parameters + +Hive entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below. Note that dbName, tableName and columnName should be in lower case. +<verbatim> + hive_db.qualifiedName: <dbName>@<clusterName> + hive_table.qualifiedName: <dbName>.<tableName>@<clusterName> + hive_column.qualifiedName: <dbName>.<tableName>.<columnName>@<clusterName> + hive_process.queryString: trimmed query string in lower case +</verbatim> + + +---++ Hive Hook +Atlas Hive hook registers with Hive to listen for create/update/delete operations and updates the metadata in Atlas, via Kafka notifications, for the changes in Hive. 
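As an illustration of the unique-attribute scheme described in the Hive Model section above, the qualifiedName values amount to plain string formatting with lowercased object names. The sketch below is illustrative only: the class and method names are hypothetical and are not part of the Apache Atlas API.

```java
// Hypothetical helper: builds hive_* qualifiedName values in the formats
// documented above. Not part of the Apache Atlas codebase.
public class HiveQualifiedName {

    // <dbName>@<clusterName>
    static String forDb(String db, String cluster) {
        return db.toLowerCase() + "@" + cluster;
    }

    // <dbName>.<tableName>@<clusterName>
    static String forTable(String db, String table, String cluster) {
        return db.toLowerCase() + "." + table.toLowerCase() + "@" + cluster;
    }

    // <dbName>.<tableName>.<columnName>@<clusterName>
    static String forColumn(String db, String table, String column, String cluster) {
        return db.toLowerCase() + "." + table.toLowerCase() + "."
                + column.toLowerCase() + "@" + cluster;
    }

    public static void main(String[] args) {
        // dbName, tableName and columnName are lowercased, per the note above.
        System.out.println(forTable("Sales", "Customers", "primary"));       // sales.customers@primary
        System.out.println(forColumn("Sales", "Customers", "Id", "primary")); // sales.customers.id@primary
    }
}
```

Because de-duplication keys on these strings, two hooks reporting the same table with different letter case still resolve to one entity, provided both lowercase the names as required.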
+Follow the instructions below to set up the Atlas hook in Hive: + * Set-up Atlas hook in hive-site.xml by adding the following: + <verbatim> + <property> + <name>hive.exec.post.hooks</name> + <value>org.apache.atlas.hive.hook.HiveHook</value> + </property></verbatim> + * Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh of your hive configuration + * Copy <atlas-conf>/atlas-application.properties to the hive conf directory. + +The following properties in atlas-application.properties control the thread pool and notification details: +<verbatim> +atlas.hook.hive.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Hive query completion. Default: false +atlas.hook.hive.numRetries=3 # number of retries for notification failure. Default: 3 +atlas.hook.hive.queueSize=10000 # queue size for the threadpool. Default: 10000 + +atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary + +atlas.kafka.zookeeper.connect= # Zookeeper connect URL for Kafka. Example: localhost:2181 +atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000 +atlas.kafka.zookeeper.session.timeout.ms=60000 # Zookeeper session timeout. Default: 60000 +atlas.kafka.zookeeper.sync.time.ms=20 # Zookeeper sync time. Default: 20 +</verbatim> + +Other configurations for Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". For the list of configurations supported by the Kafka producer, please refer to [[http://kafka.apache.org/documentation/#producerconfigs][Kafka Producer Configs]] + +---++ Column Level Lineage + +Starting from the 0.8-incubating version of Atlas, column-level lineage is captured in Atlas. 
Below are the details + +---+++ Model + * !ColumnLineageProcess type is a subtype of Process + + * This relates an output Column to a set of input Columns or the Input Table + + * The lineage also captures the kind of dependency, as listed below: + * SIMPLE: output column has the same value as the input + * EXPRESSION: output column is transformed by some expression at runtime (e.g. a Hive SQL expression) on the Input Columns. + * SCRIPT: output column is transformed by a user-provided script. + + * In case of EXPRESSION dependency, the expression attribute contains the expression in string form + + * Since Process links input and output !DataSets, Column is a subtype of !DataSet + +---+++ Examples +For a simple CTAS below: +<verbatim> +create table t2 as select id, name from T1</verbatim> + +The lineage is captured as + +<img src="images/column_lineage_ex1.png" height="200" width="400" /> + + + +---+++ Extracting Lineage from Hive commands + * The !HiveHook maps the !LineageInfo in the !HookContext to Column lineage instances + + * The !LineageInfo in Hive provides column-level lineage for the final !FileSinkOperator, linking them to the input columns in the Hive Query + +---++ NOTES + * Column level lineage works with Hive version 1.2.1 after the patch for <a href="https://issues.apache.org/jira/browse/HIVE-13112">HIVE-13112</a> is applied to Hive source + * Since database name, table name and column names are case insensitive in hive, the corresponding names in entities are lowercase. 
So, any search APIs should use lowercase while querying on the entity names + * The following hive operations are captured by hive hook currently + * create database + * create table/view, create table as select + * load, import, export + * DMLs (insert) + * alter database + * alter table (skewed table information, stored as, protection is not supported) + * alter view + + +---++ Importing Hive Metadata +Apache Atlas provides a command-line utility, import-hive.sh, to import metadata of Apache Hive databases and tables into Apache Atlas. +This utility can be used to initialize Apache Atlas with databases/tables present in Apache Hive. +This utility supports importing metadata of a specific table, tables in a specific database or all databases and tables. + +<verbatim> +Usage 1: <atlas package>/hook-bin/import-hive.sh +Usage 2: <atlas package>/hook-bin/import-hive.sh [-d <database regex> OR --database <database regex>] [-t <table regex> OR --table <table regex>] +Usage 3: <atlas package>/hook-bin/import-hive.sh [-f <filename>] + File Format: + database1:tbl1 + database1:tbl2 + database2:tbl1 +</verbatim> http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Hook-Sqoop.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/Hook-Sqoop.twiki b/docs/src/site/twiki/Hook-Sqoop.twiki new file mode 100644 index 0000000..788f46b --- /dev/null +++ b/docs/src/site/twiki/Hook-Sqoop.twiki @@ -0,0 +1,60 @@ +---+ Apache Atlas Hook for Apache Sqoop + +---++ Sqoop Model +Sqoop model includes the following types: + * Entity types: + * sqoop_process + * super-types: Process + * attributes: qualifiedName, name, description, owner, inputs, outputs, operation, commandlineOpts, startTime, endTime, userName + * sqoop_dbdatastore + * super-types: !DataSet + * attributes: qualifiedName, name, description, owner, dbStoreType, storeUse, storeUri, source + + * Enum types: + * sqoop_operation_type + * values: IMPORT, 
EXPORT, EVAL + * sqoop_dbstore_usage + * values: TABLE, QUERY, PROCEDURE, OTHER + +Sqoop entities are created and de-duped in Atlas using unique attribute qualifiedName, whose value should be formatted as detailed below. +<verbatim> + sqoop_process.qualifiedName: sqoop <operation> --connect <url> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>] + sqoop_dbdatastore.qualifiedName: <storeType> --url <storeUri> {[--table <tableName>] || [--database <databaseName>]} [--query <storeQuery>] --hive-<operation> --hive-database <databaseName> [--hive-table <tableName>] --hive-cluster <clusterName> +</verbatim> + +---++ Sqoop Hook +Sqoop added a !SqoopJobDataPublisher that publishes data to Atlas after completion of an import job. Today, only hiveImport is supported in !SqoopHook. +This is used to add entities in Atlas using the model detailed above. + +Follow the instructions below to set up the Atlas hook in Sqoop: + * Set-up Atlas hook in <sqoop-conf>/sqoop-site.xml by adding the following: + <verbatim> + <property> + <name>sqoop.job.data.publish.class</name> + <value>org.apache.atlas.sqoop.hook.SqoopHook</value> + </property></verbatim> + * Copy <atlas-conf>/atlas-application.properties to the sqoop conf directory <sqoop-conf>/ + * Link <atlas-home>/hook/sqoop/*.jar in sqoop lib + + +The following properties in atlas-application.properties control the thread pool and notification details: +<verbatim> +atlas.hook.sqoop.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Sqoop operation completion. Default: false +atlas.hook.sqoop.numRetries=3 # number of retries for notification failure. Default: 3 +atlas.hook.sqoop.queueSize=10000 # queue size for the threadpool. Default: 10000 + +atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary + +atlas.kafka.zookeeper.connect= # Zookeeper connect URL for Kafka. 
Example: localhost:2181 +atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000 +atlas.kafka.zookeeper.session.timeout.ms=60000 # Zookeeper session timeout. Default: 60000 +atlas.kafka.zookeeper.sync.time.ms=20 # Zookeeper sync time. Default: 20 +</verbatim> + +Other configurations for Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". For the list of configurations supported by the Kafka producer, please refer to [[http://kafka.apache.org/documentation/#producerconfigs][Kafka Producer Configs]] + +---++ NOTES + * Only the following sqoop operations are captured by sqoop hook currently + * hiveImport http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Hook-Storm.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/Hook-Storm.twiki b/docs/src/site/twiki/Hook-Storm.twiki new file mode 100644 index 0000000..dffa35a --- /dev/null +++ b/docs/src/site/twiki/Hook-Storm.twiki @@ -0,0 +1,114 @@ +---+ Apache Atlas Hook for Apache Storm + +---++ Introduction + +Apache Storm is a distributed real-time computation system. Storm makes it +easy to reliably process unbounded streams of data, doing for real-time +processing what Hadoop did for batch processing. The process is essentially +a DAG of nodes, which is called a *topology*. + +Apache Atlas is a metadata repository that enables end-to-end data lineage, +search, and association of business classifications. + +The goal of this integration is to push the operational topology +metadata along with the underlying data source(s), target(s), derivation +processes and any available business context so Atlas can capture the +lineage for this topology. + +There are 2 parts in this process detailed below: + * Data model to represent the concepts in Storm + * Storm Atlas Hook to update metadata in Atlas + + +---++ Storm Data Model + +A data model is represented as Types in Atlas. 
It contains the descriptions +of various nodes in the topology graph, such as spouts and bolts and the +corresponding producer and consumer types. + +The following types are added in Atlas. + + * storm_topology - represents the coarse-grained topology. A storm_topology derives from an Atlas Process type and hence can be used to inform Atlas about lineage. + * Following data sets are added - kafka_topic, jms_topic, hbase_table, hdfs_data_set. These all derive from an Atlas Dataset type and hence form the end points of a lineage graph. + * storm_spout - Data Producer having outputs, typically Kafka, JMS + * storm_bolt - Data Consumer having inputs and outputs, typically Hive, HBase, HDFS, etc. + +The Storm Atlas hook auto registers dependent models like the Hive data model +if it finds that these are not known to the Atlas server. + +The data model for each of the types is described in +the class definition at org.apache.atlas.storm.model.StormDataModel. + +---++ Storm Atlas Hook + +Atlas is notified when a new topology is registered successfully in +Storm. Storm provides a hook, backtype.storm.ISubmitterHook, at the Storm client used to +submit a storm topology. + +The Storm Atlas hook intercepts the hook post execution and extracts the metadata from the +topology and updates Atlas using the types defined. Atlas implements the +Storm client hook interface in org.apache.atlas.storm.hook.StormAtlasHook. + + +---++ Limitations + +The following apply for the first version of the integration. + + * Only new topology submissions are registered with Atlas, any lifecycle changes are not reflected in Atlas. + * The Atlas server needs to be online when a Storm topology is submitted for the metadata to be captured. + * The Hook currently does not support capturing lineage for custom spouts and bolts. + + +---++ Installation + +The Storm Atlas Hook needs to be manually installed in Storm on the client side. 
The hook +artifacts are available at: $ATLAS_PACKAGE/hook/storm + +Storm Atlas hook jars need to be copied to $STORM_HOME/extlib. +Replace STORM_HOME with the Storm installation path. + +Restart all daemons after you have installed the atlas hook into Storm. + + +---++ Configuration + +---+++ Storm Configuration + +The Storm Atlas Hook needs to be configured in Storm client config +in *$STORM_HOME/conf/storm.yaml* as: + +<verbatim> +storm.topology.submission.notifier.plugin.class: "org.apache.atlas.storm.hook.StormAtlasHook" +</verbatim> + +Also set a 'cluster name' that would be used as a namespace for objects registered in Atlas. +This name would be used for namespacing the Storm topology, spouts and bolts. + +The other objects like data sets should ideally be identified with the cluster name of +the components that generate them. For example, Hive tables and databases should be +identified using the cluster name set in Hive. The Storm Atlas hook will pick this up +if the Hive configuration is available in the Storm topology jar that is submitted on +the client and the cluster name is defined there. This happens similarly for HBase +data sets. In case this configuration is not available, the cluster name set in the Storm +configuration will be used. + +<verbatim> +atlas.cluster.name: "cluster_name" +</verbatim> + +In *$STORM_HOME/conf/storm_env.ini*, set an environment variable as follows: + +<verbatim> +STORM_JAR_JVM_OPTS:"-Datlas.conf=$ATLAS_HOME/conf/" +</verbatim> + +where ATLAS_HOME points to the Atlas installation directory. + +You could also set this up programmatically in Storm Config as: + +<verbatim> + Config stormConf = new Config(); + ... 
+ stormConf.put(Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN, + org.apache.atlas.storm.hook.StormAtlasHook.class.getName()); +</verbatim> http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Notification-Entity.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/Notification-Entity.twiki b/docs/src/site/twiki/Notification-Entity.twiki deleted file mode 100644 index 9d883fc..0000000 --- a/docs/src/site/twiki/Notification-Entity.twiki +++ /dev/null @@ -1,33 +0,0 @@ ----+ Entity Change Notifications - -To receive Atlas entity notifications a consumer should be obtained through the notification interface. Entity change notifications are sent every time a change is made to an entity. Operations that result in an entity change notification are: - * <code>ENTITY_CREATE</code> - Create a new entity. - * <code>ENTITY_UPDATE</code> - Update an attribute of an existing entity. - * <code>TRAIT_ADD</code> - Add a trait to an entity. - * <code>TRAIT_DELETE</code> - Delete a trait from an entity. - - <verbatim> - // Obtain provider through injection… - Provider<NotificationInterface> provider; - - // Get the notification interface - NotificationInterface notification = provider.get(); - - // Create consumers - List<NotificationConsumer<EntityNotification>> consumers = - notification.createConsumers(NotificationInterface.NotificationType.ENTITIES, 1); -</verbatim> - - -The consumer exposes the Iterator interface that should be used to get the entity notifications as they are posted. The hasNext() method blocks until a notification is available. 
- -<verbatim> - while(consumer.hasNext()) { - EntityNotification notification = consumer.next(); - - IReferenceableInstance entity = notification.getEntity(); - … - } -</verbatim> - - http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Notifications.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/Notifications.twiki b/docs/src/site/twiki/Notifications.twiki new file mode 100644 index 0000000..fb1e694 --- /dev/null +++ b/docs/src/site/twiki/Notifications.twiki @@ -0,0 +1,73 @@ +---+ Notifications + +---++ Notifications from Apache Atlas +Apache Atlas sends notifications about metadata changes to a Kafka topic named ATLAS_ENTITIES. +Applications interested in metadata changes can monitor for these notifications. +For example, Apache Ranger processes these notifications to authorize data access based on classifications. + + +---+++ Notifications - V2: Apache Atlas version 1.0 +Apache Atlas 1.0 sends notifications for the following operations on metadata. + +<verbatim> + ENTITY_CREATE: sent when an entity instance is created + ENTITY_UPDATE: sent when an entity instance is updated + ENTITY_DELETE: sent when an entity instance is deleted + CLASSIFICATION_ADD: sent when classifications are added to an entity instance + CLASSIFICATION_UPDATE: sent when classifications of an entity instance are updated + CLASSIFICATION_DELETE: sent when classifications are removed from an entity instance +</verbatim> + +Notifications include the following data. +<verbatim> + AtlasEntity entity; + OperationType operationType; + List<AtlasClassification> classifications; +</verbatim> + +---+++ Notifications - V1: Apache Atlas version 0.8.x and earlier +Notifications from Apache Atlas version 0.8.x and earlier have content formatted differently, as detailed below. 
+ +__Operations__ +<verbatim> + ENTITY_CREATE: sent when an entity instance is created + ENTITY_UPDATE: sent when an entity instance is updated + ENTITY_DELETE: sent when an entity instance is deleted + TRAIT_ADD: sent when classifications are added to an entity instance + TRAIT_UPDATE: sent when classifications of an entity instance are updated + TRAIT_DELETE: sent when classifications are removed from an entity instance +</verbatim> + +Notifications include the following data. +<verbatim> + Referenceable entity; + OperationType operationType; + List<Struct> traits; +</verbatim> + +Apache Atlas 1.0 can be configured to send notifications in the older version format, instead of the latest version format. +This can be helpful in deployments that are not yet ready to process notifications in the latest version format. +To configure Apache Atlas 1.0 to send notifications in the earlier version format, please set the following configuration in + atlas-application.properties: + +<verbatim> + atlas.notification.entity.version=v1 +</verbatim> + +---++ Notifications to Apache Atlas +Apache Atlas can be notified of metadata changes and lineage via notifications to a Kafka topic named ATLAS_HOOK. +Atlas hooks for Apache Hive/Apache HBase/Apache Storm/Apache Sqoop use this mechanism to notify Apache Atlas of events of interest. + +<verbatim> +ENTITY_CREATE : create an entity. For more details, refer to Java class HookNotificationV1.EntityCreateRequest +ENTITY_FULL_UPDATE : update an entity. For more details, refer to Java class HookNotificationV1.EntityUpdateRequest +ENTITY_PARTIAL_UPDATE : update specific attributes of an entity. For more details, refer to HookNotificationV1.EntityPartialUpdateRequest +ENTITY_DELETE : delete an entity. For more details, refer to Java class HookNotificationV1.EntityDeleteRequest +ENTITY_CREATE_V2 : create an entity. For more details, refer to Java class HookNotification.EntityCreateRequestV2 +ENTITY_FULL_UPDATE_V2 : update an entity. 
For more details, refer to Java class HookNotification.EntityUpdateRequestV2 +ENTITY_PARTIAL_UPDATE_V2 : update specific attributes of an entity. For more details, refer to HookNotification.EntityPartialUpdateRequestV2 +ENTITY_DELETE_V2 : delete one or more entities. For more details, refer to Java class HookNotification.EntityDeleteRequestV2 +</verbatim> + + + http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/Search-Basic.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/Search-Basic.twiki b/docs/src/site/twiki/Search-Basic.twiki index 367b945..910b50a 100644 --- a/docs/src/site/twiki/Search-Basic.twiki +++ b/docs/src/site/twiki/Search-Basic.twiki @@ -7,114 +7,111 @@ The entire query structure can be represented using the following JSON structure <verbatim> { - "typeName": "hive_table", + "typeName": "hive_column", "excludeDeletedEntities": true, - "classification" : "", - "query": "", - "limit": 25, - "offset": 0, - "entityFilters": { - "attributeName": "name", - "operator": "contains", - "attributeValue": "testtable" - }, - "tagFilters": null, - "attributes": [""] + "classification": "PII", + "query": "", + "offset": 0, + "limit": 25, + "entityFilters": { }, + "tagFilters": { }, + "attributes": [ "table", "qualifiedName"] } </verbatim> __Field description__ - * typeName: The type of entity to look for - * excludeDeletedEntities: Should the search include deleted entities too (default: true) - * classification: Only include entities with given Classification/tag - * query: Any free text occurrence that the entity should have (generic/wildcard queries might be slow) - * limit: Max number of results to fetch - * offset: Starting offset of the result set (useful for pagination) - * entityFilters: Entity Attribute filter(s) - * tagFilters: Classification/tag Attribute filter(s) - * attributes: Attributes to include in the search result (default: include any attribute present in 
the filter) +<verbatim> + typeName: the type of entity to look for + excludeDeletedEntities: should the search exclude deleted entities? (default: true) + classification: only include entities with given classification + query: any free text occurrence that the entity should have (generic/wildcard queries might be slow) + offset: starting offset of the result set (useful for pagination) + limit: max number of results to fetch + entityFilters: entity attribute filter(s) + tagFilters: classification attribute filter(s) + attributes: attributes to include in the search result +</verbatim> - Attribute based filtering can be done on multiple attributes with AND/OR condition. +<img src="images/twiki/search-basic-hive_column-PII.png" height="400" width="600"/> - *NOTE: The tagFilters and entityFilters field have same JSON structure.* + Attribute based filtering can be done on multiple attributes with AND/OR conditions. __Examples of filtering (for hive_table attributes)__ * Single attribute <verbatim> { - "typeName": "hive_table", + "typeName": "hive_table", "excludeDeletedEntities": true, - "classification" : "", - "query": "", - "limit": 50, - "offset": 0, + "offset": 0, + "limit": 25, "entityFilters": { - "attributeName": "name", - "operator": "contains", - "attributeValue": "testtable" + "attributeName": "name", + "operator": "contains", + "attributeValue": "customers" }, - "tagFilters": null, - "attributes": [""] + "attributes": [ "db", "qualifiedName" ] } </verbatim> + +<img src="images/twiki/search-basic-hive_table-customers.png" height="400" width="600"/> + * Multi-attribute with OR <verbatim> { - "typeName": "hive_table", + "typeName": "hive_table", "excludeDeletedEntities": true, - "classification" : "", - "query": "", - "limit": 50, - "offset": 0, + "offset": 0, + "limit": 25, "entityFilters": { "condition": "OR", "criterion": [ { - "attributeName": "name", - "operator": "contains", - "attributeValue": "testtable" + "attributeName": "name", + "operator": 
"contains", + "attributeValue": "customers" }, { - "attributeName": "owner", - "operator": "eq", - "attributeValue": "admin" + "attributeName": "name", + "operator": "contains", + "attributeValue": "provider" } ] }, - "tagFilters": null, - "attributes": [""] + "attributes": [ "db", "qualifiedName" ] } </verbatim> + +<img src="images/twiki/search-basic-hive_table-customers-or-provider.png" height="400" width="600"/> + * Multi-attribute with AND <verbatim> { - "typeName": "hive_table", + "typeName": "hive_table", "excludeDeletedEntities": true, - "classification" : "", - "query": "", - "limit": 50, - "offset": 0, + "offset": 0, + "limit": 25, "entityFilters": { "condition": "AND", "criterion": [ { - "attributeName": "name", - "operator": "contains", - "attributeValue": "testtable" + "attributeName": "name", + "operator": "contains", + "attributeValue": "customers" }, { - "attributeName": "owner", - "operator": "eq", - "attributeValue": "admin" + "attributeName": "owner", + "operator": "eq", + "attributeValue": "hive" } ] }, - "tagFilters": null, - "attributes": [""] - } + "attributes": [ "db", "qualifiedName" ] + } </verbatim> +<img src="images/twiki/search-basic-hive_table-customers-owner_is_hive.png" height="400" width="600"/> + __Supported operators for filtering__ * LT (symbols: <, lt) works with Numeric, Date attributes @@ -135,29 +132,28 @@ __CURL Samples__ -u <user>:<password> -X POST -d '{ - "typeName": "hive_table", + "typeName": "hive_table", "excludeDeletedEntities": true, - "classification" : "", - "query": "", - "limit": 50, - "offset": 0, + "classification": "", + "query": "", + "offset": 0, + "limit": 50, "entityFilters": { "condition": "AND", "criterion": [ { - "attributeName": "name", - "operator": "contains", - "attributeValue": "testtable" + "attributeName": "name", + "operator": "contains", + "attributeValue": "customers" }, { - "attributeName": "owner", - "operator": "eq", - "attributeValue": "admin" + "attributeName": "owner", + "operator": 
"eq", + "attributeValue": "hive" } ] }, - "tagFilters": null, - "attributes": [""] + "attributes": [ "db", "qualifiedName" ] }' <protocol>://<atlas_host>:<atlas_port>/api/atlas/v2/search/basic </verbatim> http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/StormAtlasHook.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/StormAtlasHook.twiki b/docs/src/site/twiki/StormAtlasHook.twiki deleted file mode 100644 index 3e560db..0000000 --- a/docs/src/site/twiki/StormAtlasHook.twiki +++ /dev/null @@ -1,114 +0,0 @@ ----+ Storm Atlas Bridge - ----++ Introduction - -Apache Storm is a distributed real-time computation system. Storm makes it -easy to reliably process unbounded streams of data, doing for real-time -processing what Hadoop did for batch processing. The process is essentially -a DAG of nodes, which is called a *topology*. - -Apache Atlas is a metadata repository that enables end-to-end data lineage, -search and association of business classifications. - -The goal of this integration is to push the operational topology -metadata along with the underlying data source(s), target(s), derivation -processes and any available business context so Atlas can capture the -lineage for this topology. - -There are two parts in this process, detailed below: - * Data model to represent the concepts in Storm - * Storm Atlas Hook to update metadata in Atlas - - ----++ Storm Data Model - -A data model is represented as Types in Atlas. It contains the descriptions -of various nodes in the topology graph, such as spouts and bolts and the -corresponding producer and consumer types. - -The following types are added in Atlas: - - * storm_topology - represents the coarse-grained topology. A storm_topology derives from an Atlas Process type and hence can be used to inform Atlas about lineage. - * The following data sets are added - kafka_topic, jms_topic, hbase_table, hdfs_data_set.
These all derive from an Atlas Dataset type and hence form the end points of a lineage graph. - * storm_spout - Data Producer having outputs, typically Kafka, JMS - * storm_bolt - Data Consumer having inputs and outputs, typically Hive, HBase, HDFS, etc. - -The Storm Atlas hook auto-registers dependent models like the Hive data model -if it finds that these are not known to the Atlas server. - -The data model for each of the types is described in -the class definition at org.apache.atlas.storm.model.StormDataModel. - ----++ Storm Atlas Hook - -Atlas is notified when a new topology is registered successfully in -Storm. Storm provides a hook, backtype.storm.ISubmitterHook, at the Storm client used to -submit a Storm topology. - -The Storm Atlas hook intercepts this hook post-execution, extracts the metadata from the -topology and updates Atlas using the types defined. Atlas implements the -Storm client hook interface in org.apache.atlas.storm.hook.StormAtlasHook. - - ----++ Limitations - -The following limitations apply to the first version of the integration. - - * Only new topology submissions are registered with Atlas; lifecycle changes are not reflected in Atlas. - * The Atlas server needs to be online when a Storm topology is submitted for the metadata to be captured. - * The hook currently does not support capturing lineage for custom spouts and bolts. - - ----++ Installation - -The Storm Atlas Hook needs to be manually installed in Storm on the client side. The hook -artifacts are available at: $ATLAS_PACKAGE/hook/storm - -Storm Atlas hook jars need to be copied to $STORM_HOME/extlib. -Replace STORM_HOME with the Storm installation path. -Restart all daemons after you have installed the Atlas hook into Storm. 
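The installation steps above (copying the hook jars, then restarting the daemons) can be sketched as a shell session. The concrete paths below are illustrative placeholders, not defaults shipped with either product; substitute your actual Atlas package and Storm installation locations.

```shell
# Sketch of the manual client-side installation described above.
# ATLAS_PACKAGE and STORM_HOME are illustrative placeholder paths.
ATLAS_PACKAGE=/tmp/atlas-demo/atlas
STORM_HOME=/tmp/atlas-demo/storm

# Demo scaffolding only, so the sketch is runnable anywhere; a real
# installation already has these directories and jars in place.
mkdir -p "$ATLAS_PACKAGE/hook/storm" "$STORM_HOME/extlib"
touch "$ATLAS_PACKAGE/hook/storm/atlas-storm-hook.jar"

# The actual installation step: copy the hook jars into Storm's extlib,
# then restart the Storm daemons so the hook gets loaded.
cp "$ATLAS_PACKAGE"/hook/storm/*.jar "$STORM_HOME/extlib/"
ls "$STORM_HOME/extlib"
```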
- - ----++ Configuration - ----+++ Storm Configuration - -The Storm Atlas Hook needs to be configured in the Storm client config -in *$STORM_HOME/conf/storm.yaml* as: - -<verbatim> -storm.topology.submission.notifier.plugin.class: "org.apache.atlas.storm.hook.StormAtlasHook" -</verbatim> - -Also set a 'cluster name' that would be used as a namespace for objects registered in Atlas. -This name would be used for namespacing the Storm topology, spouts and bolts. - -The other objects like data sets should ideally be identified with the cluster name of -the components that generate them. For example, Hive tables and databases should be -identified using the cluster name set in Hive. The Storm Atlas hook will pick this up -if the Hive configuration is available in the Storm topology jar that is submitted on -the client and the cluster name is defined there. This happens similarly for HBase -data sets. In case this configuration is not available, the cluster name set in the Storm -configuration will be used. - -<verbatim> -atlas.cluster.name: "cluster_name" -</verbatim> - -In *$STORM_HOME/conf/storm_env.ini*, set an environment variable as follows: - -<verbatim> -STORM_JAR_JVM_OPTS:"-Datlas.conf=$ATLAS_HOME/conf/" -</verbatim> - -where ATLAS_HOME points to the Atlas installation directory. - -You could also set this up programmatically in the Storm Config as: - -<verbatim> - Config stormConf = new Config(); - ... 
- stormConf.put(Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN, - org.apache.atlas.storm.hook.StormAtlasHook.class.getName()); -</verbatim> http://git-wip-us.apache.org/repos/asf/atlas/blob/880ea4b6/docs/src/site/twiki/index.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/index.twiki b/docs/src/site/twiki/index.twiki index df7e7a3..258dfbb 100755 --- a/docs/src/site/twiki/index.twiki +++ b/docs/src/site/twiki/index.twiki @@ -24,6 +24,7 @@ capabilities around these data assets for data scientists, analysts and the data * Ability to dynamically create classifications - like PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE * Classifications can include attributes - like expiry_date attribute in EXPIRES_ON classification * Entities can be associated with multiple classifications, enabling easier discovery and security enforcement + * Propagation of classifications via lineage - automatically ensures that classifications follow the data as it goes through various processing steps ---+++ Lineage * Intuitive UI to view lineage of data as it moves through various processes @@ -35,7 +36,8 @@ capabilities around these data assets for data scientists, analysts and the data * SQL-like query language to search entities - Domain Specific Language (DSL) ---+++ Security & Data Masking - * Integration with Apache Ranger enables authorization/data-masking based on classifications associated with entities in Apache Atlas. For example: + * Fine-grained security for metadata access, enabling controls on access to entity instances and operations like add/update/remove classifications + * Integration with Apache Ranger enables authorization/data-masking on data access based on classifications associated with entities in Apache Atlas. 
For example: * who can access data classified as PII, SENSITIVE * customer-service users can only see the last 4 digits of columns classified as NATIONAL_ID @@ -50,20 +52,18 @@ capabilities around these data assets for data scientists, analysts and the data * [[Architecture][High Level Architecture]] * [[TypeSystem][Type System]] - * [[Search - Basic][Basic Search]] - * [[Search - Advanced][Advanced Search]] + * [[Search - Basic][Search: Basic]] + * [[Search - Advanced][Search: Advanced]] * [[security][Security]] * [[Authentication-Authorization][Authentication and Authorization]] * [[Configuration][Configuration]] - * Notification - * [[Notification-Entity][Entity Notification]] + * [[Notifications][Notifications]] * Hooks & Bridges - * [[Bridge-HBase][HBase Hook & Bridge]] - * [[Bridge-Hive][Hive Hook & Bridge]] + * [[Hook-HBase][HBase Hook & Bridge]] + * [[Hook-Hive][Hive Hook & Bridge]] + * [[Hook-Sqoop][Sqoop Hook]] + * [[Hook-Storm][Storm Hook]] * [[Bridge-Kafka][Kafka Bridge]] - * [[Bridge-Sqoop][Sqoop Hook]] - * [[StormAtlasHook][Storm Hook]] - * [[Bridge-Falcon][Falcon Hook]] * [[HighAvailability][Fault Tolerance And High Availability Options]] ---++ API Documentation
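Returning to the basic-search CURL sample earlier in this change: the request can be packaged as a small shell sketch. The endpoint host/port and the admin:admin credentials are illustrative placeholders for a real deployment; the JSON body mirrors the multi-attribute AND filter from the documentation. The final command is echoed rather than executed so the sketch works without a live Atlas server; drop the leading `echo` to issue the request for real.

```shell
# Illustrative basic-search invocation; localhost:21000 and admin:admin
# are placeholder values, not guaranteed defaults for your deployment.
ATLAS_ENDPOINT="http://localhost:21000/api/atlas/v2/search/basic"

# Request body mirroring the multi-attribute AND example above.
SEARCH_PARAMS=$(cat <<'EOF'
{
  "typeName":               "hive_table",
  "excludeDeletedEntities": true,
  "offset":                 0,
  "limit":                  25,
  "entityFilters": {
    "condition": "AND",
    "criterion": [
      { "attributeName": "name",  "operator": "contains", "attributeValue": "customers" },
      { "attributeName": "owner", "operator": "eq",       "attributeValue": "hive" }
    ]
  },
  "attributes": [ "db", "qualifiedName" ]
}
EOF
)

# Echoed rather than executed, so the sketch runs without a server.
echo curl -s -u admin:admin -X POST -H 'Content-Type: application/json' \
     -d "$SEARCH_PARAMS" "$ATLAS_ENDPOINT"
```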