Repository: atlas Updated Branches: refs/heads/master c65586f13 -> 5c2f7a0c8 (forced update)
http://git-wip-us.apache.org/repos/asf/atlas/blob/5c2f7a0c/docs/src/site/twiki/TypeSystem.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/TypeSystem.twiki b/docs/src/site/twiki/TypeSystem.twiki index b658cfa..397a1bb 100755 --- a/docs/src/site/twiki/TypeSystem.twiki +++ b/docs/src/site/twiki/TypeSystem.twiki @@ -1,7 +1,6 @@ ---+ Type System ---++ Overview - Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called "types". Instances of "types" called "entities" represent the actual metadata objects that are managed. The Type System is a component that allows users to define and manage the types and entities. All metadata objects managed by @@ -9,7 +8,6 @@ Atlas out of the box (like Hive tables, for e.g.) are modelled using types and r types of metadata in Atlas, one needs to understand the concepts of the type system component. ---++ Types - A "Type" in Atlas is a definition of how a particular type of metadata objects are stored and accessed. A type represents one or a collection of attributes that define the properties for the metadata object. Users with a development background will recognize the similarity of a type to a "Class" definition of object oriented programming @@ -19,108 +17,100 @@ An example of a type that comes natively defined with Atlas is a Hive table. 
A H attributes: <verbatim> -Name: hive_table -MetaType: Class -SuperTypes: DataSet +Name: hive_table +TypeCategory: Entity +SuperTypes: DataSet Attributes: - name: String (name of the table) - db: Database object of type hive_db - owner: String - createTime: Date - lastAccessTime: Date - comment: String - retention: int - sd: Storage Description object of type hive_storagedesc - partitionKeys: Array of objects of type hive_column - aliases: Array of strings - columns: Array of objects of type hive_column - parameters: Map of String keys to String values - viewOriginalText: String - viewExpandedText: String - tableType: String - temporary: Boolean -</verbatim> + name: string + db: hive_db + owner: string + createTime: date + lastAccessTime: date + comment: string + retention: int + sd: hive_storagedesc + partitionKeys: array<hive_column> + aliases: array<string> + columns: array<hive_column> + parameters: map<string,string> + viewOriginalText: string + viewExpandedText: string + tableType: string + temporary: boolean</verbatim> The following points can be noted from the above example: * A type in Atlas is identified uniquely by a "name" - * A type has a metatype. A metatype represents the type of this model in Atlas. Atlas has the following metatypes: - * Basic metatypes: E.g. Int, String, Boolean etc. + * A type has a metatype. Atlas has the following metatypes: + * Primitive metatypes: boolean, byte, short, int, long, float, double, biginteger, bigdecimal, string, date * Enum metatypes - * Collection metatypes: E.g. Array, Map - * Composite metatypes: E.g. Class, Struct, Trait - * A type can "extend" from a parent type called "supertype" - by virtue of this, it will get to include the attributes that are defined in the supertype as well. This allows modellers to define common attributes across a set of related types etc. This is again similar to the concept of how Object Oriented languages define super classes for a class. 
It is also possible for a type in Atlas to extend from multiple super types. + * Collection metatypes: array, map + * Composite metatypes: Entity, Struct, Classification, Relationship + * Entity & Classification types can "extend" from other types, called "supertype" - by virtue of this, it will get to include the attributes that are defined in the supertype as well. This allows modellers to define common attributes across a set of related types etc. This is again similar to the concept of how Object Oriented languages define super classes for a class. It is also possible for a type in Atlas to extend from multiple super types. * In this example, every hive table extends from a pre-defined supertype called a "DataSet". More details about this pre-defined types will be provided later. - * Types which have a metatype of "Class", "Struct" or "Trait" can have a collection of attributes. Each attribute has a name (e.g. "name") and some other associated properties. A property can be referred to using an expression type_name.attribute_name. It is also good to note that attributes themselves are defined using Atlas metatypes. + * Types which have a metatype of "Entity", "Struct", "Classification" or "Relationship" can have a collection of attributes. Each attribute has a name (e.g. "name") and some other associated properties. A property can be referred to using an expression type_name.attribute_name. It is also good to note that attributes themselves are defined using Atlas metatypes. * In this example, hive_table.name is a String, hive_table.aliases is an array of Strings, hive_table.db refers to an instance of a type called hive_db and so on. - * Type references in attributes, (like hive_table.db) are particularly interesting. Note that using such an attribute, we can define arbitrary relationships between two types defined in Atlas and thus build rich models. Note that one can also collect a list of references as an attribute type (e.g. 
hive_table.cols which represents a list of references from hive_table to the hive_column type) + * Type references in attributes, (like hive_table.db) are particularly interesting. Note that using such an attribute, we can define arbitrary relationships between two types defined in Atlas and thus build rich models. Note that one can also collect a list of references as an attribute type (e.g. hive_table.columns which represents a list of references from hive_table to hive_column type) ---++ Entities - -An "entity" in Atlas is a specific value or instance of a Class "type" and thus represents a specific metadata object +An "entity" in Atlas is a specific value or instance of an Entity "type" and thus represents a specific metadata object in the real world. Referring back to our analogy of Object Oriented Programming languages, an "instance" is an "Object" of a certain "Class". An example of an entity will be a specific Hive Table. Say Hive has a table called "customers" in the "default" -database. This table will be an "entity" in Atlas of type hive_table. By virtue of being an instance of a class +database. This table will be an "entity" in Atlas of type hive_table. By virtue of being an instance of an entity type, it will have values for every attribute that are a part of the Hive table "type", such as: <verbatim> -id: "9ba387dd-fa76-429c-b791-ffc338d3c91f" -typeName: "hive_table" +guid: "9ba387dd-fa76-429c-b791-ffc338d3c91f" +typeName: "hive_table" +status: "ACTIVE" values: - name: "customers" - db: "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc" - owner: "admin" - createTime: "2016-06-20T06:13:28.000Z" - lastAccessTime: "2016-06-20T06:13:28.000Z" - comment: null - retention: 0 - sd: "ff58025f-6854-4195-9f75-3a3058dd8dcf" - partitionKeys: null - aliases: null - columns: ["65e2204f-6a23-4130-934a-9679af6a211f", "d726de70-faca-46fb-9c99-cf04f6b579a6", ...] 
- parameters: {"transient_lastDdlTime": "1466403208"} + name: "customers" + db: { "guid": "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc", "typeName": "hive_db" } + owner: "admin" + createTime: 1490761686029 + updateTime: 1516298102877 + comment: null + retention: 0 + sd: { "guid": "ff58025f-6854-4195-9f75-3a3058dd8dcf", "typeName": "hive_storagedesc" } + partitionKeys: null + aliases: null + columns: [ { "guid": "65e2204f-6a23-4130-934a-9679af6a211f", "typeName": "hive_column" }, { "guid": "d726de70-faca-46fb-9c99-cf04f6b579a6", "typeName": "hive_column" }, ...] + parameters: { "transient_lastDdlTime": "1466403208"} viewOriginalText: null viewExpandedText: null - tableType: "MANAGED_TABLE" - temporary: false -</verbatim> + tableType: "MANAGED_TABLE" + temporary: false</verbatim> The following points can be noted from the example above: - * Every entity that is an instance of a Class type is identified by a unique identifier, a GUID. This GUID is generated by the Atlas server when the object is defined, and remains constant for the entire lifetime of the entity. At any point in time, this particular entity can be accessed using its GUID. + * Every instance of an entity type is identified by a unique identifier, a GUID. This GUID is generated by the Atlas server when the object is defined, and remains constant for the entire lifetime of the entity. At any point in time, this particular entity can be accessed using its GUID. * In this example, the "customers" table in the default database is uniquely identified by the GUID "9ba387dd-fa76-429c-b791-ffc338d3c91f" * An entity is of a given type, and the name of the type is provided with the entity definition. * In this example, the "customers" table is a "hive_table". * The values of this entity are a map of all the attribute names and their values for attributes that are defined in the hive_table type definition. - * Attribute values will be according to the metatype of the attribute. 
- * Basic metatypes: integer, String, boolean values. E.g. "name" = "customers", "Temporary" = "false" - * Collection metatypes: An array or map of values of the contained metatype. E.g. parameters = { "transient_lastDdlTime": "1466403208"} - * Composite metatypes: For classes, the value will be an entity with which this particular entity will have a relationship. E.g. The hive table "customers" is present in a database called "default". The relationship between the table and database are captured via the "db" attribute. Hence, the value of the "db" attribute will be a GUID that uniquely identifies the hive_db entity called "default" - -With this idea on entities, we can now see the difference between Class and Struct metatypes. Classes and Structs -both compose attributes of other types. However, entities of Class types have the Id attribute (with a GUID value) a -nd can be referenced from other entities (like a hive_db entity is referenced from a hive_table entity). Instances of -Struct types do not have an identity of their own. The value of a Struct type is a collection of attributes that are + * Attribute values will be according to the datatype of the attribute. Entity-type attributes will have value of type AtlasObjectId + +With this idea on entities, we can now see the difference between Entity and Struct metatypes. Entities and Structs +both compose attributes of other types. However, instances of Entity types have an identity (with a GUID value) and can +be referenced from other entities (like a hive_db entity is referenced from a hive_table entity). Instances of Struct +types do not have an identity of their own. The value of a Struct type is a collection of attributes that are "embedded" inside the entity itself. ---++ Attributes - -We already saw that attributes are defined inside composite metatypes like Class and Struct. But we simplistically -referred to attributes as having a name and a metatype value. 
However, attributes in Atlas have some more properties -that define more concepts related to the type system. +We already saw that attributes are defined inside metatypes like Entity, Struct, Classification and Relationship. But we +simplistically referred to attributes as having a name and a metatype value. However, attributes in Atlas have some more +properties that define more concepts related to the type system. An attribute has the following properties: <verbatim> - name: string, - dataTypeName: string, - isComposite: boolean, + name: string, + typeName: string, + isOptional: boolean, isIndexable: boolean, - isUnique: boolean, - multiplicity: enum, - reverseAttributeName: string -</verbatim> + isUnique: boolean, + cardinality: enum</verbatim> The properties above have the following meanings: @@ -132,7 +122,7 @@ The properties above have the following meanings: * isIndexable - * This flag indicates whether this property should be indexed on, so that look ups can be performed using the attribute value as a predicate and can be performed efficiently. * isUnique - - * This flag is again related to indexing. If specified to be unique, it means that a special index is created for this attribute in Titan that allows for equality based look ups. + * This flag is again related to indexing. If specified to be unique, it means that a special index is created for this attribute in JanusGraph that allows for equality based look ups. * Any attribute with a true value for this flag is treated like a primary key to distinguish this entity from other entities. Hence care should be taken to ensure that this attribute does model a unique property in real world. * For e.g. consider the name attribute of a hive_table. In isolation, a name is not a unique attribute for a hive_table, because tables with the same name can exist in multiple databases. Even a pair of (database name, table name) is not unique if Atlas is storing metadata of hive tables amongst multiple clusters. 
Only a cluster location, database name and table name can be deemed unique in the physical world. * multiplicity - indicates whether this attribute is required, optional, or could be multi-valued. If an entity's definition of the attribute value does not match the multiplicity declaration in the type definition, this would be a constraint violation and the entity addition will fail. This field can therefore be used to define some constraints on the metadata information. @@ -142,59 +132,55 @@ Let us look at the attribute called "db" which represents the database to wh <verbatim> db: - "dataTypeName": "hive_db", - "isComposite": false, + "name": "db", + "typeName": "hive_db", + "isOptional": false, "isIndexable": true, - "isUnique": false, - "multiplicity": "required", - "name": "db", - "reverseAttributeName": null -</verbatim> + "isUnique": false, + "cardinality": "SINGLE"</verbatim> -Note the "required" constraint on multiplicity. A table entity cannot be sent without a db reference. +Note the "isOptional=false" constraint - a table entity cannot be created without a db reference. <verbatim> columns: - "dataTypeName": "array<hive_column>", - "isComposite": true, + "name": "columns", + "typeName": "array<hive_column>", + "isOptional": true, "isIndexable": true, - "isUnique": false, - "multiplicity": "optional", - "name": "columns", - "reverseAttributeName": null -</verbatim> + "isUnique": false, + "constraints": [ { "type": "ownedRef" } ]</verbatim> -Note the "isComposite" true value for columns. By doing this, we are indicating that the defined column entities should +Note the "ownedRef" constraint for columns. By doing this, we are indicating that the defined column entities should always be bound to the table entity they are defined with. From this description and examples, you will be able to realize that attribute definitions can be used to influence specific modelling behavior (constraints, indexing, etc) to be enforced by the Atlas system. 
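The attribute properties discussed in this hunk can be sketched in a few lines of Python. This is only an illustration of the data shape, not Atlas client code: the helper names `attribute_def` and `missing_required` are hypothetical, and the required-attribute check is a simplified stand-in for whatever validation the Atlas server actually performs. The field names (name, typeName, isOptional, isIndexable, isUnique, cardinality, constraints) are the ones shown in the examples above.

```python
import json


def attribute_def(name, type_name, is_optional=True, is_indexable=False,
                  is_unique=False, cardinality=None, constraints=None):
    """Build a dict holding the attribute properties described above.

    Hypothetical helper for illustration; it is not part of any Atlas client.
    """
    attr = {
        "name": name,
        "typeName": type_name,
        "isOptional": is_optional,
        "isIndexable": is_indexable,
        "isUnique": is_unique,
    }
    # Only emit the fields that the caller actually specified.
    if cardinality is not None:
        attr["cardinality"] = cardinality
    if constraints:
        attr["constraints"] = constraints
    return attr


def missing_required(attribute_defs, values):
    """Return the names of non-optional attributes that have no value."""
    return [a["name"] for a in attribute_defs
            if not a["isOptional"] and values.get(a["name"]) is None]


# The two attributes discussed above, with the flags shown in the examples.
db_attr = attribute_def("db", "hive_db", is_optional=False,
                        is_indexable=True, cardinality="SINGLE")
columns_attr = attribute_def("columns", "array<hive_column>",
                             is_indexable=True,
                             constraints=[{"type": "ownedRef"}])

# Attribute values for a table entity that lacks a db reference.
entity_values = {"name": "customers", "columns": None}
print(missing_required([db_attr, columns_attr], entity_values))
print(json.dumps(db_attr, indent=2))
```

Because "db" is declared with isOptional set to false, the check reports it as missing, while the optional "columns" attribute may be left empty.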
---++ System specific types and their significance - -Atlas comes with a few pre-defined system types. We saw one example (DataSet) in the preceding sections. In this -section we will see all these types and understand their significance. +Atlas comes with a few pre-defined system types. We saw one example (DataSet) in preceding sections. In this +section we will see more of these types and understand their significance. *Referenceable*: This type represents all entities that can be searched for using a unique attribute called qualifiedName. -*Asset*: This type contains attributes like name, description and owner. Name is a required attribute -(multiplicity = required), the others are optional. The purpose of Referenceable and Asset is to provide modellers -with way to enforce consistency when defining and querying entities of their own types. Having these fixed set of -attributes allows applications and User interfaces to make convention based assumptions about what attributes they can -expect of types by default. - -*Infrastructure*: This type extends Referenceable and Asset and typically can be used to be a common super type for -infrastructural metadata objects like clusters, hosts etc. - -*!DataSet*: This type extends Referenceable and Asset. Conceptually, it can be used to represent an type that stores -data. In Atlas, hive tables, Sqoop RDBMS tables etc are all types that extend from !DataSet. Types that extend !DataSet -can be expected to have a Schema in the sense that they would have an attribute that defines attributes of that dataset. -For e.g. the columns attribute in a hive_table. Also entities of types that extend !DataSet participate in data -transformation and this transformation can be captured by Atlas via lineage (or provenance) graphs. - -*Process*: This type extends Referenceable and Asset. Conceptually, it can be used to represent any data transformation -operation. 
For example, an ETL process that transforms a hive table with raw data to another hive table that stores -some aggregate can be a specific type that extends the Process type. A Process type has two specific attributes, -inputs and outputs. Both inputs and outputs are arrays of !DataSet entities. Thus an instance of a Process type can -use these inputs and outputs to capture how the lineage of a !DataSet evolves. \ No newline at end of file +*Asset*: This type extends Referenceable and adds attributes like name, description and owner. Name is a required +attribute (isOptional=false), the others are optional. + +The purpose of Referenceable and Asset is to provide modellers with a way to enforce consistency when defining and +querying entities of their own types. Having this fixed set of attributes allows applications and user interfaces to +make convention based assumptions about what attributes they can expect of types by default. + +*Infrastructure*: This type extends Asset and typically can be used to be a common super type for infrastructural +metadata objects like clusters, hosts etc. + +*!DataSet*: This type extends Referenceable. Conceptually, it can be used to represent a type that stores data. In Atlas, +hive tables, hbase_tables etc are all types that extend from !DataSet. Types that extend !DataSet can be expected to have +a Schema in the sense that they would have an attribute that defines attributes of that dataset. For e.g. the columns +attribute in a hive_table. Also entities of types that extend !DataSet participate in data transformation and this +transformation can be captured by Atlas via lineage (or provenance) graphs. + +*Process*: This type extends Asset. Conceptually, it can be used to represent any data transformation operation. For +example, an ETL process that transforms a hive table with raw data to another hive table that stores some aggregate can +be a specific type that extends the Process type. 
A Process type has two specific attributes, inputs and outputs. Both +inputs and outputs are arrays of !DataSet entities. Thus an instance of a Process type can use these inputs and outputs +to capture how the lineage of a !DataSet evolves. http://git-wip-us.apache.org/repos/asf/atlas/blob/5c2f7a0c/docs/src/site/twiki/index.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/index.twiki b/docs/src/site/twiki/index.twiki index a8e7de9..198ac52 100755 --- a/docs/src/site/twiki/index.twiki +++ b/docs/src/site/twiki/index.twiki @@ -7,39 +7,49 @@ Atlas is a scalable and extensible set of core foundational governance services enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem. +Apache Atlas provides open metadata management and governance capabilities for organizations +to build a catalog of their data assets, classify and govern these assets and provide collaboration +capabilities around these data assets for data scientists, analysts and the data governance team. 
+ ---++ Features ----+++ Data Classification - * Import or define taxonomy business-oriented annotations for data - * Define, annotate, and automate capture of relationships between data sets and underlying elements including source, target, and derivation processes - * Export metadata to third-party systems +---+++ Metadata types & instances + * Pre-defined types for various Hadoop and non-Hadoop metadata + * Ability to define new types for the metadata to be managed + * Types can have primitive attributes, complex attributes, object references; can inherit from other types + * Instances of types, called entities, capture metadata object details and their relationships + * REST APIs to work with types and instances allow easier integration + +---+++ Classification + * Ability to dynamically create classifications - like PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE + * Classifications can include attributes - like expiry_date attribute in EXPIRES_ON classification + * Entities can be associated with multiple classifications, enabling easier discovery and security enforcement ----+++ Centralized Auditing - * Capture security access information for every application, process, and interaction with data - * Capture the operational information for execution, steps, and activities +---+++ Lineage + * Intuitive UI to view lineage of data as it moves through various processes + * REST APIs to access and update lineage ----+++ Search & Lineage (Browse) - * Pre-defined navigation paths to explore the data classification and audit information - * Text-based search features locates relevant data and audit event across Data Lake quickly and accurately - * Browse visualization of data set lineage allowing users to drill-down into operational, security, and provenance related information +---+++ Search/Discovery + * Intuitive UI to search entities by type, classification, attribute value or free-text + * Rich REST APIs to search by complex criteria + * SQL like query language to 
search entities - Domain Specific Language (DSL) ----+++ Security & Policy Engine - * Rationalize compliance policy at runtime based on data classification schemes, attributes and roles. - * Advanced definition of policies for preventing data derivation based on classification (i.e. re-identification) - Prohibitions - * Column and Row level masking based on cell values and attibutes. +---+++ Security & Data Masking + * Integration with Apache Ranger enables authorization/data-masking based on classifications associated with entities in Apache Atlas. For example: + * who can access data classified as PII, SENSITIVE + * customer-service users can only see last 4 digits of columns classified as NATIONAL_ID ---++ Getting Started - * [[InstallationSteps][Install Steps]] - * [[QuickStart][Quick Start Guide]] + * [[InstallationSteps][Build & Install]] + * [[QuickStart][Quick Start]] ---++ Documentation * [[Architecture][High Level Architecture]] * [[TypeSystem][Type System]] - * [[Repository][Metadata Repository]] * [[Search][Search]] * [[security][Security]] * [[Authentication-Authorization][Authentication and Authorization]] http://git-wip-us.apache.org/repos/asf/atlas/blob/5c2f7a0c/docs/src/site/twiki/security.twiki ---------------------------------------------------------------------- diff --git a/docs/src/site/twiki/security.twiki b/docs/src/site/twiki/security.twiki index 1dcee2a..84b9425 100755 --- a/docs/src/site/twiki/security.twiki +++ b/docs/src/site/twiki/security.twiki @@ -43,7 +43,7 @@ The properties for configuring service authentication are: * <code>atlas.authentication.keytab</code> - the path to the keytab file. * <code>atlas.authentication.principal</code> - the principal to use for authenticating to the KDC. The principal is generally of the form "user/host@realm". You may use the '_HOST' token for the hostname and the local hostname will be substituted in by the runtime (e.g. "Atlas/[email protected]"). 
-Note that when Atlas is configured with HBase as the storage backend in a secure cluster, the graph db (titan) needs sufficient user permissions to be able to create and access an HBase table. To grant the appropriate permissions see [[Configuration][Graph persistence engine - Hbase]]. +Note that when Atlas is configured with HBase as the storage backend in a secure cluster, the graph db (JanusGraph) needs sufficient user permissions to be able to create and access an HBase table. To grant the appropriate permissions see [[Configuration][Graph persistence engine - Hbase]]. ---+++ JAAS configuration
