Re: Review Request 57649: Export API: ZIP File Size Optimization

2017-03-19 Thread Apoorv Naik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57649/#review169384
---



Uh oh, Jenkins tried to build the changes and failed. See [atlas-UTs 
#1](http://osboxes:8080/job/atlas-UTs/1/).

- Apoorv Naik


On March 17, 2017, 5:09 a.m., Ashutosh Mestry wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/57649/
> ---
> 
> (Updated March 17, 2017, 5:09 a.m.)
> 
> 
> Review request for atlas and Madhan Neethiraj.
> 
> 
> Bugs: ATLAS-1665
> https://issues.apache.org/jira/browse/ATLAS-1665
> 
> 
> Repository: atlas
> 
> 
> Description
> ---
> 
> **Background**
> ==
> The existing Export API implementation adds one *.json* file per entity when 
> generating the ZIP file. This makes ZIP file creation inefficient: the ZIP 
> files are 75% larger than they could be with fewer *.json* file entries.
> 
> **Solution**
> 
> The implementation uses the new v2 API *AtlasEntityWithExtInfo* 
> representation instead of *AtlasEntity*. This format combines an entity with 
> its related entities into one record. E.g. a *hive_table* will contain all 
> the *hive_columns* it is made up of. (See the Examples section below.)
> 
> This significantly reduces the number of generated *JSON* files, which in 
> turn reduces the size of the generated *ZIP* file.
> 
> **Implementation Details**
> ==
> *Export API*
> - Modified the *Gremlin* query used to fetch connected entities to return the 
> *guid* along with a *boolean* indicating whether the entity is a process.
> - _ExportService_ Modified the implementation to fetch *AtlasEntityWithExtInfo* 
> instead of *AtlasEntity*. Modified bookkeeping to save *process* (lineage) 
> entities after all non-process entities are saved.
> - _ZipSink_ Minor modification to serialize *AtlasEntityWithExtInfo*.
> 
> *Import API*
> - _ZipSource_ Modified to source *AtlasEntityWithExtInfo*.
> - _EntityImportStream_ Modified to source *AtlasEntityWithExtInfo*.
> - _AtlasEntityStreamForImport.getGuid_ Modified to source requested entities 
> first from the stored *AtlasEntityWithExtInfo* object, and to request them 
> from the stream only if not found (see the sketch after this list).
> - _AtlasEntityStoreV1.bulkImport_ Minor modification to use the new stream 
> changes.
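> 
> A minimal sketch of the lookup order described in the bullet above. The class 
> name (BundledEntityLookup) and wiring are illustrative assumptions, not the 
> actual ZipSource / AtlasEntityStreamForImport code; it assumes the v2 model's 
> getEntity(), getGuid() and getReferredEntity(guid) accessors:
> 
>     import org.apache.atlas.model.instance.AtlasEntity;
>     import org.apache.atlas.model.instance.AtlasEntity.AtlasEntityWithExtInfo;
> 
>     // Resolve a guid first against the AtlasEntityWithExtInfo read from the
>     // current ZIP entry; the caller falls back to the stream only on null.
>     class BundledEntityLookup {
>         private final AtlasEntityWithExtInfo current;
> 
>         BundledEntityLookup(AtlasEntityWithExtInfo current) {
>             this.current = current;
>         }
> 
>         AtlasEntity lookup(String guid) {
>             AtlasEntity primary = current.getEntity();
> 
>             if (primary != null && guid.equals(primary.getGuid())) {
>                 return primary;                     // the entry's main entity
>             }
> 
>             return current.getReferredEntity(guid); // a bundled referred entity, or null
>         }
>     }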
> 
> 
> **Functional Areas Impacted**
> =
> *Export*
> - Full
> - Connected
> - HDFS path-based export.
> 
> *Import*
> - Regular flow.
> 
> **Examples**
> 
> Case *hive_db*: Within the GraphDB, the database has inward edges from the 
> objects that refer to it (tables, in this case). So the *AtlasEntityWithExtInfo* 
> for the database will not have any referred entities.
> 
> Case *hive_table*: Within the GraphDB, the table has outward edges pointing 
> to the columns it is made up of. It also has edges pointing to its database and 
> storage descriptor. Hence, the *AtlasEntityWithExtInfo* for the table will have 
> the full representation of all the columns and references to the database and 
> storage descriptor.
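> 
> A rough illustration of the combined representation (a sketch only: the 
> attribute values are made up, and it assumes the v2 model's 
> AtlasEntityWithExtInfo(AtlasEntity) constructor and addReferredEntity helper; 
> the real payload is produced by _ZipSink_ during export):
> 
>     import org.apache.atlas.model.instance.AtlasEntity;
>     import org.apache.atlas.model.instance.AtlasEntity.AtlasEntityWithExtInfo;
> 
>     // A hive_table whose hive_column entities ride along as referred
>     // entities, instead of being written as separate .json ZIP entries.
>     public class CombinedEntityExample {
>         public static void main(String[] args) {
>             AtlasEntity table = new AtlasEntity("hive_table");
>             table.setAttribute("name", "customers");
> 
>             AtlasEntity idCol = new AtlasEntity("hive_column");
>             idCol.setAttribute("name", "customer_id");
> 
>             AtlasEntity nameCol = new AtlasEntity("hive_column");
>             nameCol.setAttribute("name", "customer_name");
> 
>             // One entry now carries the table and both of its columns.
>             AtlasEntityWithExtInfo combined = new AtlasEntityWithExtInfo(table);
>             combined.addReferredEntity(idCol);
>             combined.addReferredEntity(nameCol);
> 
>             System.out.println(combined);
>         }
>     }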
> 
> **Metrics**
> ===
> 
> Date | File Size | No. of Entities | Export Duration | Import Duration
> -----|-----------|-----------------|-----------------|----------------
> 3/02 | 180 MB    | 202930          | 22 mins         | 1:38 hrs
> 3/08 | 7 KB      | 3               | 5 secs          | 7 secs
> 
> After the improvement:
> 
> Date | File Size | No. of Entities | Export Duration | Import Duration
> -----|-----------|-----------------|-----------------|----------------
> 3/14 | 38 MB     | 202930          | 20 mins         | 1:10 hrs
> 3/14 | 5 KB      | 3               | 5 secs          | 7 secs
> 
> 
> **Summary**
> ===
> With these changes, the generated ZIP file size is reduced by ~65%.
> 
> 
> Diffs
> -
> 
>   intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java 4e3895d 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityGraphDiscoveryV1.java 6c88510 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStoreV1.java cce3fca 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStream.java 5d9a7d4 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java 8cb36ac 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/EntityStream.java 4c43921 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/InMemoryMapEntityStream.java 241f6d0 
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlin2QueryProvider.java 4743b73 
>   webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java e123ff7 
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSink.java 37d9eb5 
>   

Re: Review Request 57649: Export API: ZIP File Size Optimization

2017-03-19 Thread Apoorv Naik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57649/#review169383
---



Jenkins is going to check this review request...

- Apoorv Naik



Re: [jira] [Commented] (ATLAS-1410) V2 Glossary API

2017-03-19 Thread Russell Anderson


These points that Mandy raises need to be addressed.

Russ

Sent from my iPhone

> On Feb 19, 2017, at 6:37 AM, Mandy Chessell (JIRA) wrote:
>
> [ 
> https://issues.apache.org/jira/browse/ATLAS-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15873650#comment-15873650
>  ]

>
> Mandy Chessell commented on ATLAS-1410:
> ---
>
> Comments on V1.0
>
> - Page numbers would help to tie these comments to the document.
> - Page 2 - Asset type - defined in terms of itself.  How are they used? Or 
> is this not relevant to this paper?
> - Page 2 - Why do we need to know about V1 and V2?  I think it is because 
> the current interfaces work with V1 and the new one will work with V2 - it 
> would be helpful to state this explicitly.
> - Page 4 - bullets 4-5 - has-a and is-a relationships are semantic 
> relationships.
> - Page 4 - Missing from the list - the ability to associate a semantic 
> meaning with a classification (v2) / trait (v1)?
> - Page 4 - Missing from the list - a "typed-by" relationship to associate 
> terms that include meaning in context with terms that describe more pure 
> objects.  For example, Home Address is typed by Address.
> - Page 5 - Figure 1 - I am not comfortable with terms being owned by 
> categories.  I think each term should be owned by a glossary and linked 
> into 0, 1 or more categories as appropriate.  This creates a much simpler 
> deletion rule for the API/end user - particularly when you look at Figure 2, 
> where terms are owned by multiple categories. I.e., delete a term from its 
> glossary and it is deleted.  In the proposed design, it raises such 
> questions as "Is the term deleted when unlinked from all categories - or 
> the first category it is linked to?"
> - Page 6 - Figure 3 - I need more detail to understand the "classifies" 
> relationship and how it relates to a classification.  It seems redundant. 
> Would you not relate a term to a classification, which is in itself 
> semantically classified by its definition term?
> - Page 6 - Bullet 6) - What is the alternative to using Gremlin queries?
> - Page 6 - Bullet 7) - Is this an incomplete sentence, or is the paragraph 
> that follows supposed to be a nested bullet list?  Assuming it is a 
> follow-on point, my confusion is that I do not understand why the 
> term/category hierarchy is relevant to the enhancement of classifications. 
> The Classification object is defining the type of classification and its 
> meaning is coming from the term?  Is this suggesting that the relationships 
> between classifications are coming from the term relationships in the same 
> way we do this in IGC today?  If so, it may help to show an example.
> - Page 7 - Figures 4 and 5 - What is the difference between 
> "Classification" and "Classification Relationship"?
> - Page 7 - Maybe strange examples - the glossaries would be for different 
> subject areas - for example, there may be a marketing glossary, a customer 
> care glossary, a banking glossary.  These may be used for associating 
> meaning with data assets.  There may also be glossaries for different 
> regulations, or standard governance approaches, and these may include 
> terms that can be used to describe classifications for data that drive 
> operational governance.
> - Page 8 - I am not sure what the proposed enhancements are - it just 
> seems to list the problems with the current model.  All relationships in 
> metadata are bi-directional; it should be the default.  This mechanism 
> seems complicated.  We really need to define relationships independently of 
> entities so we can define attributes on these relationships.  The 
> Classification is actually an example of an independently defined 
> relationship that includes the GUIDs of the 2 entities it connects.  This 
> should be the common style of relationship.
> - Page 9 - On the discussion point - a taxonomy is a hierarchy of categories 
> that the terms are placed in - I thought this was included in the proposal, 
> and we do need this for organising terms so that people can find them - and 
> the category hierarchies (taxonomies) help to provide context to terms too. 
> Also, the semantic relationships discussed would mean we could support a 
> simple ontology.
> - Page 9 - Fully-qualified name - What about a grandparent or parent term? 
> What does a fully qualified name mean and when is it used?  The unique name 
> is its GUID.  Its path name (there may be many) is the navigation to the 
> term through the category hierarchies.
> - Page 9 - Why do Atlas terms need to follow the schema defined at 
> this link: 
> https://www.ibm.com/support/knowledgecenter/en/SSN364_8.8.0/com.ibm.ima.using/comp/vocab/terms_prop.html? 
> It seems to imply a lifecycle which is not included in this proposal, and 
> a very specific modelling of the IBM industry models that have mandatory 
> fields that are not always applicable to all glossaries.  I think this doc 
> should describe the schema of the glossary term explicitly and explain the 
> fields.
> - page 10 - Figure 7 shows the navigation relationships and 1 

[jira] [Commented] (ATLAS-1410) V2 Glossary API

2017-03-19 Thread Stefhan van Helvoirt (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931905#comment-15931905
 ] 

Stefhan van Helvoirt commented on ATLAS-1410:
-

Page 5:
Use case 1: "It is important that duplicate glossary terms names can be defined 
in the glossary", No. Within a single glossary, it should not be possible to 
have multiple terms with identical names. If the requirement arises to maintain 
duplicates then this should be done across multiple glossaries. This is in 
alignment with the idea of having 'departmental glossaries'.

Use case 2: There might be a need to have two different types of 
categorizations: one for scoping / context, and another for adding generic 
characteristics to the set of contained terms. In other glossaries this is 
sometimes referred to as a Parent Category and Referencing Categories. Each 
term has only one Parent Category but could be referenced by multiple other 
categories. These referencing categories have a scoping purpose, while the 
parent category could also have tags / characteristics that will be inherited 
by the contained terms. 

Page 6: Regarding the discussion point 'There does not appear to be a need for a 
Glossary Term to have a special “parent” category, as the Glossary owns the 
Glossary Term': if you want to manage a collection of terms in a similar way 
within a glossary, then some form of parent category or unique structuring 
method is needed. If there is no uniqueness then multiple groupings with 
different characteristics can collide.

Page 7: "Glossary Term names might not be unique in a Glossary." No, see 
earlier comments. 
"This is a name containing a term’s
inheritance and the Glossary it comes from." Only Glossary + Term name should 
be sufficient, no need to add parent terms in the fully qualified name. 

Unclear why there is a need to have different term types. From a business 
perspective there is only one type of term. The various types such as Concept 
and Attribute suggest something technical, which is not relevant from a term 
perspective as terms are written from a business view. Also, a term can be a 
concept in one context and an attribute in another; how is that handled with 
this setup? E.g. 'Email address' is an attribute of 'Customer' and a Concept in 
the structure 'Location' --> 'Address' --> 'Electronic address' --> 'Email 
address'.


Page 8 "A Classification points to one entity and can have many associated 
term." I don't think i fully understand this statement. It would be wiser to 
have the classification point to one or more terms and that the term will point 
to one or more entities. To be further discussed. Also it should be possible to 
have multiple classification pointing to the same object. 

Page 9 "The classification associated with the term should not be automatically 
cascaded by Atlas to the assigned assets." Agree that Atlas does not 
necessarily needs to do the cascading because logic might need to be involved. 
However, the result might need to be made available in Atlas and shown in a 
distinct way. If Atlas is seen as the single source of truth then it must be 
possible for a end user to see from solely Atlas that a classification is 
'Derived from'. How that derivation has occurred can happen by a different 
service. 

Stopped after page 11. Will continue to review remaining pages in the coming 
days. 

> V2 Glossary API
> ---
>
> Key: ATLAS-1410
> URL: https://issues.apache.org/jira/browse/ATLAS-1410
> Project: Atlas
>  Issue Type: Improvement
>Reporter: David Radley
>Assignee: David Radley
> Attachments: Atlas Glossary V2 proposal v1.0.pdf, Atlas Glossary V2 
> proposal v1.1.pdf
>
>
> The BaseResourceDefinition uses the AttributeDefinition class from typesystem. 
> There are newer, more functional versions of this capability in the atlas-intg 
> project. This Jira is changing over the glossary implementation to the newer 
> entity / type classes.  
> Instead of the instanceProperties and collectionProperties in the 
> BaseResourceDefinitions, we should use something in this sort of style:  
> "
>  AtlasEntityDef deptTypeDef =
> AtlasTypeUtil.createClassTypeDef(DEPARTMENT_TYPE, 
> "Department"+_description, ImmutableSet.of(),
> AtlasTypeUtil.createRequiredAttrDef("name", "string"),
> new AtlasAttributeDef("employees", 
> String.format("array<%s>", "Person"), true,
> AtlasAttributeDef.Cardinality.SINGLE, 0, 1, 
> false, false,
> 
> Collections.emptyList()));
> AtlasEntityDef personTypeDef = 
> AtlasTypeUtil.createClassTypeDef("Person", "Person"+_description, 
> ImmutableSet.of(),
> AtlasTypeUtil.createRequiredAttrDef("name", "string"),
>