-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57649/
-----------------------------------------------------------
(Updated March 16, 2017, 10:37 p.m.)
Review request for atlas and Madhan Neethiraj.
Changes
-------
Contents:
- Fixed export progress reporting.
- Addressed review comments.
Bugs: ATLAS-1503
https://issues.apache.org/jira/browse/ATLAS-1503
Repository: atlas
Description
-------
**Background**
==============
The existing Export API implementation adds one *.json* file per entity when
generating the ZIP file. This makes ZIP file creation inefficient: the ZIP files
are 75% larger than they would be with fewer *.json* file entries.
**Solution**
============
The implementation uses the new v2 API *AtlasEntityWithExtInfo* representation
instead of *AtlasEntity*. This format combines an entity with related entities
as one. E.g. *hive_table* will contain all the *hive_columns* that it is made
up of. (See example section below.)
This significantly reduces the number of generated *JSON* files, which in turn
reduces the size of the generated *ZIP* file.
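To illustrate why combining an entity with its referred entities cuts the number of file entries, here is a minimal sketch. The classes `Entity` and `EntityWithExtInfo` are simplified, hypothetical stand-ins written for this example; they are not the actual Atlas model classes, and the guids are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EntityWithExtInfoSketch {
    // Hypothetical stand-in for an entity: only a type name and a guid.
    static class Entity {
        final String typeName;
        final String guid;

        Entity(String typeName, String guid) {
            this.typeName = typeName;
            this.guid = guid;
        }
    }

    // Hypothetical stand-in for the combined representation:
    // one root entity plus the entities it refers to, keyed by guid.
    static class EntityWithExtInfo {
        final Entity entity;
        final Map<String, Entity> referredEntities = new LinkedHashMap<>();

        EntityWithExtInfo(Entity entity) {
            this.entity = entity;
        }

        void addReferredEntity(Entity referred) {
            referredEntities.put(referred.guid, referred);
        }
    }

    // Builds a table together with its columns as a single object,
    // which would be serialized as a single JSON entry.
    static EntityWithExtInfo tableWithColumns(String tableGuid, String... columnGuids) {
        EntityWithExtInfo table = new EntityWithExtInfo(new Entity("hive_table", tableGuid));
        for (String guid : columnGuids) {
            table.addReferredEntity(new Entity("hive_column", guid));
        }
        return table;
    }

    public static void main(String[] args) {
        EntityWithExtInfo table = tableWithColumns("t-1", "c-1", "c-2", "c-3");
        // One entry now covers the table and its three columns, instead of four entries.
        System.out.println(table.entity.typeName + " carries "
                + table.referredEntities.size() + " referred entities");
    }
}
```

With one entry per entity, the table above would need four *.json* files; in the combined form it needs one.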
**Implementation Details**
==========================
*Export API*
- Modified the *Gremlin* query used to fetch connected entities so that it
returns each *guid* with a *boolean* indicating whether the entity is a process.
- _ExportService_ Modified implementation to fetch *AtlasEntityWithExtInfo*
instead of *AtlasEntity*. Modified bookkeeping to save *process* (lineage)
entities after all non-process entities are saved.
- _ZipSink_ Minor modification to serialize *AtlasEntityWithExtInfo*.
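The ordering described for _ExportService_ can be sketched as follows. This is an illustration only, assuming the query result is available as a guid-to-flag map; the class and method names are hypothetical and do not come from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExportOrderSketch {
    // Input: guid -> isProcess flag (assumed shape of the query result).
    // Output: guids in save order, with process (lineage) entities deferred
    // until every non-process entity has been listed.
    static List<String> saveOrder(Map<String, Boolean> guidToIsProcess) {
        List<String> ordered = new ArrayList<>();
        List<String> deferredProcesses = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : guidToIsProcess.entrySet()) {
            if (Boolean.TRUE.equals(e.getValue())) {
                deferredProcesses.add(e.getKey());
            } else {
                ordered.add(e.getKey());
            }
        }
        ordered.addAll(deferredProcesses);
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Boolean> fetched = new LinkedHashMap<>();
        fetched.put("process-1", true);
        fetched.put("table-1", false);
        fetched.put("table-2", false);
        System.out.println(saveOrder(fetched)); // [table-1, table-2, process-1]
    }
}
```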
*Import API*
- _ZipSource_ Modified to source *AtlasEntityWithExtInfo*.
- _EntityImportStream_ Modified to source *AtlasEntityWithExtInfo*.
- _AtlasEntityStreamForImport.getGuid_ Modified to look up requested entities
first in the stored *AtlasEntityWithExtInfo* object, and to request them from
the stream only if they are not found there.
- _AtlasEntityStoreV1.bulkImport_ Minor modification to use the updated
stream.
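The lookup order described for _AtlasEntityStreamForImport.getGuid_ amounts to a local-first lookup with a fallback. A minimal sketch, assuming entities can be represented as plain strings keyed by guid; `GuidLookupSketch` and its members are hypothetical names for this example, not the patch's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class GuidLookupSketch {
    // Entities already held by the current combined object, keyed by guid.
    private final Map<String, String> storedEntities;
    // Fallback that asks the underlying stream for a guid (assumed interface).
    private final Function<String, String> streamLookup;
    private int streamRequests = 0;

    GuidLookupSketch(Map<String, String> storedEntities, Function<String, String> streamLookup) {
        this.storedEntities = storedEntities;
        this.streamLookup = streamLookup;
    }

    // Checks the stored object first; goes to the stream only on a miss.
    String getByGuid(String guid) {
        String stored = storedEntities.get(guid);
        if (stored != null) {
            return stored;
        }
        streamRequests++;
        return streamLookup.apply(guid);
    }

    int getStreamRequests() {
        return streamRequests;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("c-1", "hive_column c-1");
        GuidLookupSketch lookup = new GuidLookupSketch(stored, guid -> "from stream: " + guid);

        System.out.println(lookup.getByGuid("c-1"));  // served from the stored object
        System.out.println(lookup.getByGuid("db-1")); // falls back to the stream
        System.out.println("stream requests: " + lookup.getStreamRequests());
    }
}
```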
**Functional Areas Impacted**
=============================
*Export*
- Full
- Connected
- HDFS path-based export.
*Import*
- Regular flow.
**Examples**
============
Case of *hive_db*: Within the GraphDB, the database has inward edges from the
objects that refer to it, which are tables in this case. So the
*AtlasEntityWithExtInfo* for a database will not have any referred entities.
Case of *hive_table*: Within the GraphDB, the table has outward edges pointing
to the columns it is made up of. It also has edges pointing to its database and
storage descriptor. Hence, the *AtlasEntityWithExtInfo* for a table will have
the full representation of all its columns, plus references to the database and
storage descriptor.
**Metrics**
===========
Date | File Size | No. of Entities | Export Duration | Import Duration
-----|-----------|-----------------|-----------------|----------------
3/02 | 180 MB    | 202930          | 22 mins         | 1:38 hrs
3/08 | 7 KB      | 3               | 5 secs          | 7 secs
-----|-----------|-----------------|-----------------|----------------
Improvement
-----|-----------|-----------------|-----------------|----------------
3/14 | 38 MB     | 202930          | 32 mins         | 1:40 hrs
3/14 | 5 KB      | 3               | 5 secs          | 7 secs
**Summary**
===========
With these changes, the file size is reduced by ~65%.
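How the summary figure was aggregated is not stated. As a per-row cross-check, the relative reduction for each dataset in the Metrics table can be computed as (before - after) / before. The sketch below uses only the file sizes listed in the table; the class and method names are illustrative.

```java
import java.util.Locale;

public class SizeReductionSketch {
    // Relative size reduction in percent: (before - after) / before * 100.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // 202930 entities: 180 MB before, 38 MB after.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", reductionPercent(180, 38))); // 78.9%
        // 3 entities: 7 KB before, 5 KB after.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", reductionPercent(7, 5)));    // 28.6%
    }
}
```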
Diffs (updated)
-----
intg/src/main/java/org/apache/atlas/model/impexp/AtlasExportResult.java e6a967e
intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java 4e3895d
repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityGraphDiscoveryV1.java 6c88510
repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStoreV1.java cce3fca
repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStream.java 5d9a7d4
repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java 8cb36ac
repository/src/main/java/org/apache/atlas/repository/store/graph/v1/EntityStream.java 4c43921
repository/src/main/java/org/apache/atlas/repository/store/graph/v1/InMemoryMapEntityStream.java 241f6d0
repository/src/main/java/org/apache/atlas/util/AtlasGremlin2QueryProvider.java 4743b73
webapp/src/main/java/org/apache/atlas/web/resources/AdminResource.java 31a4cf9
webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java c1891e0
webapp/src/main/java/org/apache/atlas/web/resources/ZipSink.java 2e4cb01
webapp/src/main/java/org/apache/atlas/web/resources/ZipSource.java a69f7fa
Diff: https://reviews.apache.org/r/57649/diff/3/
Changes: https://reviews.apache.org/r/57649/diff/2-3/
Testing
-------
Test data:
- QuickStart_v1: 3 databases.
- A *hive_db* with 922 tables.
- Stocks *hive_db* with 1 database, table, process and 5 columns.
- A *hive_db* with 522K entities.
The changes impact all the flows in the Export & Import APIs.
Unit testing: Manual.
Integration testing: Manual.
Accuracy testing: Manual. Verified using Export -> Import -> Export -> file
compare.
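The Export -> Import -> Export file compare was done manually. One way it could be automated is sketched below, comparing two ZIP archives entry by entry with `java.util.zip`. This is not part of the patch; the class name and the in-memory demo data are assumptions, and a raw content comparison only works if the export output is deterministic (for example, no run-specific timestamps inside the entries).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipCompareSketch {
    // Reads every entry of a ZIP archive into a name -> content map, sorted by name.
    static Map<String, String> readEntries(byte[] zipBytes) {
        Map<String, String> entries = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = in.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                entries.put(entry.getName(), new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    // Two archives match if they hold the same entry names with the same contents,
    // regardless of the order in which the entries were written.
    static boolean sameContents(byte[] zipA, byte[] zipB) {
        return readEntries(zipA).equals(readEntries(zipB));
    }

    // Demo helper: builds a small in-memory ZIP from (name, content) pairs.
    static byte[] zipOf(String... nameThenContent) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (int i = 0; i + 1 < nameThenContent.length; i += 2) {
                out.putNextEntry(new ZipEntry(nameThenContent[i]));
                out.write(nameThenContent[i + 1].getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] firstExport = zipOf("a.json", "{\"guid\":\"a\"}", "b.json", "{\"guid\":\"b\"}");
        byte[] secondExport = zipOf("b.json", "{\"guid\":\"b\"}", "a.json", "{\"guid\":\"a\"}");
        System.out.println("exports match: " + sameContents(firstExport, secondExport));
    }
}
```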
Thanks,
Ashutosh Mestry