-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57649/#review169083
-----------------------------------------------------------

repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java
Lines 31 (patched)
<https://reviews.apache.org/r/57649/#comment241400>

    Instead of adding nextWithExtInfo(), consider updating AtlasEntityStream.next() to return AtlasEntityWithExtInfo, as shown below:

    class AtlasEntityStream {
      public AtlasEntityWithExtInfo getNext() {
        return iterator.hasNext() ? new AtlasEntityWithExtInfo(iterator.next(), this.entitiesWithExtInfo) : null;
      }
    }

    With this change, the following methods can be removed:
      AtlasEntityStreamForImport.nextWithExtInfo()
      AtlasEntityStreamForImport.getByGuid()
      EntityImportStream.nextWithExtInfo()


webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Line 224 (original), 215 (patched)
<https://reviews.apache.org/r/57649/#comment241402>

    entityWithExtInfo.getReferredEntities() could return null. Please review all usages and handle this case.


webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Lines 294 (patched)
<https://reviews.apache.org/r/57649/#comment241403>

    Consider passing direction as a parameter to addToBeProcessed(guid, isLineage, direction) and removing lines #293, #297 and #343. addToBeProcessed() can update the direction when isLineage=true, if needed.


webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Line 398 (original), 395 (patched)
<https://reviews.apache.org/r/57649/#comment241404>

    "Object" ==> "Boolean"?


webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Lines 439 (patched)
<https://reviews.apache.org/r/57649/#comment241405>

    Consider renaming: "ListOptmizedForContains" ==> "UniqueList"


webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java
Lines 463 (patched)
<https://reviews.apache.org/r/57649/#comment241406>

    list.addAll() may end up adding duplicate items to the list.
    Consider iterating 's' and adding only the elements that are not already present in 'set'.


- Madhan Neethiraj


On March 15, 2017, 11:04 p.m., Ashutosh Mestry wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/57649/
> -----------------------------------------------------------
> 
> (Updated March 15, 2017, 11:04 p.m.)
> 
> 
> Review request for atlas and Madhan Neethiraj.
> 
> 
> Bugs: ATLAS-1503
>     https://issues.apache.org/jira/browse/ATLAS-1503
> 
> 
> Repository: atlas
> 
> 
> Description
> -------
> 
> **Background**
> ==============
> The existing Export API implementation adds one *.json* file per entity when
> generating the ZIP file. This makes ZIP file creation inefficient. The ZIP
> files are 75% larger in size than what could be possible with fewer *.json*
> file entries.
> 
> **Solution**
> ============
> The implementation uses the new v2 API *AtlasEntityWithExtInfo*
> representation instead of *AtlasEntity*. This format combines an entity with
> its related entities as one. E.g. *hive_table* will contain all the
> *hive_columns* that it is made up of. (See example section below.)
> 
> This significantly reduces the number of generated *JSON* files, which in
> turn reduces the size of the generated *ZIP* file.
> 
> **Implementation Details**
> ==========================
> *Export API*
> - Modified the *Gremlin* query used to fetch connected entities to return
> *guid* with a *boolean* that indicates whether the entity is a process.
> - _ExportService_ Modified implementation to fetch *AtlasEntityWithExtInfo*
> instead of *AtlasEntity*. Modified bookkeeping to save *process* (lineage)
> entities after all non-process entities are saved.
> - _ZipSink_ Minor modification to serialize *AtlasEntityWithExtInfo*.
> 
> *Import API*
> - _ZipSource_ Modified to source *AtlasEntityWithExtInfo*.
> - _EntityImportStream_ Modified to source *AtlasEntityWithExtInfo*.
> - _AtlasEntityStreamForImport.getGuid_ Modified to source requested entities
> first from the stored *AtlasEntityWithExtInfo* object. Requests from the
> stream only if not found.
> - _AtlasEntityStoreV1.bulkImport_ Minor modification to use the new changes
> to stream.
> 
> 
> **Functional Areas Impacted**
> =============================
> *Export*
> - Full
> - Connected
> - HDFS path-based import.
> 
> *Import*
> - Regular flow.
> 
> **Examples**
> ============
> Case of *hive_db*: Within the GraphDB the database has inward edges from the
> objects that refer to it (tables, in this case). So *AtlasEntityWithExtInfo*
> for the database will not have any referred entities.
> 
> Case of *hive_table*: Within the GraphDB the table has outward edges pointing
> to the columns it is made up of. It also has edges pointing to the database
> and storage descriptor. Hence, the *AtlasEntityWithExtInfo* for the table
> will have the full representation of all the columns and references to the
> database and storage descriptor.
> 
> **Metrics**
> ===========
> 
> Date | File Size | No. of Entities | Export Duration |
> -----|-----------|-----------------|-----------------|
> 3/02 | 180 MB    | 202930          | 22 mins         |
> 3/08 | 7 KB      | 3               | 5 secs          |
> -----------------------------------------------------|
> Improvement                                          |
> -----------------------------------------------------|
> 3/14 | 38 MB     | 202930          | 19 mins         |
> 3/14 | 5 KB      | 3               | 5 secs          |
> 
> 
> **Summary**
> ===========
> With these changes the file size reduction is: ~65%.
> 
> 
> Diffs
> -----
> 
>   intg/src/main/java/org/apache/atlas/model/impexp/AtlasExportResult.java e6a967e 
>   intg/src/main/java/org/apache/atlas/model/instance/AtlasEntity.java 4e3895d 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStoreV1.java cce3fca 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStream.java 5d9a7d4 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/AtlasEntityStreamForImport.java 8cb36ac 
>   repository/src/main/java/org/apache/atlas/repository/store/graph/v1/EntityImportStream.java 73994b9 
>   repository/src/main/java/org/apache/atlas/util/AtlasGremlin2QueryProvider.java 4743b73 
>   webapp/src/main/java/org/apache/atlas/web/resources/AdminResource.java 31a4cf9 
>   webapp/src/main/java/org/apache/atlas/web/resources/ExportService.java c1891e0 
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSink.java 2e4cb01 
>   webapp/src/main/java/org/apache/atlas/web/resources/ZipSource.java a69f7fa 
> 
> 
> Diff: https://reviews.apache.org/r/57649/diff/1/
> 
> 
> Testing
> -------
> 
> Test data:
> - QuickStart_v1: 3 databases.
> - A *hive_db* with 922 tables.
> - Stocks *hive_db* with 1 database, table, process and 5 columns.
> - A *hive_db* with 522K entities.
> 
> The changes impact all the flows in the Export & Import APIs.
> Unit testing: Manual.
> Integration testing: Manual.
> Accuracy testing: Manual. Verified using Export -> Import -> Export -> file
> compare.
> 
> 
> Thanks,
> 
> Ashutosh Mestry
> 
>
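
For the last two ExportService.java comments (the rename to "UniqueList" and the duplicate-adding addAll()), a minimal sketch of what a duplicate-free list could look like. This is an illustration only, not the class from the patch: the class body, field names and the add/addAll/contains/size/get methods are all assumptions, and only the 'list' plus 'set' pairing is taken from the comment.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "UniqueList" idea; not the patch's actual class.
class UniqueList<T> {
    private final List<T> list = new ArrayList<>(); // keeps insertion order
    private final Set<T>  set  = new HashSet<>();   // fast contains() checks

    public void add(T e) {
        // Set.add() returns false when the element is already present,
        // so the list only ever receives first occurrences.
        if (set.add(e)) {
            list.add(e);
        }
    }

    public void addAll(Collection<? extends T> s) {
        // Iterate 's' instead of calling list.addAll(s), which would copy
        // duplicates into the list even though the set rejects them.
        for (T e : s) {
            add(e);
        }
    }

    public boolean contains(T e) {
        return set.contains(e);
    }

    public int size() {
        return list.size();
    }

    public T get(int index) {
        return list.get(index);
    }
}
```

Routing addAll() through add() keeps the list and the set in step, at the cost of one hash lookup per element.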
