[ https://issues.apache.org/jira/browse/ATLAS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417764#comment-17417764 ]

Ashutosh Mestry edited comment on ATLAS-4389 at 9/20/21, 5:37 PM:
------------------------------------------------------------------

Sorry for the delay in replying.

Background: The existing implementation of ingest has linear complexity. This 
is done so that it can correctly handle the create/update/delete message types 
and the temporal ordering of these operations.

Here are a few approaches that I have tried and that have worked as solutions 
for some of our customers:

*Approach 1*

Prerequisite: Entity creation is under your control.

Solution:
 * Create entities in topologically sorted order: parent entities are created 
before child entities.
 * Create lineage entities only after all of their participating entities have 
been created.
 * Use REST APIs to concurrently create entities of a single type. Start a new 
type only after all entities of the current type are exhausted.

This has the advantage of allowing entities to be created concurrently, since 
their dependencies have already been created. This approach gives high 
throughput while continuing to maintain consistency of data.
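
Below is a minimal sketch of this flow, assuming Python with the requests 
library, an Atlas instance at http://localhost:21000 with basic auth, and the 
/v2/entity/bulk endpoint mentioned in this issue (here under the usual 
/api/atlas prefix). The type ordering, batch size, and worker count are 
illustrative, not prescriptive:

{code:python}
# Sketch only: topologically ordered, per-type concurrent entity creation.
# ATLAS_URL, AUTH, TYPE_ORDER, batch_size, and workers are assumptions.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

ATLAS_URL = "http://localhost:21000"
AUTH = ("admin", "admin")
BULK_ENDPOINT = ATLAS_URL + "/api/atlas/v2/entity/bulk"

# Parents before children; lineage (process) entities last.
TYPE_ORDER = ["hive_db", "hive_table", "hive_column", "hive_process"]

def create_batch(batch):
    """POST one batch of same-type entities to the bulk endpoint."""
    resp = requests.post(BULK_ENDPOINT, json={"entities": batch}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

def ingest(entities_by_type, batch_size=100, workers=25):
    for type_name in TYPE_ORDER:
        entities = entities_by_type.get(type_name, [])
        batches = [entities[i:i + batch_size]
                   for i in range(0, len(entities), batch_size)]
        # Entities of one type do not depend on each other here, so their
        # batches can run concurrently; advance to the next type only after
        # every batch of the current type has finished.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(create_batch, b) for b in batches]
            for f in as_completed(futures):
                f.result()  # surface any failure before moving on
{code}

The key design point is the barrier between types: concurrency is exploited 
within a type, while ordering across types preserves parent-before-child 
consistency.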

This needs some amount of bookkeeping. It may not be much if you are creating 
Hive entities and follow a consistent pattern for names for the 
_qualifiedName_ unique attribute.
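
For example, a common Hive convention for _qualifiedName_ is 
_db.table.column@clusterName_; the helper below is an illustrative sketch, not 
an Atlas API:

{code:python}
# Illustrative helper only: the function name and arguments are assumptions.
def hive_column_qualified_name(db, table, column, cluster):
    return f"{db}.{table}.{column}@{cluster}"

# Example: hive_column_qualified_name("sales", "orders", "amount", "cl1")
# -> "sales.orders.amount@cl1"
{code}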

In my test, I was able to run between 25 and 50 concurrent workers, all 
creating entities of a single type.

About code paths: ingest via the Kafka queue, entity creation via REST APIs, 
and ingest via the Import API all follow the same code path.

> Best practice or a way to bring in large number of entities on a regular 
> basis.
> -------------------------------------------------------------------------------
>
>                 Key: ATLAS-4389
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4389
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Saad
>            Assignee: Ashutosh Mestry
>            Priority: Major
>              Labels: documentation, newbie, performance
>         Attachments: image-2021-08-05-11-22-29-259.png, 
> image-2021-08-05-11-23-05-440.png
>
>
> Would you be so kind as to let us know if there is any best practice or way 
> to bring in a large number of entities on a regular basis.
> *Our use case:*
> We will be bringing in around 12,000 datasets, 12,000 jobs and 70,000 
> columns. We want to do this as part of our deployment pipeline for other 
> upstream projects.
> At every deploy we want to do the following:
>  - Add the jobs, datasets and columns that are not in Atlas
>  - Update the jobs, datasets and columns that are in Atlas
>  - Delete the jobs from Atlas that are deleted from the upstream systems.
> So far we have considered using the bulk API endpoint (/v2/entity/bulk). This 
> has its own issues. We found that if the payload is too big, in our case 
> bigger than 300-500 entities, it times out. The deeper the relationships, the 
> fewer entities you can send through the bulk endpoint.
> Inspecting some of the code, we feel that both REST and streaming data through 
> Kafka follow the same code path and ultimately yield the same performance.
> Further, we found that when creating entities the type registry becomes the 
> bottleneck. We discovered this by profiling the JVM. We found that only one 
> core processes the entities and their relationships.
> *Questions:*
> 1- What is the best practice for bulk loading lots of entities in a 
> reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k columns 
> in less than 10 minutes.
> 2- Where should we start if we want to scale the API? Is there any known way 
> to horizontally scale Atlas?
> Here are some of the stats from the load testing we did:
>  
> !image-2021-08-05-11-23-05-440.png!


