[ https://issues.apache.org/jira/browse/ATLAS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saad updated ATLAS-4389:
------------------------
    Description: 
Would you be so kind as to let us know if there is a best practice or a recommended way to bring in a large number of entities on a regular basis?

*Our use case:*

We will be bringing in around 12,000 datasets, 12,000 jobs and 70,000 columns. We want to do this as part of our deployment pipeline for other upstream projects.

At every deploy we want to do the following (a rough sketch follows this list):
 - Add the jobs, datasets and columns that are not yet in Atlas
 - Update the jobs, datasets and columns that are already in Atlas
 - Delete from Atlas the jobs that have been deleted from the upstream systems.
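For context, here is a rough sketch of the kind of sync we have in mind against the v2 REST API. The endpoint paths are the standard Atlas ones (POST /v2/entity/bulk for createOrUpdate, DELETE /v2/entity/uniqueAttribute/type/\{typeName\} for removals), but the base URL, credentials, type names and qualifiedNames below are placeholders rather than our real setup:

{code:python}
# Rough sketch only -- base URL, credentials and type names are placeholders.
import requests

ATLAS_BASE = "http://atlas-host:21000/api/atlas/v2"   # placeholder
AUTH = ("admin", "admin")                              # placeholder

def upsert(entities):
    """POST entities to the bulk endpoint; createOrUpdate matches existing
    entities on their unique attributes (qualifiedName), covering both the
    'add' and 'update' cases."""
    resp = requests.post(f"{ATLAS_BASE}/entity/bulk",
                         json={"entities": entities}, auth=AUTH)
    resp.raise_for_status()

def delete_removed(type_name, desired_qnames, existing_qnames):
    """Delete entities that no longer exist upstream, by unique attribute."""
    for qname in set(existing_qnames) - set(desired_qnames):
        resp = requests.delete(
            f"{ATLAS_BASE}/entity/uniqueAttribute/type/{type_name}",
            params={"attr:qualifiedName": qname}, auth=AUTH)
        resp.raise_for_status()
{code}

We are assuming here that createOrUpdate on the bulk endpoint gives us the add/update behaviour and that delete-by-unique-attribute handles the removals.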

So far we have considered using the bulk API endpoint (/v2/entity/bulk). This has its own issues: we found that if the payload is too big (in our case, more than 300-500 entities) the request times out. The deeper the relationships, the fewer entities you can send through the bulk endpoint in a single call.
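One workaround we can apply on our side is to split the payload into smaller batches, along these lines (the batch size, base URL and credentials are illustrative):

{code:python}
# Illustrative batching sketch -- values below are placeholders.
import requests

ATLAS_BASE = "http://atlas-host:21000/api/atlas/v2"   # placeholder
AUTH = ("admin", "admin")                              # placeholder
BATCH_SIZE = 200   # stays below the 300-500 entities where we start seeing timeouts

def post_in_batches(entities, batch_size=BATCH_SIZE):
    """Send the bulk payload in smaller chunks instead of one large request."""
    for i in range(0, len(entities), batch_size):
        resp = requests.post(f"{ATLAS_BASE}/entity/bulk",
                             json={"entities": entities[i:i + batch_size]},
                             auth=AUTH, timeout=300)
        resp.raise_for_status()
{code}

Batching avoids the timeouts, but it does not change the overall throughput, which is why we are asking about best practice below.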

From inspecting the code, we believe that ingesting via REST and streaming data through Kafka follow the same code path and ultimately yield the same performance.

Further, by profiling the JVM, we found that the type registry becomes the bottleneck when creating entities: only one core processes the entities and their relationships.

*Questions:*

1- What is the best practice for bulk loading a large number of entities in a reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k columns in less than 10 minutes.

2- Where should we start if we want to scale the API? Is there any known way to horizontally scale Atlas?

Here are some of the stats from the load testing we did:

!image-2021-08-05-11-23-05-440.png!

> Best practice or a way to bring in a large number of entities on a regular basis.
> -------------------------------------------------------------------------------
>
>                 Key: ATLAS-4389
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4389
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Saad
>            Priority: Major
>              Labels: documentation, newbie, performance
>         Attachments: image-2021-08-05-11-22-29-259.png, image-2021-08-05-11-23-05-440.png
>


