Saad created ATLAS-4389:
---------------------------
Summary: Best practice or a way to bring in a large number of
entities on a regular basis.
Key: ATLAS-4389
URL: https://issues.apache.org/jira/browse/ATLAS-4389
Project: Atlas
Issue Type: Bug
Components: atlas-core
Affects Versions: 2.1.0, 2.0.0
Reporter: Saad
Would you be so kind as to let us know whether there is any best practice or
recommended way to bring in a large number of entities on a regular basis?
*Our use case:*
We will be bringing in around 12,000 datasets, 12,000 jobs and 70,000 columns.
We want to do this as part of our deployment pipeline for other upstream
projects.
At every deploy we want to do the following:
- Add the jobs, datasets and columns that are not in Atlas
- Update the jobs, datasets and columns that are in Atlas
- Delete the jobs from Atlas that have been deleted from the upstream systems
(a sketch of this step is included below).
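As an illustration of the delete step, here is a minimal sketch of how we picture reconciling Atlas against the upstream state. It assumes an Atlas server at http://atlas-host:21000 with basic auth, and that the entity headers returned by basic search carry qualifiedName in their attributes; the host, credentials and batch sizes are placeholders, not our real values.

{code:python}
import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2"   # assumed endpoint
AUTH = ("admin", "admin")                             # assumed credentials


def fetch_existing_qualified_names(type_name, page_size=1000):
    """Page through Atlas basic search and collect guid -> qualifiedName
    for every active entity of the given type."""
    existing = {}
    offset = 0
    while True:
        body = {
            "typeName": type_name,
            "excludeDeletedEntities": True,
            "limit": page_size,
            "offset": offset,
        }
        resp = requests.post(f"{ATLAS_URL}/search/basic", json=body, auth=AUTH)
        resp.raise_for_status()
        headers = resp.json().get("entities") or []
        if not headers:
            break
        for h in headers:
            existing[h["guid"]] = h.get("attributes", {}).get("qualifiedName")
        offset += page_size
    return existing


def delete_stale_entities(type_name, desired_qualified_names):
    """Delete entities that exist in Atlas but no longer exist upstream."""
    existing = fetch_existing_qualified_names(type_name)
    stale_guids = [g for g, qn in existing.items()
                   if qn not in desired_qualified_names]
    # DELETE /entity/bulk accepts repeated guid query parameters.
    for i in range(0, len(stale_guids), 100):
        chunk = stale_guids[i:i + 100]
        resp = requests.delete(f"{ATLAS_URL}/entity/bulk",
                               params=[("guid", g) for g in chunk],
                               auth=AUTH)
        resp.raise_for_status()
{code}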
So far we have considered using the bulk API endpoint (/v2/entity/bulk). This
has its own issues: we found that if the payload is too big (in our case, more
than 300-500 entities), the request times out. The deeper the relationships,
the fewer entities we can send through the bulk endpoint.
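For concreteness, splitting the createOrUpdate payload into smaller batches to stay under that limit would look roughly like the sketch below. The endpoint is the v2 bulk API mentioned above; the host, credentials, batch size of 200 and the my_dataset type name are assumptions for illustration only.

{code:python}
import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2"   # assumed endpoint
AUTH = ("admin", "admin")                             # assumed credentials
BATCH_SIZE = 200                                      # below the 300-500 limit we hit


def create_or_update_in_batches(entities):
    """POST entities to /entity/bulk in small batches so each request
    finishes before the server-side timeout."""
    for i in range(0, len(entities), BATCH_SIZE):
        batch = {"entities": entities[i:i + BATCH_SIZE]}
        resp = requests.post(f"{ATLAS_URL}/entity/bulk", json=batch, auth=AUTH)
        resp.raise_for_status()


# Example payload shape: each entity carries a typeName and a unique qualifiedName.
dataset = {
    "typeName": "my_dataset",                          # hypothetical custom type
    "attributes": {
        "qualifiedName": "warehouse.orders@prod",      # hypothetical name
        "name": "orders",
    },
}
# create_or_update_in_batches([dataset, ...])
{code}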
Inspecting some of the code, we feel that both REST and streaming data through
Kafka follow the same code path and ultimately yield the same performance.
Further, we found that when creating entities the type registry becomes the
bottleneck. We discovered this by profiling the JVM: only one core processes
the entities and their relationships.
*Questions:*
1- What is the best practice for bulk loading a large number of entities in a
reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k columns
in less than 10 minutes.
2- Where should we start if we want to scale the API? Is there any known way to
horizontally scale Atlas?
Here are some of the stats from the load testing we did: