Saad created ATLAS-4389:
---------------------------
Summary: Best practice or a way to bring in a large number of
entities on a regular basis.
Key: ATLAS-4389
URL: https://issues.apache.org/jira/browse/ATLAS-4389
Project: Atlas
Issue Type: Bug
Components: atlas-core
Affects Versions: 2.1.0, 2.0.0
Reporter: Saad
Would you be so kind as to let us know whether there is any best practice or
recommended way to bring in a large number of entities on a regular basis?
*Our use case:*
We will be bringing in around 12,000 datasets, 12,000 jobs and 70,000 columns.
We want to do this as part of our deployment pipeline for other upstream
projects.
At every deploy we want to do the following:
- Add the jobs, datasets and columns that are not in Atlas
- Update the jobs, datasets and columns that are in Atlas
- Delete the jobs from Atlas that have been deleted from the upstream systems
(a sketch of this step is included below).
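As an illustration of the delete step, here is a minimal sketch of how we picture reconciling Atlas against the upstream state. It assumes an Atlas server at http://atlas-host:21000 with basic auth, and that the entity headers returned by basic search carry qualifiedName in their attributes; the host, credentials and batch sizes are placeholders, not our real values.

{code:python}
import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2"   # assumed endpoint
AUTH = ("admin", "admin")                             # assumed credentials


def fetch_existing_qualified_names(type_name, page_size=1000):
    """Page through Atlas basic search and collect guid -> qualifiedName
    for every active entity of the given type."""
    existing = {}
    offset = 0
    while True:
        body = {
            "typeName": type_name,
            "excludeDeletedEntities": True,
            "limit": page_size,
            "offset": offset,
        }
        resp = requests.post(f"{ATLAS_URL}/search/basic", json=body, auth=AUTH)
        resp.raise_for_status()
        headers = resp.json().get("entities") or []
        if not headers:
            break
        for h in headers:
            existing[h["guid"]] = h.get("attributes", {}).get("qualifiedName")
        offset += page_size
    return existing


def delete_stale_entities(type_name, desired_qualified_names):
    """Delete entities that exist in Atlas but no longer exist upstream."""
    existing = fetch_existing_qualified_names(type_name)
    stale_guids = [g for g, qn in existing.items()
                   if qn not in desired_qualified_names]
    # DELETE /entity/bulk accepts repeated guid query parameters.
    for i in range(0, len(stale_guids), 100):
        chunk = stale_guids[i:i + 100]
        resp = requests.delete(f"{ATLAS_URL}/entity/bulk",
                               params=[("guid", g) for g in chunk],
                               auth=AUTH)
        resp.raise_for_status()
{code}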
So far we have considered using the bulk API endpoint (/v2/entity/bulk). This
has its own issues: we found that if the payload is too big (in our case, more
than 300-500 entities), the request times out. The deeper the relationships,
the fewer entities we can send through the bulk endpoint.
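For concreteness, splitting the createOrUpdate payload into smaller batches to stay under that limit would look roughly like the sketch below. The endpoint is the v2 bulk API mentioned above; the host, credentials, batch size of 200 and the my_dataset type name are assumptions for illustration only.

{code:python}
import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2"   # assumed endpoint
AUTH = ("admin", "admin")                             # assumed credentials
BATCH_SIZE = 200                                      # below the 300-500 limit we hit


def create_or_update_in_batches(entities):
    """POST entities to /entity/bulk in small batches so each request
    finishes before the server-side timeout."""
    for i in range(0, len(entities), BATCH_SIZE):
        batch = {"entities": entities[i:i + BATCH_SIZE]}
        resp = requests.post(f"{ATLAS_URL}/entity/bulk", json=batch, auth=AUTH)
        resp.raise_for_status()


# Example payload shape: each entity carries a typeName and a unique qualifiedName.
dataset = {
    "typeName": "my_dataset",                          # hypothetical custom type
    "attributes": {
        "qualifiedName": "warehouse.orders@prod",      # hypothetical name
        "name": "orders",
    },
}
# create_or_update_in_batches([dataset, ...])
{code}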
Inspecting some of the code, we feel that both REST and streaming data through
Kafka follow the same code path and ultimately yield the same performance.
Further, we found that when creating entities the type registry becomes the
bottleneck. We discovered this by profiling the JVM: only one core processes
the entities and their relationships.
*Questions:*
1- What is the best practice for bulk loading a large number of entities in a
reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k columns
in less than 10 minutes.
2- Where should we start if we want to scale the API? Is there any known way to
horizontally scale Atlas?
Here are some of the stats from the load testing we did: