[
https://issues.apache.org/jira/browse/ATLAS-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ashutosh Mestry reassigned ATLAS-4389:
--------------------------------------
Assignee: Ashutosh Mestry
> Best practice or a way to bring in large number of entities on a regular
> basis.
> -------------------------------------------------------------------------------
>
> Key: ATLAS-4389
> URL: https://issues.apache.org/jira/browse/ATLAS-4389
> Project: Atlas
> Issue Type: Bug
> Components: atlas-core
> Affects Versions: 2.0.0, 2.1.0
> Reporter: Saad
> Assignee: Ashutosh Mestry
> Priority: Major
> Labels: documentation, newbie, performance
> Attachments: image-2021-08-05-11-22-29-259.png,
> image-2021-08-05-11-23-05-440.png
>
>
> Would you be so kind as to let us know whether there is a best practice, or
> any way, to bring in a large number of entities on a regular basis.
> *Our use case:*
> We will be bringing in around 12,000 datasets, 12,000 jobs and 70,000
> columns. We want to do this as part of our deployment pipeline for other
> upstream projects.
> At every deploy we want to do the following:
> - Add the jobs, datasets and columns that are not in Atlas
> - Update the jobs, datasets and columns that are in Atlas
> - Delete the jobs from Atlas that are deleted from the upstream systems.
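The per-deploy steps above amount to a three-way diff between the desired entity set and what Atlas already holds. A minimal sketch of that planning step, keying entities by `qualifiedName` (the function name, keys, and attribute shapes below are illustrative, not Atlas API calls):

```python
# Plan a sync: diff the desired entities against those already in Atlas,
# keyed by qualifiedName. Attribute dicts are illustrative placeholders.
def plan_sync(desired, existing):
    """Return (to_create, to_update, to_delete) lists of qualifiedNames."""
    desired_keys = set(desired)
    existing_keys = set(existing)
    to_create = sorted(desired_keys - existing_keys)   # in upstream, not in Atlas
    to_delete = sorted(existing_keys - desired_keys)   # in Atlas, gone upstream
    to_update = sorted(k for k in desired_keys & existing_keys
                       if desired[k] != existing[k])   # present in both, changed
    return to_create, to_update, to_delete

desired = {"job.a": {"owner": "x"}, "job.b": {"owner": "y"}}
existing = {"job.b": {"owner": "old"}, "job.c": {"owner": "z"}}
plan = plan_sync(desired, existing)
# -> create ["job.a"], update ["job.b"], delete ["job.c"]
```

Each resulting list would then be fed to the corresponding create/update/delete calls against Atlas.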
> So far we have considered using the bulk API endpoint (/v2/entity/bulk). This
> has its own issues. We found that if the payload is too big (in our case,
> more than 300-500 entities), the request times out. The deeper the
> relationships, the fewer entities you can send through the bulk endpoint.
> Inspecting the code, we believe that both REST and streaming data through
> Kafka follow the same code path and ultimately yield the same performance.
> Further, we found that when creating entities the type registry becomes the
> bottleneck. We discovered this by profiling the JVM: only one core processes
> the entities and their relationships.
> *Questions:*
> 1- What is the best practice for bulk loading a large number of entities in a
> reasonable time? We are aiming to load 12k jobs, 12k datasets and 70k columns
> in less than 10 minutes.
> 2- Where should we start if we want to scale the API? Is there any known way
> to horizontally scale Atlas?
> Here are some of the stats from the load testing we did:
>
> !image-2021-08-05-11-23-05-440.png!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)