rkundam opened a new pull request, #671: URL: https://github.com/apache/atlas/pull/671
To support high-volume metadata ingestion while preserving lineage accuracy and data consistency, ATLAS-5320 (https://issues.apache.org/jira/browse/ATLAS-5320) introduces a distributed parallel processing architecture. Rather than relying on a single-threaded sequential pipeline, the system partitions entity workloads deterministically and processes independent entity families concurrently.

How was this patch tested?

Tested with different data sets on clusters and compared against serial processing of the same data sets. For example, for the data set below, serial processing took around 5 hours, while distributed parallel processing with 3 metadata topics and 3 lineage topics took 1.5 hours (roughly 3.3 times faster).

Tables: 6.5K
Columns: 130K
Lineage: 1.7K
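
To illustrate what deterministic partitioning with concurrent processing of independent entity families could look like, here is a minimal Java sketch. It is not the code from this pull request: the class `FamilyPartitioner`, its methods, and the rule that a `db.table.column` name belongs to the family `db.table` are all assumptions made for illustration. The idea shown is that a stable hash of the family key picks the partition, so a table and its columns always land together and are handled in order, while different partitions run in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names come from the Atlas code base.
final class FamilyPartitioner {

    // Deterministic: String.hashCode() is specified by the JDK, so the same
    // family key maps to the same partition on every node and on every run.
    static int partitionFor(String familyKey, int partitions) {
        return Math.floorMod(familyKey.hashCode(), partitions);
    }

    // Assumed family rule: "db.table.column" belongs to the family "db.table".
    static String familyOf(String qualifiedName) {
        String[] parts = qualifiedName.split("\\.");
        return parts.length >= 2 ? parts[0] + "." + parts[1] : qualifiedName;
    }

    // Group entity names into buckets keyed by partition number.
    static Map<Integer, List<String>> partition(List<String> names, int partitions) {
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String name : names) {
            int p = partitionFor(familyOf(name), partitions);
            buckets.computeIfAbsent(p, k -> new ArrayList<>()).add(name);
        }
        return buckets;
    }

    // Each bucket is processed sequentially by one task; buckets run concurrently.
    // Returns the number of entities processed.
    static int processAll(Map<Integer, List<String>> buckets) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, buckets.size()));
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> bucket : buckets.values()) {
                results.add(pool.submit(() -> {
                    int done = 0;
                    for (String name : bucket) {
                        done++; // placeholder for the real ingestion step
                    }
                    return done;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("partition processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> names = List.of("sales.orders", "sales.orders.id", "hr.staff", "hr.staff.name");
        Map<Integer, List<String>> buckets = partition(names, 3);
        System.out.println(buckets);
        System.out.println("processed: " + processAll(buckets));
    }
}
```

In this sketch the partition count of 3 simply mirrors the 3 topics used in the test run above; how the patch actually maps families to topics is not stated in this message.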
