Adam Rempter created ATLAS-3254:
-----------------------------------

             Summary: Atlas entity with large array of refs causes performance 
issues for lineage
                 Key: ATLAS-3254
                 URL: https://issues.apache.org/jira/browse/ATLAS-3254
             Project: Atlas
          Issue Type: Improvement
          Components:  atlas-core, atlas-webui
    Affects Versions: 2.0.0, 1.0.0
            Reporter: Adam Rempter


We use the “aws_s3_pseudo_dir” type from the 3020-aws_s3_typedefs.json model.

It has the following property:

"name":        "s3Objects",

"typeName":    "array<aws_s3_object>"

 

AWS buckets can hold thousands of objects, so the s3Objects array grows 
quickly and the aws_s3_pseudo_dir entity JSON easily reaches a few MB.
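
To give a feel for how quickly this adds up, here is a rough sketch that 
builds a made-up aws_s3_pseudo_dir entity and measures its JSON size against 
the 1048576-byte audit limit quoted in the log message further down. The 
reference shape (guid plus typeName), the qualifiedName value and the helper 
names are assumptions for illustration only; real Atlas payloads carry more 
fields per reference, so they would hit the limit sooner.

```python
import json
import uuid

AUDIT_MAX_SIZE = 1048576  # maxSize value quoted in the audit log message


def pseudo_dir_entity(num_objects):
    """Build a hypothetical aws_s3_pseudo_dir entity with num_objects refs."""
    refs = [
        # Assumed minimal reference shape: guid + typeName only.
        {"guid": str(uuid.uuid4()), "typeName": "aws_s3_object"}
        for _ in range(num_objects)
    ]
    return {
        "typeName": "aws_s3_pseudo_dir",
        "attributes": {
            "qualifiedName": "s3://example-bucket/data/",  # made-up value
            "s3Objects": refs,
        },
    }


def json_size(entity):
    """Size in bytes of the compact JSON encoding of the entity."""
    return len(json.dumps(entity, separators=(",", ":")).encode("utf-8"))


for n in (1_000, 10_000, 20_000):
    size = json_size(pseudo_dir_entity(n))
    print(n, size, "over audit limit" if size > AUDIT_MAX_SIZE else "ok")
```

With this minimal encoding each reference costs roughly 75 bytes, so the 
limit is crossed at around fourteen thousand objects in one directory.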

 

Then we start seeing problems like:
 * The UI fails when displaying entity properties or lineage
 * Error in logs: audit record too long: entityType=aws_s3_pseudo_dir, 
guid=24398271-6ba0-4db5-adfa-38e432dc55ce, size=1053931; maxSize=1048576. 
entity attribute values not stored in audit (EntityAuditListenerV2:234)
 * Errors when writing to HBase (java.lang.IllegalArgumentException: 
KeyValue size too large); as a workaround we set 
hbase.client.keyvalue.maxsize to 0
 * Kafka consumer errors (we can of course tune some consumer parameters, but 
I think that is just a workaround)

…

Exception in NotificationHookConsumer (NotificationHookConsumer:332)

org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be 
completed since the group has already rebalanced and assigned the partitions to 
another member. This means that the time between subsequent calls to poll() 
was longer than the configured max.poll.interval.ms, which typically implies 
that the poll loop is spending too much time message processing. You can 
address this either by increasing the session timeout or by reducing the 
maximum size of batches returned in poll() with max.poll.records.
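
The arithmetic behind that message can be sketched as follows. The two 
constants are the Kafka consumer defaults as far as I know (an Atlas 
deployment may override them), and the function names and per-record 
timings are purely illustrative, not measured values.

```python
MAX_POLL_INTERVAL_MS = 300_000  # Kafka consumer default for max.poll.interval.ms
MAX_POLL_RECORDS = 500          # Kafka consumer default for max.poll.records


def exceeds_poll_interval(records_per_poll, ms_per_record,
                          max_poll_interval_ms=MAX_POLL_INTERVAL_MS):
    """True if processing one polled batch takes longer than the interval."""
    return records_per_poll * ms_per_record > max_poll_interval_ms


def max_safe_records(ms_per_record,
                     max_poll_interval_ms=MAX_POLL_INTERVAL_MS):
    """Largest batch size (at least 1) that still fits in the interval."""
    if ms_per_record <= 0:
        raise ValueError("ms_per_record must be positive")
    return max(1, max_poll_interval_ms // ms_per_record)


# Hypothetical timings: small entities vs. multi-MB pseudo_dir entities.
for ms in (100, 1_000):
    print(ms, exceeds_poll_interval(MAX_POLL_RECORDS, ms), max_safe_records(ms))
```

Lowering max.poll.records or raising max.poll.interval.ms only moves the 
threshold; the per-message cost still grows with the size of the entity.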

…

Specifying the pseudo directory is mandatory for S3 objects (isOptional is 
false):

"name": "pseudoDirectory",
"typeName": "aws_s3_pseudo_dir",
"cardinality": "SINGLE",
"isIndexable": false,
*"isOptional": false,*
"isUnique": false,

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
