[ https://issues.apache.org/jira/browse/IMPALA-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong updated IMPALA-7380:
----------------------------------
    Issue Type: Bug  (was: Sub-task)
        Parent:     (was: IMPALA-2885)

> Untracked memory for file metadata like AvroHeader accumulates until end of query
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7380
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7380
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: resource-management
>
> HdfsScanNodeBase maintains a map of per-file metadata objects for use by
> different scan ranges from the same file, e.g. AvroFileHeader. These are not
> cleaned up until the end of the query.
>
> Note that because of IMPALA-6932 this doesn't necessarily increase peak
> memory significantly (because the headers are all accumulated during the
> header-parsing phase anyway).
>
> We should track the number of scanners remaining for each file and delete the
> headers when we no longer need them.
>
> h2. How to reproduce
>
> Create an Avro table with a large number of files (e.g. 10000).
>
> Run an Avro scan on a single node:
> {code}
> set num_nodes=1;
> select * from table where foo = 'bar';
> {code}
>
> Notice on the /memz debug page that untracked memory increases a lot, then
> drops once the query is cancelled or finishes.
>
> h2. Proposed fix
>
> Values from HdfsScanNodeBase::per_file_metadata_ should be removed and the
> metadata object deleted once all scanners for that file/partition combination
> are finished. We already know the expected number of scan ranges per file
> from HdfsFileDesc::splits so we can delete the object once all scan ranges
> for the file are finished.
>
> I can see two options here, both of which involve evicting members from
> per_file_metadata_ at different points:
> # unique ownership: per_file_metadata_ owns the metadata objects via a
> unique_ptr and maintains a refcount that is decremented by the scanner when
> it is done (e.g. by BaseSequenceScanner::Close()).
> # shared ownership: per_file_metadata_ stores shared_ptr and maintains a
> refcount that is decremented when each scanner makes a copy of the
> shared_ptr.
>
> I think #1 is better since it's more consistent with our usual memory
> management. The nice thing about #2 though is that the interaction with the
> scanners is simpler.
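
For illustration, option #1 (unique ownership with a per-file count of remaining scan ranges) might be sketched roughly as follows. This is not Impala code: PerFileMetadataMap, FileMetadata, Set(), Get() and Release() are hypothetical names invented for this sketch, and the (partition id, filename) key is only an assumption about what "file/partition combination" would look like. In the real backend the count would presumably be seeded from HdfsFileDesc::splits and Release() called from something like BaseSequenceScanner::Close().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for per-file metadata such as an Avro file header.
struct FileMetadata {
  std::string schema;
};

// Hypothetical sketch of option #1: the map owns each metadata object through
// a unique_ptr and evicts it once the last scan range for the file is done.
class PerFileMetadataMap {
 public:
  // Assumed key shape: (partition id, filename).
  using Key = std::pair<int64_t, std::string>;

  // Registers metadata for a file. 'num_splits' is the expected number of
  // scan ranges for the file and seeds the remaining-scanner count.
  void Set(const Key& key, std::unique_ptr<FileMetadata> metadata,
           int num_splits) {
    assert(num_splits > 0);
    std::lock_guard<std::mutex> l(lock_);
    entries_.emplace(key, Entry{std::move(metadata), num_splits});
  }

  // Returns a borrowed pointer, or nullptr if the entry is absent or already
  // evicted. The pointer is only safe to use until the caller's own Release().
  FileMetadata* Get(const Key& key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(key);
    return it == entries_.end() ? nullptr : it->second.metadata.get();
  }

  // Called once per finished scan range. Decrements the count and, when it
  // reaches zero, erases the entry, which destroys the metadata object.
  // Returns true if this call evicted the entry.
  bool Release(const Key& key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(key);
    if (it == entries_.end()) return false;
    assert(it->second.remaining_splits > 0);
    if (--it->second.remaining_splits > 0) return false;
    entries_.erase(it);
    return true;
  }

  // Number of files whose metadata is still held.
  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return entries_.size();
  }

 private:
  struct Entry {
    std::unique_ptr<FileMetadata> metadata;
    int remaining_splits;
  };

  std::mutex lock_;
  std::map<Key, Entry> entries_;
};
```

With a file registered for two scan ranges, the first Release() leaves the entry in place and the second one evicts it, after which Get() returns nullptr. The trade-off mentioned in the issue shows up here: every scanner has to remember to call Release() exactly once, whereas with shared_ptr (option #2) the last copy going out of scope would free the object without an explicit call.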