[
https://issues.apache.org/jira/browse/IMPALA-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bikramjeet Vig reassigned IMPALA-7380:
--------------------------------------
Assignee: Alice Fan (was: Yongjun Zhang)
> Untracked memory for file metadata like AvroHeader accumulates until end of
> query
> ---------------------------------------------------------------------------------
>
> Key: IMPALA-7380
> URL: https://issues.apache.org/jira/browse/IMPALA-7380
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Tim Armstrong
> Assignee: Alice Fan
> Priority: Major
> Labels: resource-management
>
> HdfsScanNodeBase maintains a map of per-file metadata objects for use by
> different scan ranges from the same file, e.g. AvroFileHeader. These are not
> cleaned up until the end of the query.
> Note that because of IMPALA-6932 this doesn't necessarily increase peak
> memory significantly (because the headers are all accumulated during the
> header-parsing phase anyway).
> We should track the number of scanners remaining for each file and delete the
> headers when we no longer need them.
> h2. How to reproduce
> Create an Avro table with a large number of files (e.g. 10000).
> Run an Avro scan on a single node:
> {code}
> set num_nodes=1;
> select * from table where foo = 'bar';
> {code}
> Watch the /memz debug page: untracked memory grows substantially while the
> scan runs, then drops once the query is cancelled or finishes.
> h2. Proposed fix
> Entries in HdfsScanNodeBase::per_file_metadata_ should be removed, and the
> metadata object deleted, once all scanners for that file/partition combination
> are finished. HdfsFileDesc::splits already gives the expected number of scan
> ranges per file, so that count tells us when the last scan range for the
> file is done.
> I can see two options here, both of which involve evicting members from
> per_file_metadata_ at different points:
> # unique ownership: per_file_metadata_ owns the metadata objects via a
> unique_ptr and maintains a refcount that is decremented by the scanner when
> it is done (e.g. by BaseSequenceScanner::Close()).
> # shared ownership: per_file_metadata_ stores a shared_ptr and maintains a
> refcount that is decremented each time a scanner makes a copy of the
> shared_ptr; the entry can be evicted once every scanner holds its own copy.
> I think #1 is better since it's more consistent with our usual memory
> management. The nice thing about #2 though is that the interaction with the
> scanners is simpler.
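For illustration, option #1 could be sketched roughly as below. This is a minimal standalone sketch, not Impala code: PerFileMetadataMap, FileMetadata, Add, Get and ReleaseScanRange are hypothetical names, the string key stands in for the file/partition combination, and the count passed to Add is assumed to be the number of splits recorded for the file.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for per-file metadata such as an Avro file header.
struct FileMetadata {
  std::string payload;
};

// Option #1 sketch: the map is the sole owner of each metadata object and
// keeps a count of scan ranges that have not been released yet.
class PerFileMetadataMap {
 public:
  // Registers metadata for a file. 'num_scan_ranges' is the expected number
  // of scan ranges for the file and is assumed to be positive.
  void Add(const std::string& file_key, std::unique_ptr<FileMetadata> metadata,
           int num_scan_ranges) {
    std::lock_guard<std::mutex> l(lock_);
    Entry& e = entries_[file_key];
    e.metadata = std::move(metadata);
    e.remaining = num_scan_ranges;
  }

  // Returns a borrowed pointer, or nullptr if the file is unknown or its
  // metadata was already evicted. The pointer stays valid as long as the
  // caller has not yet released its own scan range.
  FileMetadata* Get(const std::string& file_key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(file_key);
    return it == entries_.end() ? nullptr : it->second.metadata.get();
  }

  // Called once per scan range when its scanner is done. Evicts and frees
  // the metadata when the last scan range is released; returns true then.
  bool ReleaseScanRange(const std::string& file_key) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(file_key);
    if (it == entries_.end()) return false;
    if (--it->second.remaining > 0) return false;
    entries_.erase(it);  // unique_ptr destructor deletes the metadata here
    return true;
  }

  // Number of files whose metadata is still held.
  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return entries_.size();
  }

 private:
  struct Entry {
    std::unique_ptr<FileMetadata> metadata;
    int remaining = 0;
  };
  std::mutex lock_;
  std::unordered_map<std::string, Entry> entries_;
};
```

In this sketch a scanner would call Get() when it starts on a scan range and ReleaseScanRange() from its close path. Under option #2 the map would instead hand out a shared_ptr copy and could drop its own entry as soon as the last expected scanner has taken one, so scanners would not need to call back into the map when they finish.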
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)