[ 
https://issues.apache.org/jira/browse/IMPALA-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-7380:
----------------------------------
    Issue Type: Bug  (was: Sub-task)
        Parent:     (was: IMPALA-2885)

> Untracked memory for file metadata like AvroHeader accumulates until end of 
> query
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-7380
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7380
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Tim Armstrong
>            Priority: Major
>              Labels: resource-management
>
> HdfsScanNodeBase maintains a map of per-file metadata objects for use by 
> different scan ranges from the same file, e.g. AvroFileHeader. These are not 
> cleaned up until the end of the query.
> Note that because of IMPALA-6932 this doesn't necessarily increase peak 
> memory significantly (because the headers are all accumulated during the 
> header-parsing phase anyway).
> We should track the number of scanners remaining for each file and delete the 
> headers when we no longer need them.
> h2. How to reproduce 
> Create an Avro table with a large number of files (e.g. 10000).
> Run an Avro scan on a single node:
> {code}
> set num_nodes=1;
> select * from table where foo = 'bar';
> {code}
> Notice on the /memz debug page that untracked memory increases a lot, then 
> drops once the query is cancelled or finishes.
> h2. Proposed fix 
> Values from HdfsScanNodeBase::per_file_metadata_ should be removed and the 
> metadata object deleted once all scanners for that file/partition combination 
> are finished. We already know the expected number of scan ranges per file 
> from HdfsFileDesc::splits so we can delete the object once all scan ranges 
> for the file are finished.
> I can see two options here. Both evict entries from per_file_metadata_, but 
> at different points:
> # unique ownership: per_file_metadata_ owns the metadata objects via a 
> unique_ptr and maintains a refcount that each scanner decrements when it is 
> done (e.g. in BaseSequenceScanner::Close()). The entry is evicted when the 
> refcount reaches zero.
> # shared ownership: per_file_metadata_ stores a shared_ptr and maintains a 
> refcount that is decremented each time a scanner makes a copy of the 
> shared_ptr. The entry is evicted when the refcount reaches zero, and the 
> object is freed when the last scanner drops its copy.
> I think #1 is better since it's more consistent with our usual memory 
> management. The nice thing about #2 though is that the interaction with the 
> scanners is simpler.
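
The two options above could be sketched roughly as follows. This is an illustration only, not Impala code: FileMetadata, UniqueOwnershipMap, SharedOwnershipMap and their methods are hypothetical names, the map is keyed by file name alone rather than by file/partition, and the locking that a real scan node would need around the map is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for per-file metadata such as AvroFileHeader.
struct FileMetadata {
  std::string schema;
};

// Option 1 (unique ownership): the map owns each object and counts the
// scan ranges that still need it. Scanners only borrow a raw pointer.
class UniqueOwnershipMap {
 public:
  // 'remaining' is assumed to be the number of scan ranges for the file.
  void Add(const std::string& file, std::unique_ptr<FileMetadata> md, int remaining) {
    assert(remaining > 0);
    entries_[file] = Entry{std::move(md), remaining};
  }

  // Returns a borrowed pointer, or nullptr if the entry was evicted or never added.
  FileMetadata* Get(const std::string& file) {
    auto it = entries_.find(file);
    return it == entries_.end() ? nullptr : it->second.metadata.get();
  }

  // Called once per scan range when its scanner is done (e.g. from Close()).
  // Erasing the entry destroys the metadata object.
  void Release(const std::string& file) {
    auto it = entries_.find(file);
    if (it == entries_.end()) return;
    if (--it->second.remaining == 0) entries_.erase(it);
  }

  std::size_t size() const { return entries_.size(); }

 private:
  struct Entry {
    std::unique_ptr<FileMetadata> metadata;
    int remaining;
  };
  std::map<std::string, Entry> entries_;
};

// Option 2 (shared ownership): each scanner takes its own shared_ptr copy.
// The map drops its reference once the last expected copy has been handed out;
// the object is freed when the last scanner releases its copy.
class SharedOwnershipMap {
 public:
  void Add(const std::string& file, std::shared_ptr<FileMetadata> md, int remaining) {
    assert(remaining > 0);
    entries_[file] = Entry{std::move(md), remaining};
  }

  // Returns a copy of the shared_ptr, or nullptr if the entry is gone.
  std::shared_ptr<FileMetadata> Acquire(const std::string& file) {
    auto it = entries_.find(file);
    if (it == entries_.end()) return nullptr;
    std::shared_ptr<FileMetadata> copy = it->second.metadata;
    if (--it->second.remaining == 0) entries_.erase(it);
    return copy;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  struct Entry {
    std::shared_ptr<FileMetadata> metadata;
    int remaining;
  };
  std::map<std::string, Entry> entries_;
};
```

In this sketch, option 1 needs an explicit Release() call from every scanner, while option 2 only needs the scanner to let its shared_ptr go out of scope, which is the simpler scanner interaction mentioned above.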



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
