Sorry, I realized my previous email might have had the wrong format. Resending with the correct format.
Update: I have done memory profiling and the result seems to suggest a
memory leak. I have opened an issue to discuss this further:
https://github.com/apache/arrow/issues/37630

Attaching the memory profiling result here as well.

On Wed, Sep 6, 2023 at 9:18 PM Gang Wu <ust...@gmail.com> wrote:

> As suggested in other comments, I also highly recommend using a heap
> profiling tool to investigate what's going on there.
>
> BTW, 800 columns look suspicious to me. Could you try to test them
> without reading any batch? Not sure if the file metadata is the root
> cause. Or you may want to try another dataset with a smaller number of
> columns.
>
> On Thu, Sep 7, 2023 at 5:45 AM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Correction:
>>
>> > I tried both of Antoine's suggestions (swapping the default
>> > allocator and calling ReleaseUnused) but neither seems to affect
>> > the max rss.
>>
>> Calling ReleaseUnused does have some effect on the rss - the max rss
>> goes from ~6G to ~5G - but there still seems to be something else.
>>
>> On Wed, Sep 6, 2023 at 4:35 PM Li Jin <ice.xell...@gmail.com> wrote:
>>
>>> Also attaching my experiment code just in case:
>>> https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
>>>
>>> On Wed, Sep 6, 2023 at 4:29 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>
>>>> Reporting back with some new findings.
>>>>
>>>> Re Felipe and Antoine:
>>>> I tried both of Antoine's suggestions (swapping the default
>>>> allocator and calling ReleaseUnused), but neither seems to affect
>>>> the max rss. In addition, I managed to reproduce the issue by
>>>> reading a list of n local parquet files that all point to the same
>>>> file, i.e., {"a.parquet", "a.parquet", ...}. I am also able to
>>>> crash my process by passing a large enough n (I observed the rss
>>>> keep going up until the process eventually gets killed). This
>>>> observation led me to think there might actually be some memory
>>>> leak issue.
>>>>
>>>> Re Xuwei:
>>>> Thanks for the tips. I am going to try the memory profiler next
>>>> and see what I can find.
>>>>
>>>> I am going to keep looking into this, but again, any ideas /
>>>> suggestions are appreciated (and thanks for all the help so far!).
>>>>
>>>> Li
>>>>
>>>> On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>>
>>>>> Thanks all for the additional suggestions. Will try them, but I
>>>>> want to answer Antoine's question first:
>>>>>
>>>>> > Which leads to the question: what is your OS?
>>>>>
>>>>> I am testing this on Debian 5.4.228 x86_64 GNU/Linux.
>>>>>
>>>>> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> By the way, you can try to use a memory profiler like [1] or [2].
>>>>>> It would help to find out how the memory is used.
>>>>>>
>>>>>> Best,
>>>>>> Xuwei Fu
>>>>>>
>>>>>> [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
>>>>>> [2] https://google.github.io/tcmalloc/gperftools.html
>>>>>>
>>>>>> On Thu, Sep 7, 2023 at 00:28 Felipe Oliveira Carvalho
>>>>>> <felipe...@gmail.com> wrote:
>>>>>>
>>>>>>> > (a) stays pretty stable throughout the scan (stays < 1G), (b)
>>>>>>> > keeps increasing during the scan (looks linear to the number
>>>>>>> > of files scanned).
>>>>>>>
>>>>>>> I wouldn't take this to mean a memory leak, but rather the
>>>>>>> memory allocator not paging out virtual memory that has been
>>>>>>> allocated throughout the scan. Could you run your workload
>>>>>>> under a memory profiler?
>>>>>>>
>>>>>>> > (3) Scanning the same dataset twice in the same process
>>>>>>> > doesn't increase the max rss.
>>>>>>>
>>>>>>> Another sign this isn't a leak, just the allocator reaching a
>>>>>>> level of memory commitment that it doesn't feel like undoing.
>>>>>>>
>>>>>>> --
>>>>>>> Felipe
>>>>>>>
>>>>>>> On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have been testing "What is the max rss needed to scan through
>>>>>>>> ~100G of Parquet data stored in GCS using Arrow C++".
>>>>>>>>
>>>>>>>> The current answer is about ~6G of memory, which seems a bit
>>>>>>>> high, so I looked into it. What I observed during the process
>>>>>>>> led me to think that there are some potential cache/memory
>>>>>>>> issues in the dataset/parquet C++ code.
>>>>>>>>
>>>>>>>> Main observations:
>>>>>>>> (1) As I scan through the dataset, I printed out (a) the memory
>>>>>>>> allocated by the memory pool from ScanOptions and (b) the
>>>>>>>> process rss. I found that while (a) stays pretty stable
>>>>>>>> throughout the scan (stays < 1G), (b) keeps increasing during
>>>>>>>> the scan (looks linear to the number of files scanned).
>>>>>>>> (2) I tested the ScanNode in Arrow as well as an in-house
>>>>>>>> library that implements its own "S3Dataset" similar to Arrow
>>>>>>>> dataset; both show similar rss usage. (This led me to think the
>>>>>>>> issue is more likely to be in the parquet C++ code than in the
>>>>>>>> dataset code.)
>>>>>>>> (3) Scanning the same dataset twice in the same process doesn't
>>>>>>>> increase the max rss.
>>>>>>>>
>>>>>>>> I plan to look into the parquet/dataset C++ code, but I wonder
>>>>>>>> if someone has some clues about what the issue might be or
>>>>>>>> where to look?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Li
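
For anyone who wants to poke at this, below is a rough sketch (not the exact
code from the gist above, which remains the reference) of the kind of repro
described in the thread: reading the same local Parquet file n times through
the Datasets API while printing the default memory pool's bytes_allocated()
next to the process RSS. The path "a.parquet", the value of n, and the
printing interval are all illustrative.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/localfs.h>

namespace ds = arrow::dataset;

// Linux-specific helper: read VmRSS (in kB) from /proc/self/status.
static long CurrentRssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) {
      return std::strtol(line.c_str() + 6, nullptr, 10);
    }
  }
  return -1;
}

arrow::Status RunRepro(int n_files) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  // The same file listed n times, i.e. {"a.parquet", "a.parquet", ...}.
  std::vector<std::string> paths(n_files, "a.parquet");

  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(
                        fs, paths, format, ds::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());

  // The scan allocates from the default pool unless a different pool is set
  // in ScanOptions; print its usage next to the process RSS as we go.
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  int64_t batches = 0;
  while (true) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->Next());
    if (batch == nullptr) break;
    if (++batches % 100 == 0) {
      std::cout << "batches=" << batches
                << " pool_bytes=" << pool->bytes_allocated()
                << " rss_kb=" << CurrentRssKb() << std::endl;
    }
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunRepro(/*n_files=*/500);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}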
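
The two things Antoine suggested (swapping the default allocator and calling
ReleaseUnused) map roughly to the following; a minimal sketch, with the
placement of the ReleaseUnused() call purely illustrative:

#include <iostream>

#include <arrow/api.h>

int main() {
  // The default allocator can also be switched without code changes by
  // setting ARROW_DEFAULT_MEMORY_POOL=system|jemalloc|mimalloc in the
  // environment before the process starts.
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::cout << "allocator backend: " << pool->backend_name() << std::endl;

  // ... scan a file / some batches here ...

  // Ask the allocator to hand unused (cached) memory back to the OS. In the
  // experiment above this brought max rss from ~6G down to ~5G.
  pool->ReleaseUnused();
  std::cout << "bytes_allocated after release: " << pool->bytes_allocated()
            << std::endl;
  return 0;
}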
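
Gang Wu's "without reading any batch" test could look roughly like the sketch
below: open the wide (~800-column) file repeatedly and touch only the footer
metadata, then watch whether RSS still climbs. The path and iteration count
are made up for illustration.

#include <iostream>
#include <memory>

#include <parquet/file_reader.h>
#include <parquet/metadata.h>

int main() {
  for (int i = 0; i < 500; ++i) {
    // OpenFile() parses the footer; no column data is decoded here.
    std::unique_ptr<parquet::ParquetFileReader> reader =
        parquet::ParquetFileReader::OpenFile("a.parquet");
    std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
    if (i == 0) {
      std::cout << "columns=" << metadata->num_columns()
                << " row_groups=" << metadata->num_row_groups() << std::endl;
    }
    // Track the process RSS across iterations (e.g. from /proc/self/status)
    // to see whether metadata alone accounts for the growth.
  }
  return 0;
}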
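
Finally, a small sketch of wiring in the gperftools heap profiler from [2];
this assumes the binary is linked against tcmalloc, and the dump prefix is
illustrative (profiling can also be turned on without code changes via the
HEAPPROFILE environment variable when tcmalloc is linked in):

#include <gperftools/heap-profiler.h>

void ScanWithHeapProfile() {
  // Writes /tmp/arrow_scan.0001.heap, .0002.heap, ... as allocation grows.
  HeapProfilerStart("/tmp/arrow_scan");

  // ... run the dataset scan here ...

  HeapProfilerDump("after-scan");  // force a dump at this point
  HeapProfilerStop();
}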