Sorry, I realized my previous email might have had the wrong format. Resending with the correct format.
Update: I have done memory profiling and the result seems to suggest a
memory leak. I have opened an issue to discuss this further:
https://github.com/apache/arrow/issues/37630

Attaching the memory profiling result here as well.

On Wed, Sep 6, 2023 at 9:18 PM Gang Wu <ust...@gmail.com> wrote:

> As suggested in other comments, I also highly recommend using a heap
> profiling tool to investigate what's going on there.
>
> BTW, 800 columns look suspicious to me. Could you try to test them
> without reading any batch? Not sure if the file metadata is the root
> cause. Or you may want to try another dataset with a smaller number of
> columns.
>
> On Thu, Sep 7, 2023 at 5:45 AM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Correction:
>>
>> > I tried both of Antoine's suggestions (swapping the default
>> > allocator and calling ReleaseUnused) but neither seems to affect
>> > the max rss.
>>
>> Calling ReleaseUnused does have some effect on the rss - the max rss
>> goes from ~6G to ~5G - but there still seems to be something else.
>>
>> On Wed, Sep 6, 2023 at 4:35 PM Li Jin <ice.xell...@gmail.com> wrote:
>>
>>> Also attaching my experiment code just in case:
>>> https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
>>>
>>> On Wed, Sep 6, 2023 at 4:29 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>
>>>> Reporting back with some new findings.
>>>>
>>>> Re Felipe and Antoine:
>>>> I tried both of Antoine's suggestions (swapping the default
>>>> allocator and calling ReleaseUnused), but neither seems to affect
>>>> the max rss. In addition, I managed to reproduce the issue by
>>>> reading a list of n local parquet files that all point to the same
>>>> file, i.e., {"a.parquet", "a.parquet", ...}. I am also able to
>>>> crash my process by passing a large enough n (I observed the rss
>>>> keep going up until the process eventually gets killed). This
>>>> observation led me to think there might actually be some memory
>>>> leak issue.
>>>>
>>>> Re Xuwei:
>>>> Thanks for the tips. I am going to try the memory profiler next
>>>> and see what I can find.
>>>>
>>>> I am going to keep looking into this, but again, any ideas /
>>>> suggestions are appreciated (and thanks for all the help so far!).
>>>>
>>>> Li
>>>>
>>>> On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>>
>>>>> Thanks all for the additional suggestions. Will try them, but I
>>>>> want to answer Antoine's question first:
>>>>>
>>>>> > Which leads to the question: what is your OS?
>>>>>
>>>>> I am testing this on Debian 5.4.228 x86_64 GNU/Linux.
>>>>>
>>>>> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> By the way, you can try to use a memory profiler like [1] or [2].
>>>>>> It would help to find out how the memory is used.
>>>>>>
>>>>>> Best,
>>>>>> Xuwei Fu
>>>>>>
>>>>>> [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
>>>>>> [2] https://google.github.io/tcmalloc/gperftools.html
>>>>>>
>>>>>> On Thu, Sep 7, 2023 at 00:28 Felipe Oliveira Carvalho
>>>>>> <felipe...@gmail.com> wrote:
>>>>>>
>>>>>>> > (a) stays pretty stable throughout the scan (stays < 1G), (b)
>>>>>>> > keeps increasing during the scan (looks linear to the number
>>>>>>> > of files scanned).
>>>>>>>
>>>>>>> I wouldn't take this to mean a memory leak, but rather the
>>>>>>> memory allocator not paging out virtual memory that has been
>>>>>>> allocated throughout the scan. Could you run your workload
>>>>>>> under a memory profiler?
>>>>>>>
>>>>>>> > (3) Scanning the same dataset twice in the same process
>>>>>>> > doesn't increase the max rss.
>>>>>>>
>>>>>>> Another sign this isn't a leak, just the allocator reaching a
>>>>>>> level of memory commitment that it doesn't feel like undoing.
>>>>>>>
>>>>>>> --
>>>>>>> Felipe
>>>>>>>
>>>>>>> On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I have been testing "What is the max rss needed to scan through
>>>>>>>> ~100G of Parquet data stored in GCS using Arrow C++".
>>>>>>>>
>>>>>>>> The current answer is about ~6G of memory, which seems a bit
>>>>>>>> high, so I looked into it. What I observed during the process
>>>>>>>> led me to think that there are some potential cache/memory
>>>>>>>> issues in the dataset/parquet C++ code.
>>>>>>>>
>>>>>>>> Main observations:
>>>>>>>> (1) As I scan through the dataset, I printed out (a) the memory
>>>>>>>> allocated by the memory pool from ScanOptions and (b) the
>>>>>>>> process rss. I found that while (a) stays pretty stable
>>>>>>>> throughout the scan (stays < 1G), (b) keeps increasing during
>>>>>>>> the scan (looks linear to the number of files scanned).
>>>>>>>> (2) I tested the ScanNode in Arrow as well as an in-house
>>>>>>>> library that implements its own "S3Dataset" similar to Arrow
>>>>>>>> dataset; both show similar rss usage. (This led me to think the
>>>>>>>> issue is more likely to be in the parquet C++ code than in the
>>>>>>>> dataset code.)
>>>>>>>> (3) Scanning the same dataset twice in the same process doesn't
>>>>>>>> increase the max rss.
>>>>>>>>
>>>>>>>> I plan to look into the parquet/dataset C++ code, but I wonder
>>>>>>>> if someone has some clues about what the issue might be or
>>>>>>>> where to look?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Li
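
For anyone who wants to poke at this, below is a rough sketch (not the exact
code from the gist above, which remains the reference) of the kind of repro
described in the thread: reading the same local Parquet file n times through
the Datasets API while printing the default memory pool's bytes_allocated()
next to the process RSS. The path "a.parquet", the value of n, and the
printing interval are all illustrative.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/localfs.h>

namespace ds = arrow::dataset;

// Linux-specific helper: read VmRSS (in kB) from /proc/self/status.
static long CurrentRssKb() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) {
      return std::strtol(line.c_str() + 6, nullptr, 10);
    }
  }
  return -1;
}

arrow::Status RunRepro(int n_files) {
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  // The same file listed n times, i.e. {"a.parquet", "a.parquet", ...}.
  std::vector<std::string> paths(n_files, "a.parquet");

  ARROW_ASSIGN_OR_RAISE(
      auto factory, ds::FileSystemDatasetFactory::Make(
                        fs, paths, format, ds::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());

  // The scan allocates from the default pool unless a different pool is set
  // in ScanOptions; print its usage next to the process RSS as we go.
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  int64_t batches = 0;
  while (true) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->Next());
    if (batch == nullptr) break;
    if (++batches % 100 == 0) {
      std::cout << "batches=" << batches
                << " pool_bytes=" << pool->bytes_allocated()
                << " rss_kb=" << CurrentRssKb() << std::endl;
    }
  }
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunRepro(/*n_files=*/500);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}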
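
The two things Antoine suggested (swapping the default allocator and calling
ReleaseUnused) map roughly to the following; a minimal sketch, with the
placement of the ReleaseUnused() call purely illustrative:

#include <iostream>

#include <arrow/api.h>

int main() {
  // The default allocator can also be switched without code changes by
  // setting ARROW_DEFAULT_MEMORY_POOL=system|jemalloc|mimalloc in the
  // environment before the process starts.
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::cout << "allocator backend: " << pool->backend_name() << std::endl;

  // ... scan a file / some batches here ...

  // Ask the allocator to hand unused (cached) memory back to the OS. In the
  // experiment above this brought max rss from ~6G down to ~5G.
  pool->ReleaseUnused();
  std::cout << "bytes_allocated after release: " << pool->bytes_allocated()
            << std::endl;
  return 0;
}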
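
Gang Wu's "without reading any batch" test could look roughly like the sketch
below: open the wide (~800-column) file repeatedly and touch only the footer
metadata, then watch whether RSS still climbs. The path and iteration count
are made up for illustration.

#include <iostream>
#include <memory>

#include <parquet/file_reader.h>
#include <parquet/metadata.h>

int main() {
  for (int i = 0; i < 500; ++i) {
    // OpenFile() parses the footer; no column data is decoded here.
    std::unique_ptr<parquet::ParquetFileReader> reader =
        parquet::ParquetFileReader::OpenFile("a.parquet");
    std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
    if (i == 0) {
      std::cout << "columns=" << metadata->num_columns()
                << " row_groups=" << metadata->num_row_groups() << std::endl;
    }
    // Track the process RSS across iterations (e.g. from /proc/self/status)
    // to see whether metadata alone accounts for the growth.
  }
  return 0;
}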
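
Finally, a small sketch of wiring in the gperftools heap profiler from [2];
this assumes the binary is linked against tcmalloc, and the dump prefix is
illustrative (profiling can also be turned on without code changes via the
HEAPPROFILE environment variable when tcmalloc is linked in):

#include <gperftools/heap-profiler.h>

void ScanWithHeapProfile() {
  // Writes /tmp/arrow_scan.0001.heap, .0002.heap, ... as allocation grows.
  HeapProfilerStart("/tmp/arrow_scan");

  // ... run the dataset scan here ...

  HeapProfilerDump("after-scan");  // force a dump at this point
  HeapProfilerStop();
}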