Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

Enrico Minack Thu, 31 Mar 2022 06:15:46 -0700

How well Spark can scale up with your data (in terms of years of data)depends on two things: the operations performed on the data, andcharacteristics of the data, like value distributions.

Failing tasks smell like you are using operations that do not scale(e.g. Cartesian product of your data, join on low-cardinality row). Butthat could be anything.

Again, the reasons for these failing tasks can be manifold, and withoutthe actual transformations (i.e. your "complex statements"), and somecharacteristics of your data, no specific help is possible.


Enrico


Am 31.03.22 um 10:30 schrieb Joris Billen:

Thanks for reply :-)

I am using pyspark. Basicially my code (simplified is):

df=spark.read.csv(hdfs://somehdfslocation)
df1=spark.sql (complex statement using df)
...
dfx=spark.sql(complex statement using df x-1)
...
dfx15.write()
What exactly is meant by "closing resources"? Is it just unpersistingcached dataframes at the end and stopping the spark contextexplicitly: sc.stop()?
FOr processing many years at once versus a chunk in a loop: I see thatif I go up to certain number of days, one iteration will start to havetasks that fail. So I only take a limited number of days, and do thisprocess several times. Isnt this normal as you are always somehowlimited in terms of resources (I have 9 nodes wiht 32GB). Or is itlike this that in theory you could process any volume, in case youwait long enough? I guess spark can only break down the tasks up to acertain level (based on the datasets' and the intermediate results’partitions) and at some moment you hit the limit where your resourcesare not sufficient anymore to process such one task? Maybe you cantweak it a bit, but in the end you’ll hit a limit?
Concretely following topics would be interesting to find out moreabout (links):-where to see what you are still consuming after spark job ended ifyou didnt close resources
-memory leaks for pyspark
-good article about closing resources (you find tons of snippets onhow to start spark context+ config for number/cores/memory ofworker/executors etc, but never saw a focus on making sure you cleanup —> or is it just stopping the spark context)
On 30 Mar 2022, at 21:24, Bjørn Jørgensen <bjornjorgen...@gmail.com>wrote:
It`s quite impossible for anyone to answer your question about whatis eating your memory, without even knowing what language you are using.
If you are using C then it`s always pointers, that's the mem issue.
If you are using python, there can be some like not using contextmanager like With Context Managers and Python's with Statement<https://eur02.safelinks.protection.outlook.com/?url=https%3A%2F%2Frealpython.com%2Fpython-with-statement%2F&data=04%7C01%7Cjoris.billen%40bigindustries.be%7C4ed0d54ebb1949dd7dc708da1282e90b%7C49c3d703357947bfa8887c913fbdced9%7C0%7C0%7C637842650741571990%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=DfS3z2ahDT9B21NnbbN8AlEI3G2JX2FCwo9ZZCuzPVs%3D&reserved=0>
And another can be not to close resources after use.
In my experience you can process 3 years or more of data, IF you areclosing opened resources.
I use the web GUI http://spark:4040 to follow what spark is doing.
ons. 30. mar. 2022 kl. 17:41 skrev Joris Billen<joris.bil...@bigindustries.be>:
    Thanks for answer-much appreciated! This forum is very useful :-)

    I didnt know the sparkcontext stays alive. I guess this is eating
    up memory.  The eviction means that he knows that he should clear
    some of the old cached memory to be able to store new one. In
    case anyone has good articles about memory leaks I would be
    interested to read.
    I will try to add following lines at the end of my job (as I
    cached the table in spark sql):


    /sqlContext.sql("UNCACHE TABLE mytableofinterest ")/
    /spark.stop()/


    Wrt looping: if I want to process 3 years of data, my modest
    cluster will never do it one go , I would expect? I have to break
    it down in smaller pieces and run that in a loop (1 day is
    already lots of data).



    Thanks!
    On 30 Mar 2022, at 17:25, Sean Owen <sro...@gmail.com> wrote:

    The Spark context does not stop when a job does. It stops when
    you stop it. There could be many ways mem can leak. Caching
    maybe - but it will evict. You should be clearing caches when no
    longer needed.

    I would guess it is something else your program holds on to in
    its logic.

    Also consider not looping; there is probably a faster way to do
    it in one go.

    On Wed, Mar 30, 2022, 10:16 AM Joris Billen
    <joris.bil...@bigindustries.be> wrote:

        Hi,
        I have a pyspark job submitted through spark-submit that
        does some heavy processing for 1 day of data. It runs with
        no errors. I have to loop over many days, so I run this
        spark job in a loop. I notice after couple executions the
        memory is increasing on all worker nodes and eventually this
        leads to faillures. My job does some caching, but I
        understand that when the job ends successfully, then the
        sparkcontext is destroyed and the cache should be cleared.
        However it seems that something keeps on filling the memory
        a bit more and more after each run. THis is the memory
        behaviour over time, which in the end will start leading to
        failures :

        (what we see is: green=physical memory used,
        green-blue=physical memory cached, grey=memory capacity
        =straight line around 31GB )
        This runs on a healthy spark 2.4 and was optimized already
        to come to a stable job in terms of spark-submit resources
        parameters like
        
driver-memory/num-executors/executor-memory/executor-cores/spark.locality.wait).
        Any clue how to “really” clear the memory in between jobs?
        So basically currently I can loop 10x and then need to
        restart my cluster so all memory is cleared completely.


        Thanks for any info!

    <Screenshot 2022-03-30 at 15.28.24.png>
--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

Reply via email to