Yes, a single file compressed with a non-splittable codec (e.g. gzip) has to be read by a single executor. That takes forever.

You should consider recompressing the file with a splittable codec first. You will not want to read that file more than once, so uncompress it only once (in order to recompress it).
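
If it helps, here is a minimal sketch in Scala (the paths are illustrative; bzip2 is one splittable codec the CSV writer supports, though a columnar format like Parquet is often the better target):

    // Read the gzipped CSV once. gzip is not splittable, so this
    // read runs as a single task.
    val df = spark.read
      .option("header", "true")
      .csv("/data/input.csv.gz")

    // Write it back with a splittable codec so every later job
    // can read it in parallel.
    df.write
      .option("compression", "bzip2")
      .csv("/data/input-splittable")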

Enrico


On 22.06.22 at 20:17, Sid wrote:
Hi Enrico,

Thanks for the insights.

Could you please help me understand, with an example, what kind of compressed file wouldn't be split into partitions, would put the entire load on a single partition, and might lead to an OOM error?

Thanks,
Sid

On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack <i...@enrico.minack.dev> wrote:

    The RAM and disk memory consumption depends on what you do with the
    data after reading it.

    Your particular action will read 20 rows from the first partition
    and show them. So it will use hardly any RAM or disk, no matter how
    large the CSV is.

    If you do a count instead of a show, it will iterate over each
    partition and return a count per partition, so no significant RAM
    is needed here either.

    If you do some real processing of the data, the required RAM and
    disk again depend on the shuffles involved and the intermediate
    results that need to be stored in RAM or on disk.
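
    To make that concrete, a rough sketch (df stands for the DataFrame
    from your read; "some_column" is just a placeholder):

        // show() pulls only the first 20 rows from the first
        // partition, so memory use is tiny regardless of file size.
        df.show(false)

        // count() streams through each partition and sums the
        // per-partition counts; rows are not held in memory.
        df.count()

        // A wide transformation like groupBy introduces a shuffle;
        // only then do RAM and disk requirements really matter.
        df.groupBy("some_column").count().show()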

    Enrico


    On 22.06.22 at 14:54, Deepak Sharma wrote:
    It will spill to disk if everything can't be loaded in memory.


    On Wed, 22 Jun 2022 at 5:58 PM, Sid <flinkbyhe...@gmail.com> wrote:

        I have a 150TB CSV file.

        I have a total of 100 TB of RAM and 100 TB of disk. So if I do
        something like this:

        spark.read.option("header","true").csv(filepath).show(false)

        Will it lead to an OOM error since it doesn't have enough
        memory? Or will it spill data onto the disk and process it?

        Thanks,
        Sid

    --
    Thanks
    Deepak
    www.bigdatabig.com
    www.keosha.net

