Hi Eric - Would you mind trying either disabling schema merging, as
Michael suggested, or disabling the new Parquet data source via
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
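For concreteness, a minimal spark-shell sketch of both workarounds on Spark 1.3 (this assumes the `sqlContext` that spark-shell provides; the S3 path below is only a placeholder, not Eric's actual bucket):

```scala
// Sketch, assuming spark-shell on Spark 1.3 with `sqlContext` in scope.

// Workaround 1: fall back to the old (pre-data-source-API) Parquet code path.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

// Workaround 2: stay on the new code path but skip schema merging,
// so Spark does not read every file's footer up front.
// (The path here is a placeholder.)
val logs = sqlContext.load("s3n://your-bucket/logs", "parquet",
  Map("mergeSchema" -> "false"))
```

Either one alone should be enough to test whether the 25-minute startup delay goes away.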
Cheng
On 4/9/15 2:43 AM, Michael Armbrust wrote:
Thanks for the report. We improved the speed here in 1.3.1, so it would
be interesting to know if that helps. You should also try disabling
schema merging if you do not need that feature (i.e. all of your files
have the same schema):
sqlContext.load("path", "parquet", Map("mergeSchema" -> "false"))
On Wed, Apr 8, 2015 at 7:35 AM, Ted Yu <yuzhih...@gmail.com> wrote:
You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
Cheers
On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom
<eric.eijkelenb...@gmail.com> wrote:
Hi guys
*I’ve got:*
* 180 days of log data in Parquet.
* Each day is stored in a separate folder in S3.
* Each day consists of 20-30 Parquet files of 256 MB each.
* Spark 1.3 on Amazon EMR
This makes approximately 5000 Parquet files with a total size
of 1.5 TB.
*My code*:
val in = sqlContext.parquetFile("day1", "day2", …, "day180")
*Problem*:
Before the very first stage is started, Spark spends about 25
minutes printing the following:
...
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000059' for reading at position '258305902'
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000072' for reading at position '260897108'
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading at position '261259189'
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=10/dd=15/bc9c8fdf-dc67-441a-8eda-9a06f032158f-000102' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000060' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000073' for reading
… etc
It looks like Spark is opening each file before it actually
does any work. This means a delay of 25 minutes when working
with Parquet files. Previously, we used LZO files and did not
experience this problem.
*Bonus info: *
This also happens when I use auto partition discovery (i.e.
sqlContext.parquetFile("/path/to/logsroot/")).
What can I do to avoid this?
Thanks in advance!
Eric Eijkelenboom