You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
Cheers

On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom <eric.eijkelenb...@gmail.com> wrote:

> Hi guys
>
> *I've got:*
>
> - 180 days of log data in Parquet.
> - Each day is stored in a separate folder in S3.
> - Each day consists of 20-30 Parquet files of 256 MB each.
> - Spark 1.3 on Amazon EMR
>
> This makes approximately 5000 Parquet files with a total size of 1.5 TB.
>
> *My code*:
> val in = sqlContext.parquetFile("day1", "day2", …, "day180")
>
> *Problem*:
> Before the very first stage is started, Spark spends about 25 minutes printing the following:
>
> ...
> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000059' for reading at position '258305902'
> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000072' for reading at position '260897108'
> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading at position '261259189'
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=10/dd=15/bc9c8fdf-dc67-441a-8eda-9a06f032158f-000102' for reading
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000060' for reading
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000073' for reading
> … etc.
>
> It looks like Spark is opening each file before it actually does any work. This means a delay of 25 minutes when working with Parquet files. Previously, we used LZO files and did not experience this problem.
>
> *Bonus info:*
> This also happens when I use automatic partition discovery, i.e. sqlContext.parquetFile("/path/to/logsroot/").
>
> What can I do to avoid this?
>
> Thanks in advance!
>
> Eric Eijkelenboom
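For readers reproducing the quoted setup: rather than typing 180 path arguments by hand, the day paths can be generated and passed as varargs in one call. This is a minimal sketch only — `s3n://my-bucket/logs` is a placeholder prefix, not the poster's actual bucket, and the Spark call itself is commented out because it needs a live `SQLContext`:

```scala
// Build the 180 per-day folder paths programmatically.
// NOTE: "s3n://my-bucket/logs/dayN" is a hypothetical layout for illustration.
val dayPaths: Seq[String] = (1 to 180).map(d => s"s3n://my-bucket/logs/day$d")

// Pass all paths to Spark 1.3's varargs parquetFile in a single call:
// val in = sqlContext.parquetFile(dayPaths: _*)
```

This keeps the read as one logical DataFrame over all 180 folders, which is the same shape as the original `parquetFile("day1", …, "day180")` call, just without hand-writing each argument.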