Thanks for the report. We improved the speed here in 1.3.1, so it would be
interesting to know if this helps. You should also try disabling schema
merging if you do not need that feature (i.e. all of your files have the
same schema).

sqlContext.load("parquet", Map("path" -> "path", "mergeSchema" -> "false"))
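
With merging turned off, Spark should only need to read a single footer to
figure out the schema, rather than opening every file. If you already know
the schema, another thing you could try (untested sketch; bucket and paths
below are placeholders) is to supply it up front:

// Grab the schema from one day's data, then reuse it for the full load.
val oneDay = sqlContext.parquetFile("s3n://bucket/logs/day1")
val all = sqlContext.load("parquet", oneDay.schema,
  Map("path" -> "s3n://bucket/logs", "mergeSchema" -> "false"))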

On Wed, Apr 8, 2015 at 7:35 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
>
> Cheers
>
> On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom <
> eric.eijkelenb...@gmail.com> wrote:
>
>> Hi guys
>>
>> *I’ve got:*
>>
>>    - 180 days of log data in Parquet.
>>    - Each day is stored in a separate folder in S3.
>>    - Each day consists of 20-30 Parquet files of 256 MB each.
>>    - Spark 1.3 on Amazon EMR.
>>
>> This makes approximately 5000 Parquet files with a total size of 1.5 TB.
>>
>> *My code*:
>> val in = sqlContext.parquetFile("day1", "day2", …, "day180")
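>>
>> For completeness, the 180 day paths are built programmatically, roughly
>> like this (bucket name is a placeholder):
>>
>> val days = (1 to 180).map(i => s"s3n://my-bucket/logs/day$i")
>> val in = sqlContext.parquetFile(days: _*)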
>>
>> *Problem*:
>> Before the very first stage is started, Spark spends about 25 minutes
>> printing the following:
>> ...
>> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key
>> 'logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000059' for
>> reading at position '258305902'
>> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key
>> 'logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000072'
>> for reading at position '260897108'
>> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening '
>> s3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124'
>> for reading
>> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening key
>> 'logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for
>> reading at position '261259189'
>> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening '
>> s3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=10/dd=15/bc9c8fdf-dc67-441a-8eda-9a06f032158f-000102'
>> for reading
>> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening '
>> s3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000060'
>> for reading
>> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening '
>> s3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000073'
>> for reading
>> … etc
>>
>> It looks like Spark is opening each file before it actually does any
>> work. This means a delay of 25 minutes when working with Parquet files.
>> Previously, we used LZO files and did not experience this problem.
>>
>> *Bonus info: *
>> This also happens when I use auto partition discovery (i.e.
>> sqlContext.parquetFile("/path/to/logsroot/")).
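>>
>> The folder layout follows the yyyy=/mm=/dd= pattern visible in the log
>> lines above, so discovery exposes yyyy, mm and dd as partition columns.
>> A sketch of the kind of query I run (bucket name is a placeholder):
>>
>> val in = sqlContext.parquetFile("s3n://my-bucket/logs/")
>> in.filter("yyyy = 2014 and mm = 11").count()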
>>
>> What can I do to avoid this?
>>
>> Thanks in advance!
>>
>> Eric Eijkelenboom
>>
>>
>
