Hi guys

I’ve got:
180 days of log data in Parquet.
Each day is stored in a separate folder in S3.
Each day consists of 20-30 Parquet files of 256 MB each.
Spark 1.3 on Amazon EMR
This makes approximately 5000 Parquet files with a total size of 1.5 TB.

My code: 
val in = sqlContext.parquetFile("day1", "day2", …, "day180")
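
For context, a list of day paths like this does not have to be typed out by hand. Below is a minimal sketch of generating it, assuming the yyyy=/mm=/dd= folder layout that appears in the log output quoted in this message. The helper name `dayPaths`, the bucket, and the start date are placeholders for illustration, not the real values, and the sketch assumes Java 8 or later for `java.time`.

```scala
import java.time.LocalDate

object DayPaths {
  // Builds one path per day, following the yyyy=/mm=/dd= layout seen in the log output.
  // `root` and `firstDay` are hypothetical inputs chosen for illustration.
  def dayPaths(root: String, firstDay: LocalDate, days: Int): Seq[String] =
    (0 until days).map { offset =>
      val d = firstDay.plusDays(offset.toLong)
      // Month and day are not zero-padded in the logged keys (e.g. mm=8, dd=2).
      s"$root/yyyy=${d.getYear}/mm=${d.getMonthValue}/dd=${d.getDayOfMonth}"
    }
}

// Usage with Spark (needs a live SQLContext, so shown only as a comment).
// How the sequence is passed depends on the exact parquetFile signature
// in the Spark build in use; this form is an assumption:
//   val paths = DayPaths.dayPaths("s3n://some-bucket/logs", LocalDate.of(2014, 8, 1), 180)
//   val in = sqlContext.parquetFile(paths: _*)
```

Generating the paths this way only changes how the list is produced; it does not by itself change how many files Spark has to open.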

Problem: 
Before the very first stage is started, Spark spends about 25 minutes printing 
the following:
...
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000059' for reading at position '258305902'
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000072' for reading at position '260897108'
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading at position '261259189'
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=10/dd=15/bc9c8fdf-dc67-441a-8eda-9a06f032158f-000102' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000060' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000073' for reading
… etc

It looks like Spark opens every file before it does any actual work, which 
means a delay of 25 minutes whenever we work with these Parquet files. 
Previously, we used LZO files and did not experience this problem.

Bonus info: 
This also happens when I use automatic partition discovery (i.e. 
sqlContext.parquetFile("/path/to/logsroot/")).

What can I do to avoid this? 

Thanks in advance! 

Eric Eijkelenboom
