Hi guys,

I've got:

- 180 days of log data in Parquet.
- Each day is stored in a separate folder in S3.
- Each day consists of 20-30 Parquet files of 256 MB each.
- Spark 1.3 on Amazon EMR.

This makes approximately 5000 Parquet files with a total size of 1.5 TB.
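To make the input concrete, here is a minimal sketch of how a list of 180 day folders could be generated. It assumes the unpadded yyyy=/mm=/dd= layout and the bucket root that appear in the log output further down; the start date and the `DayPaths` helper are hypothetical placeholders, and `java.time` needs Java 8.

```scala
import java.time.LocalDate

object DayPaths {
  // Root as it appears in the log output; adjust for your own bucket.
  val root = "s3n://adt-timelord-daily-logs-pure/logs"

  // One folder path per day, using unpadded month and day numbers
  // (e.g. mm=8, dd=2) to match the layout seen in the logs.
  def dayPaths(start: LocalDate, days: Int): Seq[String] =
    (0 until days).map { i =>
      val d = start.plusDays(i.toLong)
      s"$root/yyyy=${d.getYear}/mm=${d.getMonthValue}/dd=${d.getDayOfMonth}"
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical start date; the real range is not stated in this post.
    val paths = dayPaths(LocalDate.of(2014, 8, 1), 180)
    println(s"${paths.size} day folders, first: ${paths.head}")
    // The resulting sequence would then be handed to sqlContext.parquetFile;
    // the exact varargs form depends on the signature in your Spark version.
  }
}
```

Generating the paths only changes how the list is written down; every day folder is still passed to Spark explicitly.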
My code:

    val in = sqlContext.parquetFile("day1", "day2", …, "day180")

Problem: before the very first stage starts, Spark spends about 25 minutes printing the following:

...
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000059' for reading at position '258305902'
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000072' for reading at position '260897108'
15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading at position '261259189'
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=10/dd=15/bc9c8fdf-dc67-441a-8eda-9a06f032158f-000102' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000060' for reading
15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000073' for reading
… etc

It looks like Spark opens every file before it does any actual work. This means a delay of 25 minutes when working with Parquet files. Previously, we used LZO files and did not experience this problem.

Bonus info: this also happens when I use automatic partition discovery (i.e. sqlContext.parquetFile("/path/to/logsroot/")).

What can I do to avoid this?

Thanks in advance!

Eric Eijkelenboom