You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
Cheers

On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom <eric.eijkelenb...@gmail.com> wrote:

> Hi guys
>
> *I've got:*
>
> - 180 days of log data in Parquet.
> - Each day is stored in a separate folder in S3.
> - Each day consists of 20-30 Parquet files of 256 MB each.
> - Spark 1.3 on Amazon EMR
>
> This makes approximately 5000 Parquet files with a total size of 1.5 TB.
>
> *My code*:
> val in = sqlContext.parquetFile("day1", "day2", …, "day180")
>
> *Problem*:
> Before the very first stage is started, Spark spends about 25 minutes printing the following:
>
> ...
> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000059' for reading at position '258305902'
> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000072' for reading at position '260897108'
> 15/04/08 13:00:26 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening key 'logs/yyyy=2014/mm=11/dd=2/b0752022-3cd0-47e5-9e67-6ad84543e84b-000124' for reading at position '261259189'
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=10/dd=15/bc9c8fdf-dc67-441a-8eda-9a06f032158f-000102' for reading
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=8/dd=30/1b1d213d-101d-45a7-85ff-f1c573bf5ffd-000060' for reading
> 15/04/08 13:00:27 INFO s3native.NativeS3FileSystem: Opening 's3n://adt-timelord-daily-logs-pure/logs/yyyy=2014/mm=11/dd=22/2dbc904b-b341-4e9f-a9bb-5b5933c2e101-000073' for reading
> … etc.
>
> It looks like Spark is opening each file before it actually does any work. This means a delay of 25 minutes when working with Parquet files. Previously, we used LZO files and did not experience this problem.
>
> *Bonus info:*
> This also happens when I use automatic partition discovery, i.e. sqlContext.parquetFile("/path/to/logsroot/").
>
> What can I do to avoid this?
>
> Thanks in advance!
>
> Eric Eijkelenboom
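For readers reproducing the quoted setup: rather than typing 180 path arguments by hand, the day paths can be generated and passed as varargs in one call. This is a minimal sketch only — `s3n://my-bucket/logs` is a placeholder prefix, not the poster's actual bucket, and the Spark call itself is commented out because it needs a live `SQLContext`:

```scala
// Build the 180 per-day folder paths programmatically.
// NOTE: "s3n://my-bucket/logs/dayN" is a hypothetical layout for illustration.
val dayPaths: Seq[String] = (1 to 180).map(d => s"s3n://my-bucket/logs/day$d")

// Pass all paths to Spark 1.3's varargs parquetFile in a single call:
// val in = sqlContext.parquetFile(dayPaths: _*)
```

This keeps the read as one logical DataFrame over all 180 folders, which is the same shape as the original `parquetFile("day1", …, "day180")` call, just without hand-writing each argument.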