Upgrade to parquet 1.6.0

2015-06-12 Thread Eric Eijkelenboom
Hi, what is the reason that Spark still ships with Parquet 1.6.0rc3? Newer Parquet versions are available (e.g. 1.6.0). Upgrading would fix problems with ‘spark.sql.parquet.filterPushdown’, which is currently disabled by default because of a bug in Parquet 1.6.0rc3. Thanks! Eric
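
For readers who want to try the setting mentioned above, here is a minimal sketch of turning it on for one session. It assumes a Spark 1.x application launched through spark-submit; the object name and app name are made up for illustration. On a build that still bundles Parquet 1.6.0rc3, enabling the flag may expose the bug the message refers to, so treat this as an experiment and not a recommendation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object; the name is illustrative only.
object EnableFilterPushdown {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("filter-pushdown"))
    val sqlContext = new SQLContext(sc)

    // Override the default (disabled) for this SQLContext only.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // Read the value back to confirm the override took effect.
    println(sqlContext.getConf("spark.sql.parquet.filterPushdown"))

    sc.stop()
  }
}
```

The override applies only to this SQLContext, which makes it easy to compare query results with the flag on and off.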

Re: Upgrade to parquet 1.6.0

2015-06-12 Thread Eric Eijkelenboom
to 1.7.0 (which is exactly the same as 1.6.0 with package name renamed from com.twitter to org.apache.parquet) on master branch recently. Cheng On 6/12/15 6:16 PM, Eric Eijkelenboom wrote: Hi What is the reason that Spark still comes with Parquet 1.6.0rc3? It seems like newer Parquet

Re: Parquet number of partitions

2015-05-07 Thread Eric Eijkelenboom
in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), where x = number of partitions. Depending on your case, repartition could be a heavy task. Regards. Miguel. On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom eric.eijkelenb...@gmail.com

Parquet number of partitions

2015-05-05 Thread Eric Eijkelenboom
Hello guys Q1: How does Spark determine the number of partitions when reading a Parquet file? val df = sqlContext.parquetFile(path) Is it in some way related to the number of Parquet row groups in my input? Q2: How can I reduce this number of partitions? Doing this: df.rdd.coalesce(200).count
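
The two questions above can be explored with a short sketch like the following. It is an illustration under assumptions, not the poster's actual job: it assumes a Spark 1.3-era application started with spark-submit, the input path is taken from the first command-line argument, and the object name is hypothetical. It prints the partition count Spark chose when reading, then the count after `coalesce(200)`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object; the name is illustrative only.
object ParquetPartitions {
  def main(args: Array[String]): Unit = {
    // Assumed: the Parquet location is passed as the first argument.
    val path = args(0)

    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("parquet-partitions"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.parquetFile(path)

    // Q1: inspect how many partitions the read produced.
    println(s"partitions after read: ${df.rdd.partitions.length}")

    // Q2: coalesce merges existing partitions without a full shuffle.
    // It never increases the count, so the result has at most 200 partitions.
    val fewer = df.rdd.coalesce(200)
    println(s"partitions after coalesce: ${fewer.partitions.length}")

    // Run an action so the coalesced plan is actually executed.
    println(s"rows: ${fewer.count()}")

    sc.stop()
  }
}
```

`coalesce` avoids a shuffle by default, while `repartition` always shuffles; which one is cheaper depends on how unevenly the input is spread across the existing partitions.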

Re: Opening many Parquet files = slow

2015-04-13 Thread Eric Eijkelenboom
. sqlContext.load() took about 30 minutes for 5000 Parquet files on S3, the same as 1.3.0. Any help would be greatly appreciated! Thanks a lot. Eric On 10 Apr 2015, at 16:46, Eric Eijkelenboom eric.eijkelenb...@gmail.com wrote: Hi Ted Ah, I guess the term ‘source’ confused me :) Doing

Opening many Parquet files = slow

2015-04-08 Thread Eric Eijkelenboom
minutes when working with Parquet files. Previously, we used LZO files and did not experience this problem. Bonus info: This also happens when I use auto partition discovery (i.e. sqlContext.parquetFile("/path/to/logsroot/")). What can I do to avoid this? Thanks in advance! Eric Eijkelenboom