Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-19 Thread Jerry Lam
Hi guys,

Does this issue affect 1.2.0 only or all previous releases as well?

Best Regards,

Jerry



Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-17 Thread Yana Kadiyska
Just wondering if you've made any progress on this -- I'm having the same
issue. My attempts to help myself are documented here
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E
.

I don't believe I have the "value scattered through all blocks" issue either:
running with
sc.hadoopConfiguration.set("parquet.task.side.metadata", "false") shows a
much smaller input size for me, and it is the exact same Parquet files being
scanned.
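In PySpark the same Hadoop setting would be reached through the JVM gateway; a hedged sketch (assumes a live SparkContext `sc`; `sc._jsc` is an internal handle, and whether client-side metadata reads are what produce the smaller input size here is exactly the open question in this thread):

```python
# With parquet.task.side.metadata=false, Parquet footers are read on the
# driver ("client side"), so row-group min/max stats can be used to drop
# whole splits before tasks are launched -- which would explain a smaller
# Input size figure on the web UI.
sc._jsc.hadoopConfiguration().set("parquet.task.side.metadata", "false")
```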



Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-07 Thread Cody Koeninger
But Xuelin already posted in the original message that the code was using

SET spark.sql.parquet.filterPushdown=true



Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-07 Thread Xuelin Cao
Yes, the problem is that I've already turned the flag on.

One possible explanation: Parquet supports predicate pushdown by storing
statistical min/max values for each column in each Parquet block (row group).
If, in my test, groupID=10113000 appears in every Parquet block, then no
block's min/max range can exclude the predicate, and the pushdown skips
nothing.

But I'm not quite sure about that, and I don't know whether there is any other
reason that could lead to this.
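A minimal pure-Python sketch of the min/max mechanism described above (the `RowGroup` class and `groups_scanned` helper are invented for illustration; they are not Parquet or Spark APIs): stats-based pushdown can only skip a row group whose [min, max] range excludes the predicate value, so a value that appears in every group defeats it.

```python
class RowGroup:
    """Stand-in for a Parquet row group: raw values plus column min/max stats."""
    def __init__(self, values):
        self.values = values
        self.min = min(values)   # per-column statistics kept in the file footer
        self.max = max(values)

def groups_scanned(row_groups, target):
    """Count row groups whose [min, max] range could contain `target`.
    This mimics min/max predicate pushdown: a group is skipped only when
    the target lies outside its statistics range."""
    return sum(1 for g in row_groups if g.min <= target <= g.max)

# groupID=10113000 scattered into every row group: nothing can be skipped.
scattered = [RowGroup([10113000, 99, 5]),
             RowGroup([1, 10113000, 7]),
             RowGroup([3, 8, 10113000])]
print(groups_scanned(scattered, 10113000))  # → 3 (all groups must be read)

# The same value confined to one row group: two of three groups are skipped.
clustered = [RowGroup([1, 5, 99]),
             RowGroup([10113000, 10113001]),
             RowGroup([3, 7, 8])]
print(groups_scanned(clustered, 10113000))  # → 1
```

This also suggests why layout at write time matters: if the data were sorted or partitioned by groupId before writing, the per-group stats ranges would be selective.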




Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Daniel Haviv
Quoting Michael:
Predicate push down into the input format is turned off by default because
there is a bug in the current Parquet library that throws null pointer
exceptions when there are row groups that are entirely null.

https://issues.apache.org/jira/browse/SPARK-4258

You can turn it on if you want: 
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

Daniel
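For a Spark 1.2-era session, the configuration page above boils down to one setting; a hedged PySpark sketch (assumes an existing `sqlContext` over the `ad` Parquet table from the original post; not a verified recipe):

```python
# Turn on Parquet filter pushdown (off by default in Spark 1.2 because of
# the null-row-group bug tracked in SPARK-4258).
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

# With the flag on, the WHERE predicate can be handed to the Parquet reader,
# which may skip row groups whose column stats exclude groupId = 10113000.
rows = sqlContext.sql(
    "SELECT adId, adTitle FROM ad WHERE groupId = 10113000").collect()
```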



Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Xuelin Cao

Hi,

       I'm testing the Parquet file format, and predicate pushdown is a very
useful feature for us.

       However, it looks like predicate pushdown doesn't work after I set
       sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")

       Here is my SQL:
       sqlContext.sql("select adId, adTitle from ad where groupId=10113000").collect

       Then, I checked the amount of input data on the web UI. The amount of
input data is ALWAYS 80.2M regardless of whether I turn the
spark.sql.parquet.filterPushdown flag on or off.

       I'm not sure whether there is anything I must do when generating the
Parquet file to make predicate pushdown available. (As with ORC: when creating
an ORC file, I need to explicitly sort the field that will be used for
predicate pushdown.)

       Anyone have any idea?

       And, does anyone know the internal mechanism for Parquet predicate
pushdown?

       Thanks
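On the ORC comparison in the question: the same trick plausibly applies to Parquet, because sorting by the filter column before writing gives each row group a narrow, non-overlapping min/max range. A pure-Python sketch of the effect on stats-based skipping (the helpers are invented for illustration, not Parquet APIs):

```python
def make_row_groups(values, group_size):
    """Split values into fixed-size "row groups", keeping (min, max) stats."""
    chunks = (values[i:i + group_size] for i in range(0, len(values), group_size))
    return [(min(c), max(c)) for c in chunks]

def groups_scanned(stats, target):
    """Row groups whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)

values = list(range(0, 9000, 3))      # 3000 distinct column values, sorted
# Interleave low and high halves so every row group spans a wide range.
interleaved = [v for pair in zip(values[:1500], values[1500:]) for v in pair]
target = 4530                         # one concrete value from the column

print(groups_scanned(make_row_groups(interleaved, 300), target))  # → 10 of 10
print(groups_scanned(make_row_groups(values, 300), target))       # → 1 of 10
```

Even with selective stats, whether Spark 1.2 actually consults them for a given scan is a separate question (see the task-side metadata discussion earlier in the thread).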