Just wondering if you've made any progress on this -- I'm having the same
issue. My attempts to help myself are documented here:
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E

I don't believe I have the value-scattered-through-all-blocks issue either:
running with sc.hadoopConfiguration.set("parquet.task.side.metadata","false")
shows a much smaller input size for me, and it is the exact same parquet
files being scanned.
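
For reference, a minimal sketch of the comparison I ran (the table name and
predicate are borrowed from Xuelin's example below; substitute your own):

  // Client-side metadata: row-group filtering happens on the driver, so
  // blocks skipped by the pushed-down filter are not counted as input
  // in the web UI.
  sc.hadoopConfiguration.set("parquet.task.side.metadata", "false")
  sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
  sqlContext.sql("select adId, adTitle from ad where groupId=10113000").collect

  // With the default task-side metadata ("true"), each task reads the
  // footers itself, which may be why the reported input size stays large
  // for the exact same files.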

On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao <xuelincao2...@gmail.com> wrote:

>
> Yes, that's the problem: I've already turned the flag on.
>
> One possible reason for this: parquet supports "predicate pushdown" by
> storing statistical min/max values for each column in every parquet
> block. If, in my test, rows with "groupID=10113000" are scattered across
> all parquet blocks, then the pushdown cannot skip any of them.
>
> But I'm not quite sure about that, and I don't know whether there is any
> other reason that could lead to this.
>
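> As a hypothetical illustration (all numbers made up): a parquet block can
> be skipped only when the predicate value falls outside its [min, max]
> range for the filtered column. A quick sketch in the shell:
>
>   case class BlockStats(min: Long, max: Long)
>   // Scattered: every block's range contains 10113000, so none is skipped.
>   val scattered = Seq(BlockStats(10000000L, 10500000L),
>                       BlockStats(10100000L, 10900000L))
>   // Sorted by groupID: only one block's range contains the value.
>   val sorted = Seq(BlockStats(10000000L, 10113000L),
>                    BlockStats(10113001L, 10900000L))
>   def blocksToRead(blocks: Seq[BlockStats], v: Long): Int =
>     blocks.count(b => b.min <= v && v <= b.max)
>   blocksToRead(scattered, 10113000L)  // 2 -- every block must be read
>   blocksToRead(sorted, 10113000L)     // 1 -- the second block is skipped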
>
> On Wed, Jan 7, 2015 at 10:14 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> But Xuelin already posted in the original message that the code was using
>>
>> SET spark.sql.parquet.filterPushdown=true
>>
>> On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv <danielru...@gmail.com>
>> wrote:
>>
>>> Quoting Michael:
>>> Predicate push down into the input format is turned off by default
>>> because there is a bug in the current parquet library that throws null
>>> pointer exceptions when there are full row groups that are null.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-4258
>>>
>>> You can turn it on if you want:
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
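>>>
>>> For example, either of these should work (a minimal sketch; setConf is
>>> the programmatic counterpart of the SQL SET command):
>>>
>>>   sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
>>>   // or, equivalently, via SQL:
>>>   sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")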
>>>
>>> Daniel
>>>
>>> On 7 בינו׳ 2015, at 08:18, Xuelin Cao <xuelin...@yahoo.com.INVALID>
>>> wrote:
>>>
>>>
>>> Hi,
>>>
>>>        I'm testing the parquet file format, and predicate pushdown is a
>>> very useful feature for us.
>>>
>>>        However, it looks like predicate pushdown doesn't work even after
>>> I set
>>>        sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
>>>
>>>        Here is my sql:
>>>        sqlContext.sql("select adId, adTitle from ad where
>>> groupId=10113000").collect
>>>
>>>        Then, I checked the amount of input data on the web UI. The
>>> amount of input data is ALWAYS 80.2M, regardless of whether I turn the
>>> spark.sql.parquet.filterPushdown flag on or off.
>>>
>>>        I'm not sure if there is anything I must do when *generating* the
>>> parquet file in order to make predicate pushdown work -- like with ORC
>>> files, where I need to explicitly sort on the field that will be used
>>> for predicate pushdown when creating the file. A sketch of what I mean
>>> follows.
>>>
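>>> Something like this (a sketch only -- the source table "ad_src" and the
>>> output path are made up, and this assumes Spark 1.2 APIs); sorting on
>>> the filter column before writing should cluster each groupId into a few
>>> row groups, so the min/max statistics become selective:
>>>
>>>   val sortedAds = sqlContext.sql(
>>>     "select adId, adTitle, groupId from ad_src order by groupId")
>>>   sortedAds.saveAsParquetFile("/data/ad_sorted.parquet")
>>>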
>>>        Anyone have any idea?
>>>
>>>        And does anyone know the internal mechanism of parquet predicate
>>> pushdown?
>>>
>>>        Thanks
>>>
