Just wondering if you've made any progress on this -- I'm having the same issue. My attempts to help myself are documented here http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E .
I don't believe I have the "value scattered through all blocks" issue either, as running with sc.hadoopConfiguration.set("parquet.task.side.metadata", "false") shows a much smaller Input size for me, and it is the exact same parquet files being scanned.

On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao <xuelincao2...@gmail.com> wrote:

> Yes, the problem is, I've turned the flag on.
>
> One possible reason for this is that the parquet file supports "predicate
> pushdown" by storing statistical min/max values for each column on parquet
> blocks. If, in my test, the value groupId=10113000 is scattered across all
> parquet blocks, then the predicate pushdown fails.
>
> But I'm not quite sure about that. I don't know whether there is any
> other reason that could lead to this.
>
> On Wed, Jan 7, 2015 at 10:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> But Xuelin already posted in the original message that the code was using
>>
>> SET spark.sql.parquet.filterPushdown=true
>>
>> On Wed, Jan 7, 2015 at 12:42 AM, Daniel Haviv <danielru...@gmail.com> wrote:
>>
>>> Quoting Michael:
>>> Predicate push down into the input format is turned off by default
>>> because there is a bug in the current parquet library that causes null
>>> pointer exceptions when there are full row groups that are null.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-4258
>>>
>>> You can turn it on if you want:
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration
>>>
>>> Daniel
>>>
>>> On Jan 7, 2015, at 08:18, Xuelin Cao <xuelin...@yahoo.com.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> I'm testing the parquet file format, and predicate pushdown is a
>>> very useful feature for us.
>>>
>>> However, it looks like predicate pushdown doesn't work after I set
>>>
>>> sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
>>>
>>> Here is my sql:
>>>
>>> sqlContext.sql("select adId, adTitle from ad where groupId=10113000").collect
>>>
>>> Then I checked the amount of input data on the web UI. But the
>>> amount of input data is ALWAYS 80.2M, regardless of whether I turn the
>>> spark.sql.parquet.filterPushdown flag on or off.
>>>
>>> I'm not sure if there is anything I must do when generating the
>>> parquet file in order to make predicate pushdown work. (Like an ORC
>>> file: when creating the ORC file, I need to explicitly sort the field
>>> that will be used for predicate pushdown.)
>>>
>>> Does anyone have any idea?
>>>
>>> And does anyone know the internal mechanism of parquet predicate
>>> pushdown?
>>>
>>> Thanks
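To make Xuelin's "scattered value" hypothesis concrete, here is a toy sketch (in Python, not the real Parquet reader API; all names and statistics are made up for illustration) of how row-group skipping with per-column min/max statistics works. If the data is clustered on groupId, a predicate like groupId=10113000 lets the reader skip most row groups; if every row group's min/max range covers that value, nothing can be skipped, and the input size on the web UI stays the same with pushdown on or off:

```python
# Toy model of Parquet row-group pruning using column min/max statistics.
# This is illustrative only; real readers use the footer metadata of the file.

def prune_row_groups(row_groups, column, value):
    """Keep only row groups whose [min, max] range for `column` could
    contain `value`; the rest are skipped without reading any data."""
    kept = []
    for rg in row_groups:
        lo, hi = rg["stats"][column]
        if lo <= value <= hi:
            kept.append(rg)
    return kept

# Case 1: data clustered by groupId -> disjoint ranges, most groups skipped.
clustered = [
    {"rows": 1000, "stats": {"groupId": (10000000, 10099999)}},
    {"rows": 1000, "stats": {"groupId": (10100000, 10199999)}},
    {"rows": 1000, "stats": {"groupId": (10200000, 10299999)}},
]
print(len(prune_row_groups(clustered, "groupId", 10113000)))  # 1

# Case 2: the value falls inside every row group's range -> nothing skipped,
# so pushdown is enabled but the scanned input size does not shrink.
scattered = [
    {"rows": 1000, "stats": {"groupId": (10000000, 10299999)}},
    {"rows": 1000, "stats": {"groupId": (10000000, 10299999)}},
    {"rows": 1000, "stats": {"groupId": (10000000, 10299999)}},
]
print(len(prune_row_groups(scattered, "groupId", 10113000)))  # 3
```

This is also why sorting (or clustering) the data on the filtered column before writing the parquet file, much like the ORC case mentioned above, tends to make the min/max ranges narrow and disjoint, and pushdown effective.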