Re: Apache Spark orc read performance when reading large number of small files

Jörn Franke Thu, 01 Nov 2018 00:20:45 -0700

A large number of small files is inefficient in itself, and predicate pushdown 
will not help you much there unless you merge them into one large file (one 
large file can be processed much more efficiently).
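
As a rough illustration of the merge step, here is a minimal sketch that rewrites the small ORC files into fewer, larger ones. The output path, the 256 MB target size, and the names CompactOrc and targetFileCount are assumptions made for this example and are not taken from the original post; adjust them to your cluster.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object CompactOrc {
  // Number of output files so that each is roughly targetBytes (rounded up, at least 1).
  def targetFileCount(totalBytes: Long, targetBytes: Long): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-orc").getOrCreate()

    val in  = "hdfs://test/date=201810"            // input path from the original post
    val out = "hdfs://test_compacted/date=201810"  // hypothetical output location

    // Total size of the input directory, used to pick the number of output files.
    val inPath = new Path(in)
    val fs = inPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(inPath).getLength

    // Assumed target of about 256 MB per output file.
    val n = targetFileCount(totalBytes, 256L * 1024 * 1024)

    // coalesce only reduces the partition count; it never increases it.
    spark.read.orc(in).coalesce(n).write.mode("overwrite").orc(out)

    spark.stop()
  }
}
```

Writing to a separate location and swapping it in afterwards avoids overwriting the directory that is still being read.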


How did you validate that predicate pushdown did not work on Hive? Your Hive 
version is also very old; consider upgrading to at least Hive 2.x.
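
One way to check this from Spark is to look at the physical plan. The sketch below is only an assumption about how one might do it: the object name CheckPushdown and the helper hasPushedFilters are made up for this example, and the exact plan text differs between Spark versions. As far as I know, a scan through the native file source lists PartitionFilters and PushedFilters entries, while a HiveTableScan node suggests the table is being read through the Hive SerDe path. Also note that a filter on the partition column (date here) would normally show up as partition pruning rather than as an ORC pushed filter.

```scala
import org.apache.spark.sql.SparkSession

object CheckPushdown {
  // True if the plan text contains at least one non-empty "PushedFilters: [...]" entry.
  def hasPushedFilters(plan: String): Boolean =
    """PushedFilters: \[[^\]]+\]""".r.findFirstIn(plan).isDefined

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("check-pushdown")
      .enableHiveSupport()
      .getOrCreate()

    // Query from the original post.
    val logs = spark.sql("select * from test where date > 20181001")

    // Print the parsed, analyzed, optimized and physical plans for manual inspection.
    logs.explain(true)

    // Crude text check on the physical plan; treat the result as a hint only.
    val plan = logs.queryExecution.executedPlan.toString
    println(s"non-empty PushedFilters found: ${hasPushedFilters(plan)}")

    spark.stop()
  }
}
```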

> On 31.10.2018 at 20:35, gpatcham <gpatc...@gmail.com> wrote:
> 
> spark version 2.2.0
> Hive version 1.1.0
> 
> There are a lot of small files.
> 
> Spark code:
> 
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
> 
> val logs = spark.read.schema(schema).orc("hdfs://test/date=201810")
>   .filter("date > 20181003")
> 
> Hive:
> 
> "spark.sql.orc.enabled": "true",
> "spark.sql.orc.filterPushdown": "true"
> 
> The test table in Hive points to hdfs://test/ and is partitioned on date.
> 
> val sqlStr = s"select * from test where date > 20181001"
> val logs = spark.sql(sqlStr)
> 
> With the Hive query I don't see filter pushdown happening. I tried setting
> these configs both in hive-site.xml and via spark.sqlContext.setConf:
> 
> "hive.optimize.ppd":"true",
> "hive.optimize.ppd.storage":"true" 
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

