I had a high-level look at the code. It seems the getHiveQlPartitions method from HiveMetastoreCatalog is called irrespective of the metastorePartitionPruning conf value.

It should not fetch all partitions if we set metastorePartitionPruning to true (the default value is false):

  def getHiveQlPartitions(predicates: Seq[Expression] = Nil): Seq[Partition] = {
    val rawPartitions = if (sqlContext.conf.metastorePartitionPruning) {
      table.getPartitions(predicates)
    } else {
      allPartitions
    }

  ...

  def getPartitions(predicates: Seq[Expression]): Seq[HivePartition] =
    client.getPartitionsByFilter(this, predicates)

  lazy val allPartitions = table.getAllPartitions

But somehow getAllPartitions is still being called even after setting metastorePartitionPruning to true.

Am I missing something, or am I looking in the wrong place?

On Tue, Jan 17, 2017 at 4:01 PM, Raju Bairishetti <r...@apache.org> wrote:

> Hello,
>
> Spark SQL is generating a query plan with information for all partitions,
> even when we apply filters on the partition columns in the query. Because
> of this, the Spark driver/Hive metastore is hitting OOM, as each table has
> a lot of partitions.
>
> We can confirm from the Hive audit logs that it tries to
> *fetch all partitions* from the Hive metastore.
>
> 2016-12-28 07:18:33,749 INFO [pool-4-thread-184]: HiveMetaStore.audit
> (HiveMetaStore.java:logAuditEvent(371)) - ugi=rajub ip=/x.x.x.x
> cmd=get_partitions : db=xxxx tbl=xxxxx
>
> We configured the following parameters in the Spark conf to fix the above
> issue (source: Spark JIRA and GitHub pull requests):
>
> spark.sql.hive.convertMetastoreParquet false
> spark.sql.hive.metastorePartitionPruning true
>
> plan: rdf.explain
> == Physical Plan ==
> HiveTableScan [rejection_reason#626], MetastoreRelation dbname,
> tablename, None, [(year#314 = 2016),(month#315 = 12),(day#316 =
> 28),(hour#317 = 2),(venture#318 = DEFAULT)]
>
> With these settings, the *get_partitions_by_filter* method is called and
> fetches only the required partitions.
>
> But we are frequently seeing parquetDecode errors in our applications
> after this. It looks like these decoding errors are caused by changing the
> serde from the Spark built-in one to the Hive serde.
>
> I feel that *fixing query plan generation in Spark SQL* is the right
> approach, instead of forcing users to use the Hive serde.
>
> Is there any workaround or way to fix this issue? I would like to hear more
> thoughts on this :)
>
> On Tue, Jan 17, 2017 at 4:00 PM, Raju Bairishetti <r...@apache.org> wrote:
>
>> [...]
>>
>> On Mon, Jan 16, 2017 at 12:53 PM, Raju Bairishetti <r...@apache.org>
>> wrote:
>>
>>> Waiting for suggestions/help on this...
>>>
>>> On Wed, Jan 11, 2017 at 12:14 PM, Raju Bairishetti <r...@apache.org>
>>> wrote:
>>>
>>>> [...]

--

------
Thanks,
Raju Bairishetti,
www.lazada.com
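
P.S. To make the question about getAllPartitions easier to reason about outside Spark, here is a minimal, self-contained sketch of the same branch logic and of how a Scala lazy val behaves. FakeMetastoreClient, FakeRelation and the partition strings are hypothetical stand-ins and not Spark or Hive classes. The sketch does not show where (or whether) Spark touches allPartitions elsewhere; it only illustrates that a lazy val is computed on first reference by any caller, even when the pruned branch is the one taken.

```scala
// Hypothetical stand-in for the metastore client; it only counts calls.
final class FakeMetastoreClient {
  private val partitions = Seq("day=27", "day=28", "day=29")
  var getAllCalls = 0
  var getByFilterCalls = 0

  // Plays the role of the unfiltered get_partitions call.
  def getAllPartitions(): Seq[String] = {
    getAllCalls += 1
    partitions
  }

  // Plays the role of get_partitions_by_filter.
  def getPartitionsByFilter(predicate: String => Boolean): Seq[String] = {
    getByFilterCalls += 1
    partitions.filter(predicate)
  }
}

// Hypothetical stand-in for the relation; mirrors the if/else quoted in this mail.
final class FakeRelation(client: FakeMetastoreClient, metastorePartitionPruning: Boolean) {
  // Computed at most once, and only when something first references it.
  lazy val allPartitions: Seq[String] = client.getAllPartitions()

  // The elided remainder ("...") of the real method is not reproduced here.
  def getHiveQlPartitions(predicate: String => Boolean): Seq[String] =
    if (metastorePartitionPruning) client.getPartitionsByFilter(predicate)
    else allPartitions
}

object LazyPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val client = new FakeMetastoreClient
    val relation = new FakeRelation(client, metastorePartitionPruning = true)

    // The pruned branch never references the lazy val.
    val pruned = relation.getHiveQlPartitions(_ == "day=28")
    println(s"pruned=$pruned getAll=${client.getAllCalls} byFilter=${client.getByFilterCalls}")

    // Any other reference to the lazy val forces the full fetch.
    val total = relation.allPartitions.size
    println(s"total=$total getAll=${client.getAllCalls} byFilter=${client.getByFilterCalls}")
  }
}
```

In this sketch the first println reports getAll=0 and the second reports getAll=1, so if the audit log still shows get_partitions with pruning enabled, one thing to check would be whether some other code path references allPartitions (or calls getAllPartitions directly). That is an assumption to verify against the Spark source, not a confirmed cause.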