[ 
https://issues.apache.org/jira/browse/SPARK-10673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005249#comment-15005249
 ] 

Xin Wu edited comment on SPARK-10673 at 11/14/15 8:19 AM:
----------------------------------------------------------

If the default is false, then
{code}
if (!sc.conf.verifyPartitionPath) {
    partitionToDeserializer
}
{code}
returns the input map unchanged, so execution never reaches the code path you 
mentioned. 

The problem arises when the property is set to true: execution then enters 
the code path that potentially evaluates every partition directory of the 
table that matches pathPatternStr. Because pathPatternStr is computed as 
"/pathToTable/\*/\*/..", with one wildcard per partition column, the desired 
partition path is validated against all existing partition paths, including 
nested directories, of which there may be many. 
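
To illustrate why the glob is expensive, here is a minimal sketch of how such 
a pattern could be derived and what it matches. It is not the code in 
TableReader.scala: the object PartitionGlobSketch and the helpers 
globForPartition and matchesGlob are hypothetical names, and the matching is 
done on plain strings rather than through the Hadoop FileSystem API. 

```scala
object PartitionGlobSketch {
  // Hypothetical helper: replace the last `numPartitionCols` path segments
  // of a partition directory with "*" wildcards.
  def globForPartition(partitionPath: String, numPartitionCols: Int): String = {
    require(numPartitionCols > 0, "expected at least one partition column")
    val segments = partitionPath.stripSuffix("/").split("/").toVector
    val tableRoot = segments.dropRight(numPartitionCols).mkString("/")
    val wildcards = Vector.fill(numPartitionCols)("*").mkString("/")
    s"$tableRoot/$wildcards"
  }

  // Simplified stand-in for a filesystem glob: both paths must have the same
  // number of segments, and "*" matches any single segment.
  def matchesGlob(glob: String, path: String): Boolean = {
    val g = glob.split("/")
    val p = path.stripSuffix("/").split("/")
    g.length == p.length && g.zip(p).forall { case (gs, ps) => gs == "*" || gs == ps }
  }
}
```

With the directory layout from the issue description, the pattern derived 
from a single year=2015 partition also matches the year=2014 directories, so 
every directory at that depth under the table root is listed even though only 
the registered partitions need to be checked. 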

To avoid this potential performance issue, I think we may be able to simplify 
the code in the else block of the verifyPartitionPath() function. 
I am working on a fix.  



> spark.sql.hive.verifyPartitionPath Attempts to Verify Unregistered Partitions
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-10673
>                 URL: https://issues.apache.org/jira/browse/SPARK-10673
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0, 1.5.0
>            Reporter: Miklos Christine
>            Priority: Minor
>
> In Spark 1.4, spark.sql.hive.verifyPartitionPath was set to true by default. 
> In Spark 1.5, it is now set to false by default. 
> If a table has a lot of partitions in the underlying filesystem, the code 
> unnecessarily checks for all the underlying directories when executing a 
> query. 
> https://github.com/apache/spark/blob/v1.5.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L162
> Structure:
> {code}
> /user/hive/warehouse/table1/year=2015/month=01/
> /user/hive/warehouse/table1/year=2015/month=02/
> /user/hive/warehouse/table1/year=2015/month=03/
> ...
> /user/hive/warehouse/table1/year=2014/month=01/
> /user/hive/warehouse/table1/year=2014/month=02/
> {code}
> If the registered partitions only contain year=2015 when you run "show 
> partitions table1", this code path checks for all directories under the 
> table's root directory. This incurs a significant performance penalty if 
> there are a lot of partition directories. 


