[ 
https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272002#comment-14272002
 ] 

Yin Huai commented on SPARK-5182:
---------------------------------

Here is the doc from [~marmbrus].

Partitioning data by one or more columns is a very important optimization for 
many analytic workloads.  Right now, the implementation of partitioning in the 
Data Sources API suffers from several shortcomings.
First, each data source must implement the support on its own leading to code 
duplication.  This duplication applies both to the code of discovering / 
cataloging partitions, but also to the code required to evaluate predicates 
against a given partitions. 
Second, only a limited set of predicates are pushed down and so partitioning 
misses opportunities to prune.  While we can continue to expand the set of 
filters, however, this does not solve the problem that each data source would 
still need to implement its own version of expression evaluation for each new 
(Filter x DataType).

Requirements for the new API:
* Built in support for telling a data source which partitions it should read 
based on arbitrary predicates (including things like UDFS).
* Support for multiple levels of nested directories that store data based on 
partitioning attributes (e.g, /table/col1=a/col2=b).
* Rapid auto-discovery of large numbers of partitions.
* Discovery of partition column types using schema inference similar to JSON.
* Support for user defined partitioning schemes? (i.e. /table/2001/02/03)

Proposed interface:
{code}
case class Partition(values: Row, path: String)
case class PartitionSpec(
    partitionColumns: StructType, 
    partitions: Array[Partition])

class PartitionedRelation {
  // Has default implementation
  def parsePartitions(paths: Array[String]): PartitionSpec 

  def basePath: String

  def buildScan(
      partitions: Array[Partition], 
      requiredColumns: Array[String], 
      filters: Array[Filter]): RDD[Row]
}
{code}
Open Questions:
* Is it okay to store all of the partition metadata in-memory initially? Or 
should we consider storing this data locally to something like BDB?
* Should we be using metastore partitioning instead?


> Partitioning support for tables created by the data source API
> --------------------------------------------------------------
>
>                 Key: SPARK-5182
>                 URL: https://issues.apache.org/jira/browse/SPARK-5182
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to