[
https://issues.apache.org/jira/browse/SPARK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563756#comment-15563756
]
Tejas Patil commented on SPARK-17487:
-------------------------------------
[~rxin] : Since Spark native tables and Hive tables follow different naming
conventions for bucketed files, I want to make Spark produce Hive-compatible
file names when the output is a Hive bucketed table. Currently, the read and
write code paths assume that everything follows Spark's native bucketing
scheme, e.g. [1]. I think after SPARK-16879, it's possible to recognise whether
a given `CatalogTable` corresponds to a Hive table [0]. So the change here
would be to propagate that information downstream to where:
- planning assigns splits for tasks to read. Each task should process a single
bucket, and for a join the same bucket of the two tables should be read by a
given task.
- the final output files are written out.
Similar handling would be needed for hashing. I will avoid auto-detecting the
hash and bucketing scheme from the actual file name and will instead rely on
which `Catalog` a table comes from, as that seems more robust.
[0] :
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L137
[1] :
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala#L31
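To make the naming-scheme difference concrete, here is a minimal sketch of what a pluggable bucket-id extractor could look like. The trait and object names (`BucketIdExtractor` and friends) are hypothetical, not the actual Spark API; the Spark-side regex mirrors the one in `BucketingUtils` [1], and the Hive-side pattern assumes Hive's usual `000003_0` style of bucket file names (bucket id first, then an attempt suffix).

```scala
// Hypothetical pluggable extractor; names are illustrative, not Spark's API.
trait BucketIdExtractor {
  def getBucketId(fileName: String): Option[Int]
}

// Spark encodes the bucket id as an "_NNNNN" suffix before the file
// extension, e.g. "part-00000-<uuid>_00003.snappy.parquet".
object SparkBucketIdExtractor extends BucketIdExtractor {
  // Mirrors the pattern used in BucketingUtils: last "_<digits>" before
  // an optional extension.
  private val pattern = """.*_(\d+)(?:\..*)?$""".r
  def getBucketId(fileName: String): Option[Int] = fileName match {
    case pattern(id) => Some(id.toInt)
    case _           => None
  }
}

// Hive bucketed files look like "000003_0": zero-padded bucket id first,
// then an attempt id. Parsing these with Spark's regex would yield the
// wrong bucket id, which is why the scheme must be chosen per catalog
// rather than guessed from the file name.
object HiveBucketIdExtractor extends BucketIdExtractor {
  private val pattern = """^(\d+)_\d+.*$""".r
  def getBucketId(fileName: String): Option[Int] = fileName match {
    case pattern(id) => Some(id.toInt)
    case _           => None
  }
}
```

With this shape, the read path could pick the extractor based on which `Catalog` the table comes from, matching the approach described above.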
> Configurable bucketing info extraction
> --------------------------------------
>
> Key: SPARK-17487
> URL: https://issues.apache.org/jira/browse/SPARK-17487
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Tejas Patil
> Priority: Minor
>
> Spark uses a specific way to name bucketed files which is different from
> Hive's bucketed file naming scheme. For Spark to support bucketing for
> Hive tables, there needs to be a pluggable way to switch between these
> naming schemes.