passing clustering spec to datasource v2

suds Tue, 26 Nov 2019 11:18:22 -0800

I looked at the open issue and discussion around sort spec:
https://github.com/apache/incubator-iceberg/issues/317


For now we have added a sort spec external to Iceberg and made it work by
adding extra logic that sorts the DataFrame before writing to the Iceberg
table (it's a hack until the above issue is resolved).
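
A minimal sketch of that kind of workaround, not our exact code. The helper name writeSorted, the sortCols parameter and the path-based table location are hypothetical, and it assumes the Iceberg Spark runtime is on the classpath so that format("iceberg") resolves:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: sort the rows of each partition by the externally
// stored sort columns, then append them to an Iceberg table at tablePath.
def writeSorted(df: DataFrame, sortCols: Seq[String], tablePath: String): Unit = {
  df.sortWithinPartitions(sortCols.map(col): _*)
    .write
    .format("iceberg")
    .mode("append")
    .save(tablePath)
}
```

This only controls the order of rows inside each written file. Nothing in the table metadata records that order, which is why the reader side cannot rely on it.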

I am trying to see if I can use the sorted data to somehow hint to the join
operation that the data is presorted.

The v1 datasource API can pass a bucketSpec, and Hive and Spark bucketed
tables use this feature so that a join can use sort-merge join with no
additional sort step.
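
For comparison, here is a rough sketch of the v1 bucketed-table path using the standard DataFrameWriter bucketBy/sortBy calls. The table names t1 and t2, the sample data, the bucket count and the local session are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object BucketedJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Disable broadcast joins so the tiny sample tables are planned
    // with a sort-merge join instead of a broadcast hash join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val left = Seq((1, "a"), (2, "b")).toDF("col1", "v1")
    val right = Seq((1, "x"), (2, "y")).toDF("col1", "v2")

    // bucketBy/sortBy are only supported together with saveAsTable.
    left.write.bucketBy(4, "col1").sortBy("col1").mode("overwrite").saveAsTable("t1")
    right.write.bucketBy(4, "col1").sortBy("col1").mode("overwrite").saveAsTable("t2")

    // Inspect the physical plan for Exchange and Sort nodes under the join.
    spark.table("t1").join(spark.table("t2"), "col1").explain()

    spark.stop()
  }
}
```

With matching bucket specs on both sides the shuffle can be skipped. Whether the Sort node also disappears depends on the Spark version and on how many files each bucket ends up with, so it is worth checking the printed plan rather than assuming it.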

class HadoopFsRelation(
    location: FileIndex,
    partitionSchema: StructType,
    dataSchema: StructType,
    bucketSpec: Option[BucketSpec],
    fileFormat: FileFormat,
    options: Map[String, String])(val sparkSession: SparkSession)
  extends BaseRelation with FileRelation


Has anyone on this forum looked into the V2 API and how a similar hint can
be passed? I can work on a proof-of-concept PR for sort spec, but I am not
able to find support for sort spec in the V2 API.

I also tried another hack using the following code. The plan shows that
SortMergeJoin is used, but for some reason the sort within partitions is
taking too long (assuming Spark uses TimSort, I was expecting it to be
close to a no-op on presorted data).

import org.apache.spark.sql.functions.col

val df1 = readIcebergTable("table1").sortWithinPartitions(col("col1")).cache()
val df2 = readIcebergTable("table2").sortWithinPartitions(col("col1")).cache()

val finalDF = df1.join(df2, df1("col1") === df2("col1"))


Any suggestions for making the join work without the additional sort?


--
Thanks
