[
https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830973#comment-17830973
]
Vinaykumar Bhat edited comment on HUDI-7117 at 3/26/24 2:54 PM:
----------------------------------------------------------------
This is likely not an issue, but a gap in understanding the feature.
The problem is that
{{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}}
creates a temporary view (similar to the one created by {{CREATE
TEMPORARY VIEW ...}}), which is neither a catalog table nor a Hudi-managed table.
Hence the subsequent {{CREATE INDEX ...}} statement, which tries to create a
functional index, fails because the object it targets is not a Hudi-managed table.
Instead of creating a temporary view, one can use the {{saveAsTable(...)}} method
on the DataFrameWriter object to create a Hudi-managed table and then create the
functional index on that table. An example follows:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Assumption: tableName was not defined in the original snippet; it is set
// here to match the name passed to saveAsTable below.
val tableName = "temp_table"

val columns = Seq("ts", "transaction_id", "rider", "driver", "price", "location")
val data = Seq(
  (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
  (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
  (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
  (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
  (1695115999911L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"))
val inserts = spark.createDataFrame(data).toDF(columns: _*)

// Write with saveAsTable (rather than save + createOrReplaceTempView) so the
// result is registered in the catalog as a table.
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")

// Confirm that temp_table shows up in the catalog.
spark.catalog.listTables().show(false)

// Create the functional index on the catalog table.
spark.sql("""CREATE INDEX hudi_table_func_index_datestr ON temp_table USING
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')""")
{code}
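To check which kind of object a name resolves to before running {{CREATE INDEX}}, the Spark catalog can be queried. The following is a minimal sketch, assuming a live {{SparkSession}}; the helper name {{isTempView}} is hypothetical and is not part of Hudi or Spark:
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical helper (not a Hudi or Spark API): returns true when `name`
// resolves to a temporary view rather than a catalog table.
// Catalog.getTable throws an AnalysisException if the name cannot be resolved.
def isTempView(spark: SparkSession, name: String): Boolean =
  spark.catalog.getTable(name).isTemporary
{code}
A name registered with {{createOrReplaceTempView}} should report {{true}} here, whereas a table written with {{saveAsTable}} should report {{false}}.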
> Functional index creation not working when table is created using datasource
> writer
> -----------------------------------------------------------------------------------
>
> Key: HUDI-7117
> URL: https://issues.apache.org/jira/browse/HUDI-7117
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Reporter: Aditya Goenka
> Assignee: Vinaykumar Bhat
> Priority: Blocker
> Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Details and Reproducible code under Github Issue -
> [https://github.com/apache/hudi/issues/10110]
>