Sagar Sumit created HUDI-4374:
---------------------------------
Summary: Support BULK_INSERT row-writing on streaming Dataset/DataFrame
Key: HUDI-4374
URL: https://issues.apache.org/jira/browse/HUDI-4374
Project: Apache Hudi
Issue Type: Task
Reporter: Sagar Sumit
Assignee: Sagar Sumit
Fix For: 0.12.0
In a Structured Streaming setup, when a Hudi table is written from a streaming
source, HoodieStreamingSink calls HoodieSparkSqlWriter.write(). If the
operation type is BULK_INSERT, HoodieSparkSqlWriter.write() internally calls
HoodieSparkSqlWriter.bulkInsertAsRow(), which does a simple
df.write.format("hudi").options(...).save(). The 'write' call does not work on
a streaming Dataset/DataFrame and fails with:
{code:java}
org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.Dataset.write(Dataset.scala:3377)
    at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:557)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:178)
    at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:91)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:90)
    at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:166)
    at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:89)
{code}
Bulk insert can still be done on a streaming source by bypassing the
row-writing path. However, we need to fix HoodieStreamingSink so that it also
supports bulk insert via row-writing.
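
Until that fix lands, the workaround mentioned above could look roughly like the
sketch below. It only builds the writer options. It assumes that the
row-writing path is selected by the hoodie.datasource.write.row.writer.enable
option, so that setting it to "false" keeps BULK_INSERT away from
bulkInsertAsRow(). The class name, method name, and table name are hypothetical
and used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hudi: collects the options a streaming
// writer would pass to Hudi to bulk insert without the row-writing path.
class StreamingBulkInsertOptions {

    // Returns options for a bulk insert with row-writing disabled.
    static Map<String, String> build(String tableName) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("hoodie.table.name", tableName);
        // Keep the BULK_INSERT operation type.
        options.put("hoodie.datasource.write.operation", "bulk_insert");
        // Assumption: this flag is what routes BULK_INSERT through
        // bulkInsertAsRow(); disabling it avoids the df.write call.
        options.put("hoodie.datasource.write.row.writer.enable", "false");
        return options;
    }
}
```

The resulting map would then be passed to the streaming writer, for example
df.writeStream().format("hudi").options(StreamingBulkInsertOptions.build("my_table")),
together with the usual record key, precombine, and checkpoint settings, which
are omitted here.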