Hi, I have the following code with which I am trying to bulk insert a huge CSV file (loaded into a Spark DataFrame) into a Hudi table, but it fails saying the column review_date was not found, even though that column is definitely present in the DataFrame. Please guide.
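For context, df1 is built roughly like this (simplified sketch; the path, read options, and the way the year column is derived are placeholders, not my exact code):

import org.apache.spark.sql.functions.{col, to_date, year}

// Read the raw CSV (path and options are placeholders)
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/amazon_reviews/*.csv")

// Derive the "year" partition column from the review date (assumed)
val df1 = raw.withColumn("year", year(to_date(col("review_date"))))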
df1.write
  .format("com.uber.hoodie")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, HoodieTableType.COPY_ON_WRITE.name())
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL) // bulk insert
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "customer_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "review_date")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test_table")
  .mode(SaveMode.Overwrite)
  .save("/tmp/hoodie/test_hoodie")
Caused by: com.uber.hoodie.exception.HoodieException: review_date(Part -review_date) field not found in record. Acceptable fields were :[marketplace, customer_id, review_id, product_id, product_parent, product_title, product_category, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_headline, review_body, review_date, year]
  at com.uber.hoodie.DataSourceUtils.getNestedFieldValAsString(DataSourceUtils.java:79)
  at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:93)
  at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:92)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
  at org.apache.spark.scheduler.Task.run(Task.scala:112)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
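What confuses me is that review_date is actually listed among the "Acceptable fields" in the error message itself. Here is a quick sanity check I can run on the DataFrame before the write (a sketch, assuming df1 as above); my guess is that null values in the precombine field might matter here, but that is unconfirmed:

import org.apache.spark.sql.functions.col

// Confirm review_date really is in the schema
df1.printSchema()

// Count rows where the precombine field is null -- just my guess that a
// null value here could surface as a "field not found" error (unconfirmed)
val nullReviewDates = df1.filter(col("review_date").isNull).count()
println(s"rows with null review_date: $nullReviewDates")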