Hi, review_date(Part -review_date) field not found in record
Seems like the precombine field is not in the input DF? Can you try doing df1.printSchema and check that once?

On Sat, Mar 9, 2019 at 11:52 AM Umesh Kacha <[email protected]> wrote:

> Hi, I have the following code, with which I am trying to bulk insert a huge
> CSV file loaded into a Spark DataFrame. It fails saying column review_date
> was not found, but that column is definitely in the DataFrame. Please guide.
>
>     df1.write
>       .format("com.uber.hoodie")
>       .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, HoodieTableType.COPY_ON_WRITE.name())
>       .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL) // insert
>       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "customer_id")
>       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year")
>       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "review_date")
>       .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test_table")
>       .mode(SaveMode.Overwrite)
>       .save("/tmp/hoodie/test_hoodie")
>
>     Caused by: com.uber.hoodie.exception.HoodieException: review_date(Part -review_date) field not found in record.
>     Acceptable fields were :[marketplace, customer_id, review_id, product_id, product_parent,
>     product_title, product_category, star_rating, helpful_votes, total_votes,
>     vine, verified_purchase, review_headline, review_body, review_date, year]
>       at com.uber.hoodie.DataSourceUtils.getNestedFieldValAsString(DataSourceUtils.java:79)
>       at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:93)
>       at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:92)
>       at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>       at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>       at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
>       at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
>       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>       at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
>       at org.apache.spark.scheduler.Task.run(Task.scala:112)
>       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>
> Command took 4.45 seconds -- by [email protected] at 3/10/2019, 1:17:42 AM on Spark_Hudi
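As a quick sanity check along the lines of the printSchema suggestion, something like the sketch below can be run before the write. It only assumes the df1 DataFrame and the column names ("customer_id", "year", "review_date") from the snippet above, and simply confirms that the configured key, partition path, and precombine columns exist as top-level fields in the schema, so a mismatch (e.g. a stray character or different casing picked up from the CSV header) shows up before the Hudi writer fails:

    // Print the schema so the exact column names and types are visible.
    df1.printSchema()

    // Fields Hudi is configured to read from each record in the write above.
    val requiredFields = Seq("customer_id", "year", "review_date")

    // Fail fast with a clear message if any of them is missing from the DataFrame.
    val missing = requiredFields.filterNot(c => df1.columns.contains(c))
    require(missing.isEmpty, s"Missing columns in input DataFrame: ${missing.mkString(", ")}")

This does nothing Hudi-specific; it just surfaces a schema/config mismatch up front instead of deep inside the shuffle stage where the HoodieException above is thrown.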
