Hmm, that's interesting. I can see that the parsing works, since the
exception said "Part - review_date", and there are definitely users who
have done this before, so I'm not sure what's going on.
Can you paste the generated Avro schema? The following is the
corresponding code line:
log.info(s"Registered avro schema : ${schema.toString(true)}")
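
If pulling it out of the logs is awkward, something along these lines should
produce a comparable schema from the driver (a rough sketch, assuming
Spark 2.4+ with the org.apache.spark:spark-avro module on the classpath;
Hudi's own conversion may differ slightly in record name/namespace):

    import org.apache.spark.sql.avro.SchemaConverters

    // Convert the DataFrame's Catalyst schema to an Avro schema and
    // pretty-print it, to eyeball whether review_date survives conversion.
    val avroSchema = SchemaConverters.toAvroType(df1.schema, nullable = false)
    println(avroSchema.toString(true))
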
Maybe create a gist (gist.github.com) for easier sharing of the
code/stack trace?
Thanks
Vinoth
On Sat, Mar 9, 2019 at 1:33 PM Umesh Kacha <[email protected]> wrote:
> Hi Vinoth, thanks, I have already done that and checked it; please see the
> review_date column in the schema output below.
>
> root
>  |-- marketplace: string (nullable = true)
>  |-- customer_id: string (nullable = true)
>  |-- review_id: string (nullable = true)
>  |-- product_id: string (nullable = true)
>  |-- product_parent: string (nullable = true)
>  |-- product_title: string (nullable = true)
>  |-- product_category: string (nullable = true)
>  |-- star_rating: string (nullable = true)
>  |-- helpful_votes: string (nullable = true)
>  |-- total_votes: string (nullable = true)
>  |-- vine: string (nullable = true)
>  |-- verified_purchase: string (nullable = true)
>  |-- review_headline: string (nullable = true)
>  |-- review_body: string (nullable = true)
>  |-- review_date: string (nullable = true)
>  |-- year: integer (nullable = true)
>
> On Sun, Mar 10, 2019 at 2:27 AM Vinoth Chandar <[email protected]> wrote:
>
> > Hi,
> >
> > >> review_date(Part -review_date) field not found in record
> >
> > Seems like the precombine field is not in the input DF? Can you try doing
> > df1.printSchema and check that once?
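> >
> > A quick programmatic check works too; this is plain Spark, nothing
> > Hudi-specific, so treat it as an illustrative snippet:
> >
> >     // Print the exact field names Spark sees, to rule out stray
> >     // whitespace or casing differences from the CSV header.
> >     df1.schema.fieldNames.foreach(println)
> >     println(df1.schema.fieldNames.contains("review_date"))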
> >
> > On Sat, Mar 9, 2019 at 11:52 AM Umesh Kacha <[email protected]> wrote:
> >
> >
> > > Hi, I have the following code, which I am using to bulk insert a huge
> > > CSV file loaded into a Spark DataFrame, but it fails saying the column
> > > review_date was not found, even though that column is definitely there
> > > in the DataFrame. Please guide.
> > >
> > > df1.write
> > >   .format("com.uber.hoodie")
> > >   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY,
> > >     HoodieTableType.COPY_ON_WRITE.name())
> > >   .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
> > >     DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL) // insert
> > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "customer_id")
> > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year")
> > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "review_date")
> > >   .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test_table")
> > >   .mode(SaveMode.Overwrite)
> > >   .save("/tmp/hoodie/test_hoodie")
> > >
> > >
> > > Caused by: com.uber.hoodie.exception.HoodieException: review_date(Part -review_date) field not found in record.
> > > Acceptable fields were :[marketplace, customer_id, review_id, product_id,
> > > product_parent, product_title, product_category, star_rating,
> > > helpful_votes, total_votes, vine, verified_purchase, review_headline,
> > > review_body, review_date, year]
> > >   at com.uber.hoodie.DataSourceUtils.getNestedFieldValAsString(DataSourceUtils.java:79)
> > >   at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:93)
> > >   at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:92)
> > >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> > >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> > >   at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
> > >   at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
> > >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> > >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> > >   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
> > >   at org.apache.spark.scheduler.Task.run(Task.scala:112)
> > >   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
> > >   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
> > >   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
> > >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >   at java.lang.Thread.run(Thread.java:748)
> > >
> >
>