+1 yes, if it's actually null.

Good catch, Frank! :)
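
For Umesh's case, a quick way to confirm this (just a sketch, not from the original thread: it assumes df1 is the input DataFrame from the snippet further down, and df1NonNull is a made-up name) would be to count and drop rows whose precombine field is null before the Hudi write:

import org.apache.spark.sql.functions.col

// how many rows have a null precombine value?
df1.filter(col("review_date").isNull).count()

// drop (or backfill) those rows before writing with Hudi
val df1NonNull = df1.na.drop(Seq("review_date"))

If that count is non-zero, it would line up with the exception further down the thread.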

On Sun, Mar 10, 2019 at 7:23 PM kaka chen <[email protected]> wrote:

> A possible root cause is that the field of the record is null.
>
> public static String getNestedFieldValAsString(GenericRecord record, String fieldName) {
>   String[] parts = fieldName.split("\\.");
>   GenericRecord valueNode = record;
>   int i = 0;
>   for (; i < parts.length; i++) {
>     String part = parts[i];
>     Object val = valueNode.get(part);
>     if (val == null) {
>       // a null value breaks out of the loop and falls through to the
>       // "field not found" exception below, even though the field exists
>       break;
>     }
>
>     // return, if last part of name
>     if (i == parts.length - 1) {
>       return val.toString();
>     } else {
>       // VC: Need a test here
>       if (!(val instanceof GenericRecord)) {
>         throw new HoodieException("Cannot find a record at part value :" + part);
>       }
>       valueNode = (GenericRecord) val;
>     }
>   }
>   throw new HoodieException(fieldName + "(Part -" + parts[i] + ") field not found in record. "
>       + "Acceptable fields were :" + valueNode.getSchema().getFields()
>       .stream().map(Field::name).collect(Collectors.toList()));
> }
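
To illustrate that path, here is a hypothetical, standalone sketch (not from the original thread; the Review schema below is made up to mirror Umesh's case): a record whose review_date value is null breaks out of the loop above and falls through to the final "field not found" exception, even though the field exists in the schema.

import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.GenericData

// a schema that *does* contain review_date, as a nullable (optional) string
val schema = SchemaBuilder.record("Review").fields()
  .optionalString("review_date")
  .endRecord()

val rec = new GenericData.Record(schema)
rec.put("review_date", null) // value is null, but the field is present

// DataSourceUtils.getNestedFieldValAsString(rec, "review_date") would then throw:
//   review_date(Part -review_date) field not found in record. ...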
>
>
> Vinoth Chandar <[email protected]> wrote on Sun, Mar 10, 2019 at 2:11 PM:
>
> > Hmmm. That's interesting. I can see that the parsing works, since the
> > exception said "Part - review_date". There are definitely users who have
> > done this before, so I'm not sure what's going on.
> >
> > Can you paste the generated Avro schema? The following is the corresponding
> > code line:
> > log.info(s"Registered avro schema : ${schema.toString(true)}")
> >
> > Maybe create a gist (gist.github.com) for easier sharing of the
> > code/stack trace?
> > Thanks
> > Vinoth
> >
> > On Sat, Mar 9, 2019 at 1:33 PM Umesh Kacha <[email protected]> wrote:
> >
> > > Hi Vinoth, thanks. I have already done that and checked it; please see the
> > > column highlighted in red below.
> > >
> > > root
> > >  |-- marketplace: string (nullable = true)
> > >  |-- customer_id: string (nullable = true)
> > >  |-- review_id: string (nullable = true)
> > >  |-- product_id: string (nullable = true)
> > >  |-- product_parent: string (nullable = true)
> > >  |-- product_title: string (nullable = true)
> > >  |-- product_category: string (nullable = true)
> > >  |-- star_rating: string (nullable = true)
> > >  |-- helpful_votes: string (nullable = true)
> > >  |-- total_votes: string (nullable = true)
> > >  |-- vine: string (nullable = true)
> > >  |-- verified_purchase: string (nullable = true)
> > >  |-- review_headline: string (nullable = true)
> > >  |-- review_body: string (nullable = true)
> > >  |-- review_date: string (nullable = true)
> > >  |-- year: integer (nullable = true)
> > >
> > > On Sun, Mar 10, 2019 at 2:27 AM Vinoth Chandar <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > >> review_date(Part -review_date) field not found in record
> > > >
> > > > Seems like the precombine field is not in the input DF? Can you try doing
> > > > df1.printSchema and check that once?
> > > >
> > > > On Sat, Mar 9, 2019 at 11:52 AM Umesh Kacha <[email protected]> wrote:
> > > >
> > > > > Hi, I have the following code, which I am using to bulk insert a huge
> > > > > CSV file loaded into a Spark DataFrame, but it fails saying the column
> > > > > review_date is not found, even though that column is definitely in the
> > > > > DataFrame. Please guide.
> > > > >
> > > > > df1.write
> > > > >   .format("com.uber.hoodie")
> > > > >   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, HoodieTableType.COPY_ON_WRITE.name())
> > > > >   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL) // insert
> > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "customer_id")
> > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year")
> > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "review_date")
> > > > >   .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test_table")
> > > > >   .mode(SaveMode.Overwrite)
> > > > >   .save("/tmp/hoodie/test_hoodie")
> > > > >
> > > > >
> > > > > Caused by: com.uber.hoodie.exception.HoodieException: review_date(Part -review_date) field not found in record. Acceptable fields were :[marketplace, customer_id, review_id, product_id, product_parent, product_title, product_category, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_headline, review_body, review_date, year]
> > > > >   at com.uber.hoodie.DataSourceUtils.getNestedFieldValAsString(DataSourceUtils.java:79)
> > > > >   at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:93)
> > > > >   at com.uber.hoodie.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:92)
> > > > >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> > > > >   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> > > > >   at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
> > > > >   at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
> > > > >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> > > > >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> > > > >   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
> > > > >   at org.apache.spark.scheduler.Task.run(Task.scala:112)
> > > > >   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
> > > > >   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
> > > > >   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
> > > > >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > >   at java.lang.Thread.run(Thread.java:748)
> > > > >
> > > >
> > >
> >
>
