Hi Yuanbin,
Not sure if I completely understood the problem. Are you using the
"com.uber.hoodie" format for reading the dataset? Are you using
hoodie-spark-bundle?
From the stack overflow link,
https://stackoverflow.com/questions/48034825/why-does-streaming-query-fail-with-invalidschemaexception-a-group-type-can-not?noredirect=1&lq=1
, this could be because of the parquet version. Assuming this is the issue, I
just checked the spark-bundle and the parquet class dependencies are all
shaded, so the new version of hoodie-spark-bundle should not be a problem as
such. Please make sure you are only using hoodie-spark-bundle and that no
other Hudi packages are on the classpath. Also, make sure Spark does not pull
in an older version of parquet.
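One quick way to spot a duplicate parquet pulled in by Spark is to scan the
jars on the classpath for artifacts that appear under more than one version.
A rough sketch (the helper below is only an illustration I wrote for this
mail, not part of Hudi or Spark):

```python
import re
from collections import defaultdict

def find_version_conflicts(classpath_jars):
    """Group jar filenames by artifact name and report any artifact
    that appears with more than one version on the classpath.
    Assumes the usual <artifact>-<x.y.z>.jar naming convention."""
    versions = defaultdict(set)
    for jar in classpath_jars:
        m = re.match(r"(.+?)-(\d[\w.\-]*)\.jar$", jar)
        if m:
            versions[m.group(1)].add(m.group(2))
    # Only artifacts present in two or more versions are conflicts.
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}

# Feed it the jar list from the Spark UI "Environment" tab, or from
# sc.listJars() on the driver:
jars = ["parquet-column-1.8.1.jar",
        "parquet-column-1.10.0.jar",
        "spark-core_2.11-2.3.0.jar"]
print(find_version_conflicts(jars))  # {'parquet-column': ['1.10.0', '1.8.1']}
```

If parquet shows up twice there, the unshaded copy is the likely culprit.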
Balaji.V
On Wednesday, May 29, 2019, 4:58:37 PM PDT, FIXED-TERM Cheng Yuanbin
(CR/PJ-AI-S1) <[email protected]> wrote:
All,
After we upgraded to the new release 0.4.7, a strange exception occurred when
we read the com.uber.hoodie dataset from parquet.
This exception never occurred in the previous version. I would appreciate it
if anyone could help me locate the cause of this exception.
Here I attach part of the exception log.
An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:293)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:285)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:283)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:303)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
................
Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can
not be empty. Parquet does not support empty group without leaves. Empty group:
spark_schema
at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
It seems that this exception is caused by the schema of the dataframe written
to the Hudi dataset. I carefully compared the dataframes in our test cases;
the only difference is the nullable field.
All the test cases in the Hudi test schema contain nullable = true; however,
some of my test cases contain fields with nullable = false.
I tried converting every nullable field to true in our dataset, but it still
raises the same exception.
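For reference, the nullability rewrite I attempted looks roughly like the
sketch below, which operates on the JSON form of a Spark schema (as produced
by df.schema.json()); the helper name is my own, not a Spark or Hudi API:

```python
import json

def force_nullable(node):
    """Recursively mark every field of a Spark StructType JSON schema
    as nullable, including nested structs, arrays, and maps."""
    if isinstance(node, dict):
        t = node.get("type")
        if t == "struct":
            for field in node["fields"]:
                field["nullable"] = True
                force_nullable(field["type"])
        elif t == "array":
            node["containsNull"] = True
            force_nullable(node["elementType"])
        elif t == "map":
            node["valueContainsNull"] = True
            force_nullable(node["valueType"])
    return node

# A schema with nullable = false fields, as df.schema.json() would emit it:
schema = json.loads("""
{"type":"struct","fields":[
  {"name":"id","type":"long","nullable":false,"metadata":{}},
  {"name":"tags","type":{"type":"array","elementType":"string","containsNull":false},
   "nullable":false,"metadata":{}}]}
""")
force_nullable(schema)
```

The relaxed schema can then be turned back into a StructType with
StructType.fromJson and applied to the dataframe before writing, e.g. via
spark.createDataFrame(df.rdd, new_schema). Even with this rewrite, the
exception above still occurs for us.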
Best regards
Yuanbin Cheng
CR/PJ-AI-S1