Also curious: does this error go away with 0.4.6? Can you please confirm? That would help narrow it down.
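If it helps, a quick way to A/B this is to run the same read once against each bundle version. A minimal sketch (the table path and the partition glob are placeholders; adjust them to your layout):

    // Sketch only: read the table through the com.uber.hoodie datasource
    // and force an action so the parquet read path actually runs.
    // "/path/to/hudi/table" and the "/*/*" glob are placeholders.
    val df = spark.read
      .format("com.uber.hoodie")
      .load("/path/to/hudi/table/*/*")

    df.printSchema()
    df.count() // triggers the code path that throws in your stack trace

Running that once with hoodie-spark-bundle 0.4.6 on the classpath and once with 0.4.7 should tell us whether the bundle version is really the variable.
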
On Wed, May 29, 2019 at 6:25 PM [email protected] <[email protected]> wrote:

> Hi Yuanbin,
>
> Not sure if I completely understood the problem. Are you using the
> "com.uber.hoodie" format for reading the dataset? Are you using
> hoodie-spark-bundle?
> From the Stack Overflow link,
> https://stackoverflow.com/questions/48034825/why-does-streaming-query-fail-with-invalidschemaexception-a-group-type-can-not?noredirect=1&lq=1,
> this could be because of the parquet version. Assuming this is the issue, I
> just checked spark-bundle and the parquet class dependencies are all
> shaded. So the new version of hoodie-spark-bundle should not be a problem
> as such. Please make sure you are using only hoodie-spark-bundle and that
> no other Hudi packages are on the classpath. Also, make sure Spark does not
> pull in an older version of parquet.
>
> Balaji.V
>
> On Wednesday, May 29, 2019, 4:58:37 PM PDT, FIXED-TERM Cheng Yuanbin
> (CR/PJ-AI-S1) <[email protected]> wrote:
>
> All,
>
> After we upgraded to the new release 0.4.7, a strange exception started
> occurring when we read the com.uber.hoodie dataset from parquet.
> This exception never occurred in the previous version. I would appreciate
> it if anyone could help me locate the cause.
> Here is the relevant part of the exception log:
>
> An exception or error caused a run to abort.
> java.lang.ExceptionInInitializerError
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:293)
>     at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:285)
>     at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:283)
>     at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:303)
>     at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>     ................
>
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type
> can not be empty. Parquet does not support empty group without leaves.
> Empty group: spark_schema
>     at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
>     at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
>     at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
>     at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
>
> It seems that this exception is caused by the schema of the dataframe we
> write to the Hudi dataset. I carefully compared the dataframes in our test
> cases, and the only difference is the nullable field.
> All schemas in the Hudi test cases have nullable set to true, whereas some
> of my test cases contain fields with nullable set to false.
> I tried converting nullable to true on every field in our dataset, but the
> same exception still occurs.
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
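
P.S. On the nullable experiment in the original mail: flipping StructField.nullable on an existing DataFrame's schema object has no effect on the DataFrame itself; the frame has to be rebuilt with the modified schema. In case the conversion didn't take effect, this is the pattern that usually works (the helper name is mine, and it only touches top-level fields; nested structs would need a recursive version):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    // Rebuild the DataFrame with every top-level field marked nullable.
    def withNullableFields(df: DataFrame): DataFrame = {
      val schema = StructType(df.schema.map(_.copy(nullable = true)))
      df.sparkSession.createDataFrame(df.rdd, schema)
    }

Might be worth writing the dataset once more through that before ruling nullability out.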
