Hi,

This does sound like a jar mismatch issue tied to the Spark version. I have seen a similar ticket associated with Spark 2.1.x, IIRC. If you are building your own uber/fat jar, it is probably better to depend on the hoodie-spark module rather than hoodie-spark-bundle, which is an uber jar itself.
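Concretely, that swap would look something like this in your pom (a sketch, assuming the 0.4.7 release under the com.uber.hoodie groupId; adjust the version to whatever you are actually on):

```xml
<!-- Sketch: depend on the plain module instead of the pre-shaded bundle
     when building your own uber/fat jar, so that your own shade plugin
     controls which Parquet classes get relocated. -->
<dependency>
  <groupId>com.uber.hoodie</groupId>
  <artifactId>hoodie-spark</artifactId>
  <version>0.4.7</version>
</dependency>
```

Afterwards, `mvn dependency:tree -Dincludes=org.apache.parquet` is one way to check which Parquet version actually ends up on your classpath.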
What version of Spark are you using?

Thanks
Vinoth

On Thu, May 30, 2019 at 11:24 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <[email protected]> wrote:

> Hi,
>
> The test case is really simple, much like the Hudi test cases. I have two
> dataframes; using CopyOnWrite, I first write one with Overwrite and then
> write the second with Append. Both operations use the format
> "com.uber.hoodie". However, the exception occurs when I read the dataset
> after these two write operations.
> I use Maven to manage the dependencies; here is the relevant part of my
> Maven dependencies:
>
>     <dependency>
>       <groupId>com.uber.hoodie</groupId>
>       <artifactId>hoodie-spark-bundle</artifactId>
>       <version>0.4.7</version>
>     </dependency>
>
> This exception only happens with 0.4.7; if I change it to 0.4.6, it works
> very well. I have run the same test with
> 1. the GitHub repository compiled on my laptop
> 2. the 0.4.7 source code compiled on my laptop
> and both worked very well.
>
> Maybe it is because of the Maven release.
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
>
>
> -----Original Message-----
> From: Vinoth Chandar <[email protected]>
> Sent: Wednesday, May 29, 2019 8:00 PM
> To: [email protected]
> Subject: Re: Strange exception after upgrade to 0.4.7
>
> Also curious whether this error happens with 0.4.6? Can you please
> confirm that? It would help narrow it down.
>
> On Wed, May 29, 2019 at 6:25 PM [email protected] <[email protected]>
> wrote:
>
> > Hi Yuanbin,
> >
> > Not sure if I completely understood the problem. Are you using the
> > "com.uber.hoodie" format for reading the dataset? Are you using
> > hoodie-spark-bundle?
> > From the Stack Overflow link,
> > https://stackoverflow.com/questions/48034825/why-does-streaming-query-fail-with-invalidschemaexception-a-group-type-can-not?noredirect=1&lq=1 ,
> > this could be because of the Parquet version.
> > Assuming this is the issue, I just checked the spark bundle and the
> > Parquet class dependencies are all shaded. So the new version of
> > hoodie-spark-bundle should not be a problem as such. Please make sure
> > you are only using hoodie-spark-bundle and that no other Hudi packages
> > are on the classpath. Also, make sure Spark does not pull in an older
> > version of Parquet.
> > Balaji.V
> >
> > On Wednesday, May 29, 2019, 4:58:37 PM PDT, FIXED-TERM Cheng Yuanbin
> > (CR/PJ-AI-S1) <[email protected]> wrote:
> >
> > All,
> >
> > After we upgraded to the new release 0.4.7, a strange exception
> > occurred when we read the com.uber.hoodie dataset from Parquet.
> > This exception never occurred in the previous version. I would
> > appreciate it if anyone could help me locate the cause of this
> > exception.
> > Here I attach part of the exception log.
> >
> > An exception or error caused a run to abort.
> > java.lang.ExceptionInInitializerError
> >   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:293)
> >   at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:285)
> >   at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:283)
> >   at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:303)
> >   at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
> >   at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
> >   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> > ................
> >
> > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group
> > type can not be empty. Parquet does not support empty group without
> > leaves.
> > Empty group: spark_schema
> >   at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> >   at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> >   at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> >   at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> >
> > It seems that this exception is caused by the schema of the dataframe
> > written to the Hudi dataset. I carefully compared the dataframes in our
> > test cases; the only difference is the nullable field.
> > All schemas in the Hudi tests have nullable set to true, whereas some
> > of my test cases have nullable set to false.
> > I tried converting nullable to true on every field in our dataset, but
> > I still get the same exception.
> >
> > Best regards
> >
> > Yuanbin Cheng
> > CR/PJ-AI-S1
