Hi,

This does sound like a jar mismatch issue tied to the Spark version. I have
seen a similar ticket associated with Spark 2.1.x, IIRC. If you are building
your own uber/fat jar, it is probably better to depend on the hoodie-spark
module than on hoodie-spark-bundle, which is an uber jar itself.
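
For illustration, the swap would look roughly like this in your pom (a
sketch, mirroring the snippet you posted; worth double-checking the exact
artifact coordinates against the release):

        <dependency>
            <groupId>com.uber.hoodie</groupId>
            <artifactId>hoodie-spark</artifactId>
            <version>0.4.7</version>
        </dependency>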

What version of Spark are you using?

Thanks
Vinoth

On Thu, May 30, 2019 at 11:24 AM FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <
[email protected]> wrote:

> Hi,
>
> The test case is really simple, much like the Hudi test cases.
> I have two dataframes and use CopyOnWrite: I write the first one with
> Overwrite, then write the second one with Append; both operations use
> the format "com.uber.hoodie".
> However, the exception occurs when I read the dataset after these two
> write operations.
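> In case it helps, the flow is roughly the following (a sketch; df1/df2
> are the two dataframes above, basePath is our dataset path, and the
> option values here are placeholders, not our real ones):
>
>     import org.apache.spark.sql.SaveMode
>
>     // Placeholder Hudi configs; ours just name the table/key/precombine fields.
>     val hudiOptions = Map(
>       "hoodie.table.name" -> "test_table",
>       "hoodie.datasource.write.recordkey.field" -> "id",
>       "hoodie.datasource.write.precombine.field" -> "ts")
>
>     df1.write.format("com.uber.hoodie").options(hudiOptions)
>       .mode(SaveMode.Overwrite).save(basePath)
>     df2.write.format("com.uber.hoodie").options(hudiOptions)
>       .mode(SaveMode.Append).save(basePath)
>
>     // The read that throws (globbed path into the dataset):
>     spark.read.format("com.uber.hoodie").load(basePath + "/*/*").count()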
> I use Maven to manage the dependencies; here is the relevant part of my
> dependency list:
>
>         <dependency>
>             <groupId>com.uber.hoodie</groupId>
>             <artifactId>hoodie-spark-bundle</artifactId>
>             <version>0.4.7</version>
>         </dependency>
>
> This exception only happens with 0.4.7; if I change it to 0.4.6, it
> works fine.
> I have run the same test against
> 1. the GitHub repository compiled on my laptop
> 2. the source code of the 0.4.7 release compiled on my laptop
> and both worked fine.
>
> Maybe it is because of the Maven release.
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
>
>
> -----Original Message-----
> From: Vinoth Chandar <[email protected]>
> Sent: Wednesday, May 29, 2019 8:00 PM
> To: [email protected]
> Subject: Re: Strange exception after upgrade to 0.4.7
>
> Also curious: can you confirm that this error does not happen with
> 0.4.6? That would help narrow it down.
>
> On Wed, May 29, 2019 at 6:25 PM [email protected] <[email protected]>
> wrote:
>
> >  Hi Yuanbin,
> >
> > Not sure if I completely understood the problem. Are you using the
> > "com.uber.hoodie" format for reading the dataset? Are you using
> > hoodie-spark-bundle?
> > From the Stack Overflow link,
> > https://stackoverflow.com/questions/48034825/why-does-streaming-query-fail-with-invalidschemaexception-a-group-type-can-not?noredirect=1&lq=1
> > this could be because of the Parquet version. Assuming that is the
> > issue, I just checked the spark-bundle and the Parquet class
> > dependencies are all shaded. So the new version of hoodie-spark-bundle
> > should not be a problem as such. Please make sure you are only using
> > hoodie-spark-bundle and that no other Hudi packages are on the
> > classpath. Also, make sure Spark does not pull in an older version of
> > Parquet.
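> > For example, if you are on Maven, something like this should show
> > where Parquet is being pulled in from:
> >
> >     mvn dependency:tree -Dincludes=org.apache.parquet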
> > Balaji.V
> >
> >     On Wednesday, May 29, 2019, 4:58:37 PM PDT, FIXED-TERM Cheng Yuanbin (CR/PJ-AI-S1) <[email protected]> wrote:
> >
> >  All,
> >
> > After we upgraded to the new release 0.4.7, a strange exception
> > started occurring when we read the com.uber.hoodie dataset from
> > Parquet. This exception never occurred with the previous version. I
> > would really appreciate it if anyone could help me track it down.
> > Here is part of the exception log:
> >
> > An exception or error caused a run to abort.
> > java.lang.ExceptionInInitializerError
> >         at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:293)
> >         at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:285)
> >         at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:283)
> >         at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:303)
> >         at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
> >         at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
> >         at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> > ................
> >
> > Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves.
> > Empty group: spark_schema
> >         at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> >         at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> >         at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> >         at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
> >
> > It seems that this exception is caused by the schema of the dataframe
> > written to the Hudi dataset. I carefully compared the dataframes in
> > our test cases; the only difference is the nullable flag on the fields.
> > The schemas in the Hudi test cases all have nullable set to true,
> > whereas some of my test cases contain fields with nullable set to
> > false. I tried converting every field in our dataset to nullable =
> > true, but it still raises the same exception.
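> > For reference, the conversion was roughly along these lines (a
> > sketch; note it only rewrites top-level fields, not nested structs):
> >
> >     import org.apache.spark.sql.DataFrame
> >     import org.apache.spark.sql.types.{StructField, StructType}
> >
> >     def allNullable(df: DataFrame): DataFrame = {
> >       // Copy the schema with every top-level field marked nullable.
> >       val schema = StructType(df.schema.map {
> >         case StructField(name, dataType, _, metadata) =>
> >           StructField(name, dataType, nullable = true, metadata)
> >       })
> >       df.sparkSession.createDataFrame(df.rdd, schema)
> >     }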
> >
> >
> > Best regards
> >
> > Yuanbin Cheng
> > CR/PJ-AI-S1
> >
> >
>
