Also curious: does this error happen with 0.4.6 as well? Can you please
confirm that? It would help narrow it down.
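One quick way to test, assuming you launch via spark-shell/spark-submit with
--packages (adjust to however you actually pull in the bundle), is to pin the
previous release:

    spark-shell --packages com.uber.hoodie:hoodie-spark-bundle:0.4.6

If the same read succeeds on 0.4.6 and fails on 0.4.7, that points at the
bundle upgrade rather than your job.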

On Wed, May 29, 2019 at 6:25 PM [email protected] <[email protected]>
wrote:

>  Hi Yuanbin,
>
> Not sure if I completely understood the problem. Are you using the
> "com.uber.hoodie" format for reading the dataset? Are you using
> hoodie-spark-bundle?
> From the Stack Overflow link,
> https://stackoverflow.com/questions/48034825/why-does-streaming-query-fail-with-invalidschemaexception-a-group-type-can-not?noredirect=1&lq=1
> , this could be because of the parquet version. Assuming this is the issue,
> I just checked the spark-bundle and the parquet class dependencies are all
> shaded. So the new version of hoodie-spark-bundle should not be a problem
> as such. Please make sure you are using only hoodie-spark-bundle and that
> no other Hudi packages are on the classpath. Also, make sure Spark does not
> pull in an older version of parquet.
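> One quick way to see which jar is supplying the parquet classes at runtime
> (a diagnostic sketch, assuming you can open a spark-shell against the same
> classpath as the failing job):
>
>   // Prints the jar that the failing parquet class was loaded from
>   classOf[org.apache.parquet.schema.MessageType]
>     .getProtectionDomain.getCodeSource.getLocation
>
> If that points at an unexpected or older parquet jar, the classpath is the
> likely culprit.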
> Balaji.V
>
>     On Wednesday, May 29, 2019, 4:58:37 PM PDT, FIXED-TERM Cheng Yuanbin
> (CR/PJ-AI-S1) <[email protected]> wrote:
>
>  All,
>
> After we upgraded to the new release 0.4.7, a strange exception occurred
> when we read the com.uber.hoodie dataset from parquet.
> This exception never occurred in the previous version. I would really
> appreciate it if anyone could help me locate the cause.
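> For context, a minimal sketch of how we read the dataset (the base path and
> partition glob are placeholders):
>
>   // Read the Hudi dataset through the com.uber.hoodie datasource
>   val df = spark.read
>     .format("com.uber.hoodie")
>     .load("/path/to/hudi/dataset/*/*/*/*")
>   df.count()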
> Here I attach part of the exception log.
>
> An exception or error caused a run to abort.
> java.lang.ExceptionInInitializerError
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:293)
> at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:285)
> at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:283)
> at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:303)
> at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
> at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> ................
>
> Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type
> can not be empty. Parquet does not support empty group without leaves.
> Empty group: spark_schema
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
> at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
> at org.apache.parquet.schema.MessageType.<init>(MessageType.java:50)
> at org.apache.parquet.schema.Types$MessageTypeBuilder.named(Types.java:1256)
>
> It seems that this exception is caused by the schema of the dataframe
> written to the Hudi dataset. I carefully compared the dataframes in our
> test cases; the only difference is the nullable field.
> All test cases in the Hudi test schema have nullable set to true, while
> some of my test cases have fields with nullable set to false.
> I tried converting every field in our dataset to nullable = true, but it
> still throws the same exception.
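> For completeness, the nullability rewrite we tried looks roughly like this
> (df stands in for our input dataframe):
>
>   import org.apache.spark.sql.types.StructType
>
>   // Copy the schema with every top-level field marked nullable = true,
>   // then rebuild the dataframe against the relaxed schema
>   val relaxedSchema = StructType(df.schema.fields.map(_.copy(nullable = true)))
>   val relaxedDf = spark.createDataFrame(df.rdd, relaxedSchema)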
>
>
> Best regards
>
> Yuanbin Cheng
> CR/PJ-AI-S1
>
>
