jkdll edited a comment on issue #3113:
URL: https://github.com/apache/hudi/issues/3113#issuecomment-870598881


   After further investigation, it seems the first error 
(`java.lang.ArrayIndexOutOfBoundsException`) is related to data types being 
changed and new fields being introduced in the middle of the schema. The topic 
I am reading from contains a lot of data whose structure has changed quite 
drastically; the changes are backward compatible, but fields were introduced 
in the middle of the schema. Per the documentation, I have therefore concluded 
that this kind of schema evolution will not work.
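   
   To illustrate the kind of change involved (the record and field names here 
are hypothetical, not my actual schema), the evolution looks roughly like 
this:
   ```
   Writer schema v1:
   {"type": "record", "name": "event", "fields": [
     {"name": "id", "type": "string"},
     {"name": "time", "type": "string"}]}

   Writer schema v2 (backward compatible, but "status" is inserted in the middle):
   {"type": "record", "name": "event", "fields": [
     {"name": "id", "type": "string"},
     {"name": "status", "type": ["null", "string"], "default": null},
     {"name": "time", "type": "string"}]}
   ```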
   
   Instead, I have opted for a different setup, but I am still hitting the 
same error as originally reported (`Unsupported type NULL`):
   
   1. Using `org.apache.hudi.utilities.schema.SchemaRegistryProvider`, I 
define the source and target schema registry URLs with the following configs:
   ```
   --hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.targetUrl=
   --hoodie-conf "hoodie.deltastreamer.schemaprovider.registry.url=
   ```
   2. The schema registered at the `targetUrl` is a reduced version of the 
source schema, containing just two fields.
   3. Within the DeltaStreamer command, I use a `SqlQueryBasedTransformer` 
with the following properties:
   ```
   --transformer-class 
"org.apache.hudi.utilities.transform.SqlQueryBasedTransformer"
   --hoodie-conf "hoodie.deltastreamer.transformer.sql=\
   SELECT CAST(id as STRING) as id,\
   CAST(COALESCE(body.Payload.creation.time,'') as STRING) as 
body_Payload_creation_time\
   FROM <SRC>"
   ```
   4. This maps directly to the schema registered at the target schema 
registry:
   ```
   {
     "type": "record",
     "name": "test_table_schema",
     "fields": [
       {"name": "id", "type": "string"},
       {"name": "body_Payload_creation_time", "type": "string"}
     ]
   }
   ```
   5. But when using deltastreamer, I am met with this error (same as above):
   ```
   ERROR Client: Application diagnostics message: User class threw exception: 
org.apache.spark.sql.avro.IncompatibleSchemaException: Unsupported type NULL
           at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:130)
           at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:82)
           at 
org.apache.spark.sql.avro.SchemaConverters$$anonfun$1.apply(SchemaConverters.scala:81)
           at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
           at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
           at scala.collection.Iterator$class.foreach(Iterator.scala:891)
           at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
           at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
           at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
           at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
           at scala.collection.AbstractTraversable.map(Traversable.scala:104)
           at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:81)
           at 
org.apache.spark.sql.avro.SchemaConverters$.toSqlType(SchemaConverters.scala:46)
           at 
org.apache.hudi.AvroConversionUtils$.convertAvroSchemaToStructType(AvroConversionUtils.scala:115)
           at 
org.apache.hudi.AvroConversionUtils.convertAvroSchemaToStructType(AvroConversionUtils.scala)
           at 
org.apache.hudi.utilities.schema.SparkAvroPostProcessor.processSchema(SparkAvroPostProcessor.java:44)
           at 
org.apache.hudi.utilities.schema.SchemaProviderWithPostProcessor.lambda$getSourceSchema$0(SchemaProviderWithPostProcessor.java:41)
           at org.apache.hudi.common.util.Option.map(Option.java:107)
           at 
org.apache.hudi.utilities.schema.SchemaProviderWithPostProcessor.getSourceSchema(SchemaProviderWithPostProcessor.java:41)
           at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.registerAvroSchemas(DeltaSync.java:680)
           at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.<init>(DeltaSync.java:209)
           at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:571)
           at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:138)
           at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:102)
           at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:480)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
   ```
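   
   Judging by the stack trace, the failure happens inside Spark's 
`SchemaConverters` while converting the *source* schema (note the 
`getSourceSchema` frame), and "Unsupported type NULL" suggests some field in 
that schema has a bare `"null"` type rather than a nullable union. As a quick 
sanity check, the registered schema can be scanned for such fields with a 
small script like this (my own sketch, not a Hudi utility; the schema at the 
bottom is a hypothetical example):

```python
import json

def find_null_fields(schema, path=""):
    """Recursively collect field paths whose Avro type is a bare "null"
    (or a union containing only "null"), which Spark's SchemaConverters
    cannot map to a SQL type."""
    hits = []
    if isinstance(schema, str):
        if schema == "null":
            hits.append(path or "<root>")
    elif isinstance(schema, list):
        # Union: only problematic if every branch is "null".
        branches = [b for b in schema if b != "null"]
        if not branches:
            hits.append(path or "<root>")
        for b in branches:
            hits.extend(find_null_fields(b, path))
    elif isinstance(schema, dict):
        t = schema.get("type")
        if t == "record":
            for f in schema.get("fields", []):
                child = f"{path}.{f['name']}".lstrip(".")
                hits.extend(find_null_fields(f["type"], child))
        elif t == "array":
            hits.extend(find_null_fields(schema["items"], path + "[]"))
        elif t == "map":
            hits.extend(find_null_fields(schema["values"], path + "{}"))
        else:
            hits.extend(find_null_fields(t, path))
    return hits

# Hypothetical source schema containing a field that is only ever null:
src = json.loads("""
{"type": "record", "name": "event", "fields": [
  {"name": "id", "type": "string"},
  {"name": "deprecated_field", "type": "null"}
]}
""")
print(find_null_fields(src))  # ['deprecated_field']
```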
   
   Given that I am explicitly using a smaller target schema (for which I do 
not expect any NULL values to be read, especially since I cast everything to 
STRING in the query above), I don't know why I am getting this error. My 
reasoning is that since I am now using the transformer, and coalescing null 
values in the query, I should not be facing this error when writing to Hudi. 
**Could you please clarify whether this reasoning is correct?**
   
   The error above is the same one I reported in the original ticket when 
using the Flatten transformer, so either I have misunderstood something or 
the issue does not lie with the target table.

