uzadude opened a new pull request #31543:
URL: https://github.com/apache/spark/pull/31543


   ### What changes were proposed in this pull request?
   
   Added an option to provide the Avro schema by URL.
   
   ### Why are the changes needed?
   (copied from the Jira ticket)
   
   We have a use case in which we read a huge table in Avro format, with about 30k columns.
   
   Using the default Hive reader, `AvroGenericRecordReader`, the read just hangs forever; after 4 hours not even one task had finished.
   
   We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:
   
   ```
   org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
   ..
   at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
   at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
   ... 53 elided
   ```
   
   because the file schema contains duplicate column names (when compared case-insensitively).
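
   For illustration, a minimal (hypothetical) Avro schema that would trip this check, since `Id` and `id` collide under a case-insensitive comparison:
   
   ```
   {
     "type": "record",
     "name": "Example",
     "fields": [
       {"name": "Id", "type": "long"},
       {"name": "id", "type": "long"}
     ]
   }
   ```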
   
   So we wanted to provide a user-specified schema with non-duplicated fields, but the schema is huge (a few MBs), so it is not practical to pass it inline as a JSON string.
   
   So we patched spark-avro to also accept an `avroSchemaUrl` option in addition to `avroSchema`, and it worked perfectly.
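
   For reference, a sketch of how a read looks with this option (the paths below are placeholders, not from the PR):
   
   ```scala
   // Fetch the (multi-MB) Avro schema from a Hadoop-accessible URL instead of
   // passing it inline as a JSON string via the existing `avroSchema` option.
   val df = spark.read
     .format("avro")
     .option("avroSchemaUrl", "hdfs:///schemas/huge_table.avsc")  // hypothetical path
     .load("/data/huge_table")                                    // hypothetical path
   ```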
   
   
   ### How was this patch tested?
   Added a unit test to AvroSuite and tested locally with the patched version.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


