uzadude opened a new pull request #31543:
URL: https://github.com/apache/spark/pull/31543


   ### What changes were proposed in this pull request?
   
   Added an option to provide the Avro schema by URL.
   
   ### Why are the changes needed?
   (copied from the Jira ticket)
   
   We have a use case in which we read a huge table in Avro format, with about 30k columns.
   
   Using the default Hive reader, `AvroGenericRecordReader`, the read just hangs forever; after 4 hours not even one task had finished.
   
   We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:
   
   ```
   org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
   ..
   at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
   at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
   ... 53 elided
   ```
   
   because the file schema contains duplicate column names (when compared case-insensitively).
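
   For illustration, a minimal (hypothetical) Avro schema that would trip this check, since `Id` and `id` collide under a case-insensitive comparison:
   
   ```
   {
     "type": "record",
     "name": "Example",
     "fields": [
       {"name": "Id", "type": "long"},
       {"name": "id", "type": "long"}
     ]
   }
   ```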
   
   So we wanted to provide a user-specified schema with non-duplicated fields, but the schema is huge (a few MBs), so it is not practical to pass it inline as a JSON string.
   
   So we patched spark-avro to also accept an `avroSchemaUrl` option in addition to `avroSchema`, and it worked perfectly.
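
   For reference, a sketch of how a read looks with this option (the paths below are placeholders, not from the PR):
   
   ```scala
   // Fetch the (multi-MB) Avro schema from a Hadoop-accessible URL instead of
   // passing it inline as a JSON string via the existing `avroSchema` option.
   val df = spark.read
     .format("avro")
     .option("avroSchemaUrl", "hdfs:///schemas/huge_table.avsc")  // hypothetical path
     .load("/data/huge_table")                                    // hypothetical path
   ```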
   
   
   ### How was this patch tested?
   Added a unit test to AvroSuite and tested locally with the patched version.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


