xkrogen commented on a change in pull request #31133:
URL: https://github.com/apache/spark/pull/31133#discussion_r570355910
##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala
##########
@@ -388,6 +394,9 @@ private[hive] object HiveTableUtil {
private[hive] object DeserializerLock
private[hive] object HadoopTableReader extends HiveInspectors with Logging {
+
+ val avroTableProperties =
AvroTableProperties.values().map(_.getPropName()).toSet
Review comment:
It seems that the examples of `avro.serde.type` and
`avro.serde.skip.bytes` provide good reason why we shouldn't pick up
`AvroTableProperties.*` or `startsWith("avro.")`. These properties indicate
something about the specific files, e.g. it's possible you could have some
partitions serialized with the Confluence Avro serializer (thus requiring the
`avro.serde` properties) and some partitions serialized with normal Avro
behavior. Schemas seem to be a special case.
Looking at the full list of properties in Hive 2.3:
```
SCHEMA_LITERAL("avro.schema.literal"),
SCHEMA_URL("avro.schema.url"),
SCHEMA_NAMESPACE("avro.schema.namespace"),
SCHEMA_NAME("avro.schema.name"),
SCHEMA_DOC("avro.schema.doc"),
AVRO_SERDE_SCHEMA("avro.serde.schema"),
SCHEMA_RETRIEVER("avro.schema.retriever");
```
It seems that the first 5 we would definitely want at the table level, and
probably the last one (?). I don't understand `avro.serde.schema` and how it
relates to the others. But it might be reasonable to say that we should take
all `startsWith("avro.schema.")`
Separately from the question of _which_ properties to pick up, there is the
question of whether to hard-code the strings or use the enum values.
@dongjoon-hyun can you explain more fully why you don't want to depend on
`AvroTableProperties`? We already have another reference to it in the Spark
codebase [within
HiveShim](https://github.com/apache/spark/blob/6a194f197819e2970f619ed92321e0a0b716bf37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala#L92).
I don't understand the argument that some older branches may not have this
class available -- if an older branch has a Hive dependency that is too old and
we need to cherry pick this change there, then we can hard-code it in the
cherry pick, but I don't see a reason why we can't use the enum in newer
versions.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]