GitHub user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/7971#discussion_r39197676
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -794,6 +797,45 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
}
+  /**
+   * Reads in a directory of Avro files from HDFS, a local file system (available on all nodes),
+   * or any Hadoop-supported file system URI. The records are read in as generic Avro records.
+   * This also allows a user to register one or more schemas with Kryo, if they choose to.
+   *
+   * You can do the following if you know the schema ahead of time:
+   * {{{
+   * val schema = new Schema.Parser().parse(schemaString)
+   * sc.avroFile("/input-path", schema)
+   * }}}
+   *
+   * or just:
+   * {{{
+   * sc.avroFile("/input-path")
+   * }}}
+   */
+  def avroFile(path: String, schemas: Schema*): RDD[GenericRecord] = {
--- End diff --
Not quite familiar with Avro, but why do we need to pass in more than one
`Schema` instance here? Is this because all nested `Schema` instances must also
be registered? If that is the case, it would be nice to document it.
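
To make the question concrete, here is a minimal, hypothetical sketch (the
schema JSON, field names, and the `sc.avroFile(...)` call in the final comment
are illustrative assumptions, not code from this PR) of how one record schema
can contain a nested record schema, i.e. a second distinct `Schema` instance:

```scala
import org.apache.avro.Schema

// Hypothetical schema: a "User" record containing a nested "Address" record.
val schemaJson =
  """{"type": "record", "name": "User", "fields": [
    |  {"name": "name", "type": "string"},
    |  {"name": "address", "type": {"type": "record", "name": "Address",
    |    "fields": [{"name": "city", "type": "string"}]}}
    |]}""".stripMargin

val userSchema: Schema = new Schema.Parser().parse(schemaJson)
// The nested record is a separate Schema instance in its own right:
val addressSchema: Schema = userSchema.getField("address").schema()

// If every nested schema must be registered, a caller would presumably
// have to write something like:
//   sc.avroFile("/input-path", userSchema, addressSchema)
```

If each nested instance really does need to be registered for Kryo, that would
explain the varargs signature, and the Scaladoc should say so.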