Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/21546#discussion_r199612847
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala ---
@@ -34,17 +33,19 @@ private[sql] object PythonSQLUtils {
}
/**
- * Python Callable function to convert ArrowPayloads into a [[DataFrame]].
+ * Python callable function to read a file in Arrow stream format and create a [[DataFrame]]
+ * using each serialized ArrowRecordBatch as a partition.
*
- * @param payloadRDD A JavaRDD of ArrowPayloads.
- * @param schemaString JSON Formatted Schema for ArrowPayloads.
* @param sqlContext The active [[SQLContext]].
- * @return The converted [[DataFrame]].
+ * @param filename File to read the Arrow stream from.
+ * @param schemaString JSON Formatted Spark schema for Arrow batches.
+ * @return A new [[DataFrame]].
*/
- def arrowPayloadToDataFrame(
- payloadRDD: JavaRDD[Array[Byte]],
- schemaString: String,
- sqlContext: SQLContext): DataFrame = {
- ArrowConverters.toDataFrame(payloadRDD, schemaString, sqlContext)
+ def arrowReadStreamFromFile(
--- End diff --
It's important to keep `arrowStreamFromFile` in the name, since it is a stream
format being read from a file, but how about `arrowStreamFromFileToDataFrame`?
It's a bit long, but it would be good to indicate that it produces a `DataFrame`
for the call from Python.
---