Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/19147#discussion_r138005735
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/VectorizedPythonRunner.scala
---
@@ -0,0 +1,329 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io.{BufferedInputStream, BufferedOutputStream,
DataInputStream, DataOutputStream}
+import java.net.Socket
+import java.nio.charset.StandardCharsets
+
+import scala.collection.JavaConverters._
+
+import org.apache.arrow.vector.VectorSchemaRoot
+import org.apache.arrow.vector.stream.{ArrowStreamReader,
ArrowStreamWriter}
+
+import org.apache.spark.{SparkEnv, SparkFiles, TaskContext}
+import org.apache.spark.api.python.{ChainedPythonFunctions,
PythonEvalType, PythonException, PythonRDD, SpecialLengths}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.arrow.{ArrowUtils, ArrowWriter}
+import org.apache.spark.sql.execution.vectorized.{ArrowColumnVector,
ColumnarBatch, ColumnVector}
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * Similar to `PythonRunner`, but exchange data with Python worker via
columnar format.
+ */
+class VectorizedPythonRunner(
+ funcs: Seq[ChainedPythonFunctions],
+ batchSize: Int,
+ bufferSize: Int,
+ reuse_worker: Boolean,
+ argOffsets: Array[Array[Int]]) extends Logging {
+
+ require(funcs.length == argOffsets.length, "argOffsets should have the
same length as funcs")
+
+ // All the Python functions should have the same exec, version and
envvars.
+ private val envVars = funcs.head.funcs.head.envVars
+ private val pythonExec = funcs.head.funcs.head.pythonExec
+ private val pythonVer = funcs.head.funcs.head.pythonVer
+
+ // TODO: support accumulator in multiple UDF
+ private val accumulator = funcs.head.funcs.head.accumulator
+
+ // todo: return column batch?
+ def compute(
--- End diff --
@icexelloss Ah, I see, thanks! I still agree with refactoring
`PythonRunner`.
@BryanCutler As for the error, do you mean the case like
[test_vectorized_udf_exception](https://github.com/apache/spark/pull/19147/files#diff-7c2fe8530271c0635fb99f7b49e0c4a4R3250)?
If not, could you please let me know the case and think about it?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]