[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...

ueshin Mon, 11 Sep 2017 01:20:07 -0700

Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19147#discussion_r138005735
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/VectorizedPythonRunner.scala ---
    @@ -0,0 +1,329 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream}
    +import java.net.Socket
    +import java.nio.charset.StandardCharsets
    +
    +import scala.collection.JavaConverters._
    +
    +import org.apache.arrow.vector.VectorSchemaRoot
    +import org.apache.arrow.vector.stream.{ArrowStreamReader, ArrowStreamWriter}
    +
    +import org.apache.spark.{SparkEnv, SparkFiles, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType, PythonException, PythonRDD, SpecialLengths}
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.execution.arrow.{ArrowUtils, ArrowWriter}
    +import org.apache.spark.sql.execution.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Similar to `PythonRunner`, but exchange data with Python worker via columnar format.
    + */
    +class VectorizedPythonRunner(
    +    funcs: Seq[ChainedPythonFunctions],
    +    batchSize: Int,
    +    bufferSize: Int,
    +    reuse_worker: Boolean,
    +    argOffsets: Array[Array[Int]]) extends Logging {
    +
    +  require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
    +
    +  // All the Python functions should have the same exec, version and envvars.
    +  private val envVars = funcs.head.funcs.head.envVars
    +  private val pythonExec = funcs.head.funcs.head.pythonExec
    +  private val pythonVer = funcs.head.funcs.head.pythonVer
    +
    +  // TODO: support accumulator in multiple UDF
    +  private val accumulator = funcs.head.funcs.head.accumulator
    +
    +  // todo: return column batch?
    +  def compute(
    --- End diff --
    
    @icexelloss Ah, I see, thanks! I still agree with refactoring `PythonRunner`.
    @BryanCutler As for the error, do you mean a case like [test_vectorized_udf_exception](https://github.com/apache/spark/pull/19147/files#diff-7c2fe8530271c0635fb99f7b49e0c4a4R3250)? If not, could you please let me know which case you mean so that I can think about it?



---

