Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1551#discussion_r15568273
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
    @@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
                   throw new SparkException("Unexpected Tuple2 element type " + 
pair._1.getClass)
               }
             case other =>
    -          throw new SparkException("Unexpected element type " + 
first.getClass)
    +          if (other == null) {
    +            dataOut.writeInt(SpecialLengths.NULL)
    +            writeIteratorToStream(iter, dataOut)
    --- End diff --
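
    For context, each element in this stream is prefixed with its byte length, so a negative sentinel can mark a null with no payload. A minimal sketch of what the matching reader side could look like (the sentinel value -5 and the helper name are assumptions for illustration, not necessarily the actual protocol):

    ```scala
    import java.io.DataInputStream

    // Sketch only: read one element from a length-prefixed stream where a
    // negative sentinel (assumed -5, mirroring SpecialLengths.NULL) marks
    // a null element with no payload bytes following it.
    def readElement(in: DataInputStream): Option[Array[Byte]] = {
      val length = in.readInt()
      if (length == -5) {
        None                        // null marker: nothing more to read
      } else {
        val buf = new Array[Byte](length)
        in.readFully(buf)           // read exactly `length` payload bytes
        Some(buf)
      }
    }
    ```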
    
    If users want to call UDFs in Java/Scala from PySpark, they have to use this private API to do it, so it's possible to have nulls in an RDD[String] or RDD[Array[Byte]] (one way this can happen is sketched below).
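
    A minimal sketch (all names hypothetical) of how such a null can show up in an RDD[String] on the Scala side before it is handed to Python:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object NullDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("null-demo").setMaster("local[*]"))
        val lookup = Map("a" -> "1", "b" -> "2")
        // "c" has no entry, so the resulting RDD[String] contains a null
        val rdd = sc.parallelize(Seq("a", "b", "c"))
          .map(k => lookup.getOrElse(k, null))
        // Serializing this RDD to the Python side without a null marker
        // would hit a NullPointerException in writeIteratorToStream.
        println(rdd.collect().mkString(", "))
        sc.stop()
      }
    }
    ```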
    
    BTW, it would be helpful if we could skip some bad rows during map/reduce, as mentioned in the MapReduce paper. This is not a must-have feature, but it really improves the robustness of the whole framework and is very useful for large-scale jobs; a user-level approximation is sketched below.
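
    Until the framework supports that, something similar can be approximated with mapPartitions: wrap the per-record function and drop records whose processing throws, instead of failing the whole task. A sketch (helper name hypothetical; the MapReduce paper's skip-bad-records mode does this inside the framework itself):

    ```scala
    import scala.reflect.ClassTag
    import scala.util.Try
    import org.apache.spark.rdd.RDD

    // Sketch only: apply `f` to each record, silently skipping records
    // that throw instead of killing the task. A production version would
    // also count skipped records, e.g. with an accumulator.
    def mapSkippingBadRows[T, U: ClassTag](rdd: RDD[T])(f: T => U): RDD[U] =
      rdd.mapPartitions { iter =>
        iter.flatMap(record => Try(f(record)).toOption)
      }

    // Usage: val parsed = mapSkippingBadRows(lines)(parseLine)
    ```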
    
    This PR tries to improve the stability of PySpark, so users feel safer and happier hacking in PySpark.

