Could you rebuild the whole project? I changed the Python function serialization format in https://github.com/apache/spark/pull/11535 to fix a bug. This exception looks like somewhere is still using the old code.
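For anyone hitting this, here is a minimal sketch of the failure mode using plain pickle. It is not PySpark's actual cloudpickle code; the FuncRef class is made up for illustration, and the two helper bodies are stand-ins that only mirror the argument counts in the traceback. The point is that pickle resolves a reconstruction helper by name at load time, so a payload written against a five-argument helper blows up when loaded against a four-argument one:

    import pickle

    # "Old" reconstruction helper: the pickle stream will reference it by
    # name and call it with five arguments, as in the traceback below.
    def _fill_function(func, func_globals, defaults, func_dict, module):
        return func

    class FuncRef(object):
        # Hypothetical stand-in: serializes itself as a call to
        # _fill_function with five arguments.
        def __reduce__(self):
            return (_fill_function, (len, {}, None, {}, 'pyspark.rdd'))

    payload = pickle.dumps(FuncRef())

    # "New" helper: the signature now takes four arguments, which is what
    # a worker running different code than the driver would have.
    def _fill_function(func, func_globals, defaults, func_dict):
        return func

    # pickle looks _fill_function up by name at load time, so the old
    # five-argument payload hits the new four-argument signature.
    try:
        pickle.loads(payload)
    except TypeError as e:
        # Python 2 prints: _fill_function() takes exactly 4 arguments (5 given)
        print(e)

That is essentially what happens when the driver pickles tasks with one version of the PySpark code and a stale worker unpickles them with another, which is why rebuilding the whole project should fix it.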
On Sun, Mar 6, 2016 at 6:24 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> Just in case, my Python version is 2.7.10.
>
> 2016-03-07 11:19 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>:
>
>> Hi all,
>>
>> While I was testing some code in PySpark, I ran into a weird issue.
>>
>> This works fine on Spark 1.6.0, but it looks like it does not on Spark 2.0.0.
>>
>> When I simply run logData = sc.textFile(path).coalesce(1) with some
>> big files in stand-alone local mode without HDFS, it throws the
>> exception:
>>
>> _fill_function() takes exactly 4 arguments (5 given)
>>
>> I wanted to open a JIRA for this at first, but this is such primitive
>> functionality that I felt I might be doing something wrong.
>>
>> The full error message is below:
>>
>> 16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/20160119000000-20160215235900-TROI_STAT_ADE_0.DAT:2415919104+33554432
>> 16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/20160119000000-20160215235900-TROI_STAT_ADE_0.DAT:805306368+33554432
>> 16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/20160119000000-20160215235900-TROI_STAT_ADE_0.DAT:0+33554432
>> 16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/20160119000000-20160215235900-TROI_STAT_ADE_0.DAT:1610612736+33554432
>> 16/03/07 11:12:44 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 2)
>> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>>   File "./python/pyspark/worker.py", line 98, in main
>>     command = pickleSer._read_with_length(infile)
>>   File "./python/pyspark/serializers.py", line 164, in _read_with_length
>>     return self.loads(obj)
>>   File "./python/pyspark/serializers.py", line 422, in loads
>>     return pickle.loads(obj)
>> TypeError: ('_fill_function() takes exactly 4 arguments (5 given)',
>> <function _fill_function at 0x101e105f0>, (<function add_shuffle_key at
>> 0x10612c488>, {'defaultdict': <type 'collections.defaultdict'>,
>> 'get_used_memory': <function get_used_memory at 0x1027c8b18>, 'pack_long':
>> <function pack_long at 0x101e1ec08>}, None, {}, 'pyspark.rdd'))
>>
>>     at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:168)
>>     at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:209)
>>     at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:127)
>>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:62)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>     at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:349)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:82)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> 16/03/07 11:12:44 ERROR executor.Executor: Exception in task 3.0 in stage 0.0 (TID 3)
>> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>>   File "./python/pyspark/worker.py", line 98, in main
>>     command = pickleSer._read_with_length(infile)
>>   File "./python/pyspark/serializers.py", line 164, in _read_with_length
>>     return self.loads(obj)
>>   File "./python/pyspark/serializers.py", line 422, in loads
>>     return pickle.loads(obj)
>> TypeError: ('_fill_function() takes exactly 4 arguments (5 given)',
>> <function _fill_function at 0x101e105f0>, (<function add_shuffle_key at
>> 0x10612c488>, {'defaultdict': <type 'collections.defaultdict'>,
>> 'get_used_memory': <function get_used_memory at 0x1027c8b18>, 'pack_long':
>> <function pack_long at 0x101e1ec08>}, None, {}, 'pyspark.rdd'))
>>
>> Thanks!