[ https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15374241#comment-15374241 ]

Adrian Wang commented on SPARK-16515:
-------------------------------------

The problem is that Spark does not find the right record writer in its conf when
it has to write records out to the transform script. When the Python script then
reads the data from standard input, the unexpected row format makes it crash.
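As a minimal sketch of the symptom (assuming only what line 49 of q2-sessionize.py shows: a three-way unpack of a tab-split line), any row that Spark serializes with an extra field, or with a delimiter that leaves stray tabs in a field, breaks the unpack. The input strings below are hypothetical; the real script is not part of this issue.

```python
# Minimal reproduction of the ValueError reported in this issue
# (hypothetical inputs; only the unpack on line 49 of q2-sessionize.py
# is taken from the traceback).

def parse_row(line):
    # Expects exactly three tab-separated fields, as on line 49 of the script.
    user_sk, tstamp_str, item_sk = line.strip().split("\t")
    return user_sk, tstamp_str, item_sk

# A record in the expected format: three fields, tab-delimited.
print(parse_row("131\t86400\t42"))  # -> ('131', '86400', '42')

# If Spark picks the wrong record writer and emits, say, a fourth field,
# the three-way unpack fails with the same error as in the logs:
try:
    parse_row("131\t86400\t42\textra")
except ValueError as err:
    print(err)  # "too many values to unpack ..."
```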

> [SPARK][SQL] transformation script got failure for python script
> ----------------------------------------------------------------
>
>                 Key: SPARK-16515
>                 URL: https://issues.apache.org/jira/browse/SPARK-16515
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Yi Zhou
>            Priority: Critical
>
> Running the SQL below fails with a transformation script error for the Python 
> script; the error message is shown below.
> Query SQL:
> {code}
> CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
> SELECT DISTINCT
>   sessionid,
>   wcs_item_sk
> FROM
> (
>   FROM
>   (
>     SELECT
>       wcs_user_sk,
>       wcs_item_sk,
>       (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
>     FROM web_clickstreams
>     WHERE wcs_item_sk IS NOT NULL
>     AND   wcs_user_sk IS NOT NULL
>     DISTRIBUTE BY wcs_user_sk
>     SORT BY
>       wcs_user_sk,
>       tstamp_inSec -- "sessionize" reducer script requires the cluster by uid and sort by tstamp
>   ) clicksAnWebPageType
>   REDUCE
>     wcs_user_sk,
>     tstamp_inSec,
>     wcs_item_sk
>   USING 'python q2-sessionize.py 3600'
>   AS (
>     wcs_item_sk BIGINT,
>     sessionid STRING)
> ) q02_tmp_sessionize
> CLUSTER BY sessionid
> {code}
> Error Message:
> {code}
> 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in <module>
>     user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>       at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>       at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>       at org.apache.spark.scheduler.Task.run(Task.scala:85)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in <module>
>     user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
>       at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
>       at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
>       ... 14 more
> 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess exited with status 1. Error: Traceback (most recent call last):
>   File "q2-sessionize.py", line 49, in <module>
>     user_sk, tstamp_str, item_sk  = line.strip().split("\t")
> ValueError: too many values to unpack
> ) [duplicate 1]
> {code}
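The traceback is consistent with Spark feeding the reducer rows in a format it does not expect. As an illustration only (the real q2-sessionize.py is not shown in this issue, and the proper fix is in Spark's record-writer selection, not in the user script), a read loop that splits with an explicit maxsplit would tolerate extra trailing fields instead of crashing on the three-way unpack:

```python
# Hypothetical defensive variant of the reducer's read loop; a workaround
# sketch, not the actual q2-sessionize.py code.

def sessionize_lines(lines):
    for line in lines:
        # maxsplit=2 guarantees at most three pieces, so the unpack
        # cannot raise "too many values to unpack".
        user_sk, tstamp_str, rest = line.rstrip("\n").split("\t", 2)
        item_sk = rest.split("\t", 1)[0]  # drop any unexpected trailing fields
        yield user_sk, int(tstamp_str), item_sk

# A row with a spurious fourth field still parses:
for user_sk, tstamp, item_sk in sessionize_lines(["1\t100\t7\tjunk"]):
    print(user_sk, tstamp, item_sk)  # -> 1 100 7
```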



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
