Yi Zhou created SPARK-10310:
-------------------------------
Summary: [Spark SQL] All result records will be populated in ONE
line due to missing the correct line/field separator
Key: SPARK-10310
URL: https://issues.apache.org/jira/browse/SPARK-10310
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yi Zhou
Priority: Blocker
We have a real use case that runs a Python streaming script from a Spark SQL
query. We found that all result records from the "select" are written on ONE
line as input to the Python script, so the script cannot identify the individual
records. In addition, the field separator in Spark SQL is '^A' ('\001'), which
is inconsistent with the '\t' used in the Hive implementation.
#################Key Query####################:
CREATE VIEW temp1 AS
SELECT *
FROM
(
  FROM
  (
    SELECT
      c.wcs_user_sk,
      w.wp_type,
      (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
    FROM web_clickstreams c, web_page w
    WHERE c.wcs_web_page_sk = w.wp_web_page_sk
    AND c.wcs_web_page_sk IS NOT NULL
    AND c.wcs_user_sk IS NOT NULL
    AND c.wcs_sales_sk IS NULL --abandoned implies: no sale
    DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
  ) clicksAnWebPageType
  REDUCE
    wcs_user_sk,
    tstamp_inSec,
    wp_type
  USING 'python sessionize.py 3600'
  AS (
    wp_type STRING,
    tstamp BIGINT,
    sessionid STRING)
) sessionized
#############Key Python Script#################
import sys
for line in sys.stdin:
    # Expect three tab-separated fields per input line
    user_sk, tstamp_str, value = line.strip().split("\t")
############Result Records example from 'select' ##################
^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
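To illustrate why the script cannot handle this input, here is a minimal sketch. It assumes the caret sequences in the sample stand for control characters (^A = \x01, ^V = \x16, ^U = \x15, ^T = \x14); the variable names are illustrative only.

```python
# Sample input as shown above, with caret notation assumed to mean
# control characters: ^A = \x01, ^V = \x16, ^U = \x15, ^T = \x14.
raw = (
    "\x1631\x013237764860\x01feedback"
    "\x1531\x013237769106\x01dynamic"
    "\x1431\x013237779027\x01review"
)

# Same parsing step as the script: strip, then split on tab.
fields = raw.strip().split("\t")

# There is no tab and no newline, so everything lands in a single field
# and unpacking into three names fails.
try:
    user_sk, tstamp_str, value = fields
    unpacked = True
except ValueError:
    unpacked = False
```

With this input, `fields` holds one element containing all three records, so the three-way unpacking raises `ValueError`.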
############Expected Result Records format######################
31 3237764860 feedback
31 3237769106 dynamic
31 3237779027 review
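As a diagnostic aid only, the sketch below converts the observed single-line input into the expected tab-separated, one-record-per-line form. It rests on an observation from this one sample, not on confirmed Spark behavior: each leading control character's code (^V = 22, ^U = 21, ^T = 20) happens to equal the length of the record that follows it. The function name `decode_length_prefixed` and the single-byte-length assumption are hypothetical.

```python
def decode_length_prefixed(raw, field_sep="\x01"):
    """Split <length char><record> chunks into lists of fields.

    Assumption (from the sample only): each record is preceded by one
    character whose code point is the record's length.
    """
    records = []
    pos = 0
    while pos < len(raw):
        length = ord(raw[pos])                    # assumed one-character length prefix
        record = raw[pos + 1 : pos + 1 + length]  # the record body
        records.append(record.split(field_sep))   # fields separated by ^A
        pos += 1 + length
    return records


# Observed sample, with ^A = \x01, ^V = \x16, ^U = \x15, ^T = \x14 assumed.
raw = (
    "\x1631\x013237764860\x01feedback"
    "\x1531\x013237769106\x01dynamic"
    "\x1431\x013237779027\x01review"
)

rows = decode_length_prefixed(raw)
expected_text = "\n".join("\t".join(row) for row in rows)
print(expected_text)
```

If the assumption holds for this sample, the printed text has three lines with tab-separated fields, which is the form the Python script expects on stdin.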