[ https://issues.apache.org/jira/browse/LIVY-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Bronte updated LIVY-504:
-----------------------------
    Summary: Livy pyspark sqlContext behavior does not match pyspark shell  (was: Pyspark sqlContext behavior does not match pyspark shell)

> Livy pyspark sqlContext behavior does not match pyspark shell
> -------------------------------------------------------------
>
>                 Key: LIVY-504
>                 URL: https://issues.apache.org/jira/browse/LIVY-504
>             Project: Livy
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5.0
>         Environment: AWS EMR 5.16.0
>            Reporter: Adam Bronte
>            Priority: Major
>
> On 0.5.0, I'm seeing inconsistent behavior through Livy with the SparkContext and sqlContext compared to the pyspark shell.
> For example, running this through the pyspark shell works:
> {code:java}
> [root@ip-10-0-0-32 ~]# pyspark
> Python 2.7.14 (default, May 2 2018, 18:31:34)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> 18/08/28 18:50:37 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
>       /_/
> Using Python version 2.7.14 (default, May 2 2018 18:31:34)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SQLContext
> >>> my_sql_context = SQLContext.getOrCreate(sc)
> >>> df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')
> >>> print(df.count())
> 67556724
> {code}
> But through Livy, the same code throws an exception:
> {code:java}
> from pyspark.sql import SQLContext
> my_sql_context = SQLContext.getOrCreate(sc)
> df = my_sql_context.read.parquet('s3://my-bucket/mydata.parquet')
> 'JavaMember' object has no attribute 'read'
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
>     return DataFrameReader(self)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
>     self._jreader = spark._ssql_ctx.read()
> AttributeError: 'JavaMember' object has no attribute 'read'{code}
> Trying to use the default-initialized sqlContext also throws the same error:
> {code:java}
> df = sqlContext.read.parquet('s3://my-bucket/mydata.parquet')
> 'JavaMember' object has no attribute 'read'
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 433, in read
>     return DataFrameReader(self)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 70, in __init__
>     self._jreader = spark._ssql_ctx.read()
> AttributeError: 'JavaMember' object has no attribute 'read'{code}
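> For context, the traceback shows what pyspark does under the hood: sqlContext.read is a property that constructs a DataFrameReader, whose __init__ immediately calls spark._ssql_ctx.read(). A minimal self-contained model of that failure mode (not pyspark itself, just an illustration of why a method handle in place of the Java context object produces exactly this error):
> {code:java}
> class FakeJavaMember(object):
>     """Stands in for py4j's JavaMember: a method handle with no 'read'."""
>
> class FakeSQLContext(object):
>     _ssql_ctx = FakeJavaMember()
>
>     @property
>     def read(self):
>         # Mirrors DataFrameReader.__init__: spark._ssql_ctx.read()
>         return self._ssql_ctx.read()
>
> # Raises: AttributeError: 'FakeJavaMember' object has no attribute 'read'
> FakeSQLContext().read
> {code}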
> In both the pyspark shell and Livy, the objects look the same.
> pyspark shell:
> {code:java}
> >>> print(sc)
> <SparkContext master=yarn appName=PySparkShell>
> >>> print(sqlContext)
> <pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>
> >>> print(my_sql_context)
> <pyspark.sql.context.SQLContext object at 0x7fd15dfc3450>{code}
> Livy:
> {code:java}
> print(sc)
> <SparkContext master=yarn appName=livy-session-1>
> print(sqlContext)
> <pyspark.sql.context.SQLContext object at 0x7f478c06b850>
> print(my_sql_context)
> <pyspark.sql.context.SQLContext object at 0x7f478c06b850>{code}
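> The reprs above only show the Python wrapper, though. A hypothetical way to surface the difference would be to compare, in both environments, the Java-side handle that DataFrameReader actually calls into:
> {code:java}
> # In a healthy pyspark shell, _ssql_ctx is a py4j JavaObject wrapping the
> # Java SQLContext. The traceback suggests that under Livy it is instead a
> # py4j JavaMember (a method reference), which has no 'read' attribute.
> print(type(sqlContext._ssql_ctx))
> {code}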
> I'm running this through sparkmagic, but I've also confirmed the same behavior when calling the API directly.
> {code:java}
> curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool
> {
>     "appId": null,
>     "appInfo": {
>         "driverLogUrl": null,
>         "sparkUiUrl": null
>     },
>     "id": 3,
>     "kind": "pyspark",
>     "log": [
>         "stdout: ",
>         "\nstderr: ",
>         "\nYARN Diagnostics: "
>     ],
>     "owner": null,
>     "proxyUser": null,
>     "state": "starting"
> }
> {code}
> {code:java}
> curl --silent localhost:8998/sessions/3/statements -X POST -H 'Content-Type: application/json' -d '{"code":"df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")"}' | python -m json.tool
> {
>     "code": "df = sqlContext.read.parquet(\"s3://my-bucket/mydata.parquet\")",
>     "id": 1,
>     "output": null,
>     "progress": 0.0,
>     "state": "running"
> }
> {code}
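> For completeness, the same flow as one Python script using the requests library (a hypothetical sketch: it assumes Livy on localhost:8998 and the same s3 path, and reads the session id from the create response instead of hardcoding 3):
> {code:java}
> import time
> import requests
>
> HOST = "http://localhost:8998"
>
> # Create a pyspark session and wait until it becomes idle.
> session = requests.post(HOST + "/sessions", json={"kind": "pyspark"}).json()
> session_url = "%s/sessions/%d" % (HOST, session["id"])
> while requests.get(session_url).json()["state"] != "idle":
>     time.sleep(1)
>
> # Submit the failing statement and poll it until it finishes.
> code = 'df = sqlContext.read.parquet("s3://my-bucket/mydata.parquet")'
> stmt = requests.post(session_url + "/statements", json={"code": code}).json()
> stmt_url = "%s/statements/%d" % (session_url, stmt["id"])
> while True:
>     result = requests.get(stmt_url).json()
>     if result["state"] in ("available", "error", "cancelled"):
>         break
>     time.sleep(1)
>
> # Under 0.5.0 this shows the same 'JavaMember' AttributeError.
> print(result["output"])
> {code}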
> On 0.4.0, both the pyspark shell and Livy versions of this code worked.
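> A possible workaround sketch in the meantime (untested here; it assumes the Livy session also exposes the SparkSession it created as 'spark', as the pyspark shell does):
> {code:java}
> from pyspark.sql import SQLContext
>
> # Go through the SparkSession entry point instead of the injected sqlContext...
> df = spark.read.parquet("s3://my-bucket/mydata.parquet")
>
> # ...or build a fresh SQLContext wrapper around the existing SparkContext.
> df = SQLContext(sc).read.parquet("s3://my-bucket/mydata.parquet")
> {code}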



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
