Dan Fike created LIVY-457:
-----------------------------

             Summary: PySpark `sqlContext.sparkSession` incorrect on Spark 2.x
                 Key: LIVY-457
                 URL: https://issues.apache.org/jira/browse/LIVY-457
             Project: Livy
          Issue Type: Bug
    Affects Versions: 0.6.0
         Environment: RHEL6, Spark 2.1.2.1
            Reporter: Dan Fike


It looks like the {{SQLContext}} we create in {{PySpark}} sessions isn't 
constructed correctly. Compare how the behavior has changed between Livy 0.4.0 
and what is currently on {{master}} (0.6.0).

Livy 0.4.0
{code}
$ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool

$ curl --silent localhost:8998/sessions/1/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool

$ curl --silent localhost:8998/sessions/1/statements/0 | python -m json.tool
{
    "id": 0,
    "state": "available",
    "output": {
        "status": "ok",
        "execution_count": 0,
        "data": {
            "text/plain": "<pyspark.sql.session.SparkSession object at 0x15a26d0>"
        }
    },
    "progress": 1.0
}
{code}

Livy 0.6.0
{code}
$ curl --silent -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions | python -m json.tool

$ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession"}' | python -m json.tool

$ curl --silent localhost:8998/sessions/0/statements/0 | python -m json.tool
{
    "id": 0,
    "code": "sqlContext.sparkSession",
    "state": "available",
    "output": {
        "status": "ok",
        "execution_count": 0,
        "data": {
            "text/plain": "JavaObject id=o4"
        }
    },
    "progress": 1.0
}

$ curl --silent localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"sqlContext.sparkSession.toString()"}' | python -m json.tool

$ curl --silent localhost:8998/sessions/0/statements/1 | python -m json.tool
{
    "id": 1,
    "code": "sqlContext.sparkSession.toString()",
    "state": "available",
    "output": {
        "status": "ok",
        "execution_count": 1,
        "data": {
            "text/plain": "'org.apache.spark.sql.hive.HiveContext@200334d0'"
        }
    },
    "progress": 1.0
}
{code}

Notice how the value of {{sqlContext.sparkSession}} went from a 
{{pyspark.sql.session.SparkSession}} to an 
{{org.apache.spark.sql.hive.HiveContext}}?

I suspect this is because the change at 
https://github.com/apache/incubator-livy/commit/c1aafeb6cb87f2bd7f4cb7cf538822b59fb34a9c#diff-c58e3946d3530f54014129c268988e01R563
 passes {{jsqlc}} as the _second_ positional parameter to {{SQLContext}}, 
whereas the Spark diff at 
https://github.com/apache/spark/commit/89addd40abdacd65cc03ac8aa5f9cf3dd4a4c19b#diff-74ba016ef40c1cb268e14aee817d71bdR50
 suggests it should be the _third_ positional parameter ({{jsqlContext}}), 
since {{sparkSession}} now occupies the second slot.

I'd wager the fix is simply to explicitly pass that parameter as a keyword 
argument instead.
{code}
sqlc = SQLContext(sc, jsqlContext=jsqlc)
{code}
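For illustration, the misbinding can be reproduced without Spark using a plain Python stand-in for the Spark 2.x constructor signature, {{SQLContext(sparkContext, sparkSession=None, jsqlContext=None)}} (the signature introduced by the linked Spark commit); the class and placeholder object below are just a sketch, not actual PySpark code:

```python
# Stand-in mimicking the Spark 2.x SQLContext constructor signature:
#   SQLContext(sparkContext, sparkSession=None, jsqlContext=None)
class SQLContext:
    def __init__(self, sparkContext, sparkSession=None, jsqlContext=None):
        self.sparkSession = sparkSession
        self._jsqlContext = jsqlContext

jsqlc = object()  # placeholder for the Java-side SQLContext handle

# Buggy call (as on master): jsqlc binds to the *second* positional
# parameter, sparkSession, so the raw Java object leaks out as sparkSession.
buggy = SQLContext(None, jsqlc)
assert buggy.sparkSession is jsqlc

# Proposed fix: pass it by keyword, leaving sparkSession unset so PySpark
# can construct a proper pyspark.sql.session.SparkSession itself.
fixed = SQLContext(None, jsqlContext=jsqlc)
assert fixed.sparkSession is None
assert fixed._jsqlContext is jsqlc
```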



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)