[
https://issues.apache.org/jira/browse/LIVY-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyorgy Gal updated LIVY-322:
----------------------------
Fix Version/s: 0.10.0
(was: 0.9.0)
This issue has been moved to the 0.10.0 release as part of a bulk update. If
you feel this issue was moved inappropriately, feel free to provide
justification and reset the Fix Version to 0.9.0.
> JsonParseException on failure to parse text output from subprocess call to
> hadoop fs -rm
> ----------------------------------------------------------------------------------------
>
> Key: LIVY-322
> URL: https://issues.apache.org/jira/browse/LIVY-322
> Project: Livy
> Issue Type: Bug
> Components: API, Interpreter
> Affects Versions: 0.3
> Reporter: Rick Bernotas
> Priority: Major
> Fix For: 0.10.0
>
> Attachments: patch_LIVY-322_rickbernotas.patch
>
>
> In a PySpark session, running subprocess.call() to execute "hadoop fs -rm"
> on a Hadoop 2.7 cluster produces a text response (reporting that the file
> was moved to the .Trash folder in HDFS). This response causes a
> JsonParseException in Livy, after which all subsequent statement executions
> in the session fail.
> I suspect something in the "hadoop fs" response trips up Livy's conversion
> to JSON, perhaps a reserved or special character that Livy is not filtering
> out, since the response is otherwise innocuous.
> Livy needs to parse the response correctly instead of throwing an
> exception, and if an exception is thrown anyway, the session should be
> able to recover and continue running statements. After the JSON exception,
> even a print(1) statement fails to execute properly, requiring the user to
> obtain a new session.
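> The parse failure can be reproduced outside Livy. The sketch below uses
> Python's json module as a stand-in for Jackson (which Livy actually uses):
> the shell output from "hadoop fs -rm" is plain text, not JSON, so any
> strict JSON parser rejects it at the first token.
> {code:python}
> import json
>
> # The text that "hadoop fs -rm" prints on a trash-enabled cluster; it is
> # not valid JSON, so parsing fails at the first token ("Moved").
> shell_output = "Moved: 'foo.tmp' to trash at: .Trash/Current"
> try:
>     json.loads(shell_output)
> except json.JSONDecodeError as exc:
>     print(exc)  # Expecting value: line 1 column 1 (char 0)
> {code}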
> Example follows below.
> {code:java}
> ### CREATE A NEW PYSPARK SESSION
> -bash-4.1$ curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type:
> application/json" localhost:8998/sessions
> {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
> ### CHECK THE STATE OF SESSION 2 UNTIL IT GOES FROM "STARTING" STATE TO
> "IDLE" STATE
> -bash-4.1$ curl localhost:8998/sessions/2
> {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"starting","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
> -bash-4.1$ curl localhost:8998/sessions/2
> {"id":2,"appId":null,"owner":null,"proxyUser":null,"state":"idle","kind":"pyspark","appInfo":{"driverLogUrl":null,"sparkUiUrl":null},"log":[]}
> ### RUN THE PYSPARK CODE IN SESSION 2, "import subprocess"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H
> 'Content-Type: application/json' -d '{"code":"import subprocess"}'
> {"id":0,"state":"waiting","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/0
> {"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":""}}}
> ### THE OUTPUT IS {"text/plain":""} WHICH IS EXPECTED AND CORRECT
> ### RUN THE PYSPARK CODE IN SESSION 2, "subprocess.call(["hadoop", "fs",
> "-touchz", "foo.tmp"])"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H
> 'Content-Type: application/json' -d '{"code":"subprocess.call([\"hadoop\",
> \"fs\", \"-touchz\", \"foo.tmp\"])"}'
> {"id":1,"state":"running","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/1
> {"id":1,"state":"available","output":{"status":"ok","execution_count":1,"data":{"text/plain":"0"}}}
> ### THE OUTPUT IS {"text/plain":"0"} WHICH IS EXPECTED OUTPUT THAT THE TOUCHZ
> COMPLETED WITH RETURN CODE 0.
> ### RUN THE PYSPARK CODE IN SESSION 2,
> "print(subprocess.check_output(["hadoop", "fs", "-ls", "foo.tmp"]))"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H
> 'Content-Type: application/json' -d
> '{"code":"print(subprocess.check_output([\"hadoop\", \"fs\", \"-ls\",
> \"foo.tmp\"]))"}'
> {"id":2,"state":"waiting","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/2
> {"id":2,"state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"-rw-------
> 3 username group 0 2017-02-23 19:26 foo.tmp"}}}
> ### THE OUTPUT IS {"text/plain":"-rw------- 3 username group 0
> 2017-02-23 19:26 foo.tmp"} WHICH IS EXPECTED OUTPUT OF DIRECTORY LISTING
> ### RUN THE PYSPARK CODE IN SESSION 2, "subprocess.call(["hadoop", "fs",
> "-rm", "foo.tmp"])"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H
> 'Content-Type: application/json' -d '{"code":"subprocess.call([\"hadoop\",
> \"fs\", \"-rm\", \"foo.tmp\"])"}'
> {"id":3,"state":"waiting","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/3
> {"id":3,"state":"available","output":{"status":"error","execution_count":3,"ename":"com.fasterxml.jackson.core.JsonParseException","evalue":"Unrecognized
> token 'Moved': was expecting ('true', 'false' or 'null')\n at [Source:
> Moved: 'foo.tmp' to trash at: .Trash/Current; line: 1, column:
> 6]","traceback":[]}}
> ### JSON EXCEPTION APPEARS HERE WHICH IS INCORRECT PARSING OF THE OUTPUT
> ### RUN THE PYSPARK CODE IN SESSION 2, "print(1)"
> -bash-4.1$ curl localhost:8998/sessions/2/statements -X POST -H
> 'Content-Type: application/json' -d '{"code":"print(1)"}'
> {"id":4,"state":"available","output":null}
> ### GET THE OUTPUT OF THE CODE JUST RUN IN SESSION 2
> -bash-4.1$ curl localhost:8998/sessions/2/statements/4
> {"id":4,"state":"available","output":{"status":"ok","execution_count":4,"data":{"text/plain":""}}}
> ### THE OUTPUT IS {"text/plain":""} WHICH IS EMPTY STRING, INDICATING
> OPERATION COMPLETED WITH NO OUTPUT, WHICH IS INCORRECT, IT SHOULD RETURN 1
> {code}
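> One possible direction for the fix, sketched in Python rather than Livy's
> actual Scala/Jackson code and using a hypothetical helper name: output
> that fails to parse as JSON could be returned as plain text instead of
> propagating the exception and wedging the session.
> {code:python}
> import json
>
> def parse_statement_output(raw):
>     """Hypothetical fallback: treat output that is not valid JSON as
>     plain text instead of letting the parse exception break the session."""
>     try:
>         return json.loads(raw)
>     except json.JSONDecodeError:
>         return {"text/plain": raw}
>
> print(parse_statement_output("Moved: 'foo.tmp' to trash at: .Trash/Current"))
> # {'text/plain': "Moved: 'foo.tmp' to trash at: .Trash/Current"}
> {code}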
--
This message was sent by Atlassian Jira
(v8.20.10#820010)