[ https://issues.apache.org/jira/browse/SPARK-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847651#comment-16847651 ]
Kyle Brooks edited comment on SPARK-7898 at 5/24/19 3:44 PM:
-------------------------------------------------------------

I have a use case for printing an aggregation in Spark to stdout for use in a bash script:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
count = df.count()

import sys
sys.stderr.write('Why does this get redirected to stdout?\n')

# This is the only thing I want in stdout:
print(count)
{code}

In client mode on a YARN cluster:

{code:bash}
spark2-submit --master yarn --deploy-mode client test_print.py 2>stderr.log 1>stdout.log
{code}

When I look in stdout.log:

{code}
Why does this get redirected to stdout?
10
{code}

The printing to stderr is done by a third-party library I don't have control over. Is this behavior by design? I understand that in some deploy modes the driver will not run on the machine the job is submitted from, but for the modes where it does, this seems broken.

> pyspark merges stderr into stdout
> ---------------------------------
>
>                 Key: SPARK-7898
>                 URL: https://issues.apache.org/jira/browse/SPARK-7898
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.3.0
>            Reporter: Sam Steingold
>            Priority: Major
>
> When I type
> {code}
> hadoop fs -text /foo/bar/baz.bz2 2>err 1>out
> {code}
> I get two non-empty files: {{err}} with
> {code}
> 2015-05-26 15:33:49,786 INFO [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized native-bzip2 library system-native
> 2015-05-26 15:33:49,789 INFO [main] compress.CodecPool (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]
> {code}
> and {{out}} with the content of the file (as expected).
> When I call the same command from Python (2.6):
> {code}
> from subprocess import Popen
> with open("out", "w") as out:
>     with open("err", "w") as err:
>         p = Popen(['hadoop', 'fs', '-text', "/foo/bar/baz.bz2"],
>                   stdin=None, stdout=out, stderr=err)
>         print p.wait()
> {code}
> I get the exact same (correct) behavior.
> *However*, when I run the same code under *PySpark* (or using {{spark-submit}}), I get an *empty* {{err}} file and the {{out}} file starts with the log messages above (and then it contains the actual data).
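The merging described above can be reproduced in plain Python by pointing a child process's stderr at its stdout, which is the {{subprocess}} analogue of Java's {{ProcessBuilder.redirectErrorStream(true)}}. This is a minimal sketch of the symptom only, not Spark's actual launcher code; it reuses the hadoop command from the report:

{code:python}
from subprocess import Popen, STDOUT

# Merge the child's stderr into its stdout, producing the same symptom
# the reporter observes under spark-submit.
with open("out", "w") as out:
    p = Popen(["hadoop", "fs", "-text", "/foo/bar/baz.bz2"],
              stdout=out, stderr=STDOUT)
    print(p.wait())

# "out" now begins with the codec log lines and then the file contents,
# and no separate stderr stream exists -- matching the empty "err" file
# and merged "out" file seen under PySpark.
{code}

Until the behavior changes, one way to keep a single value out of the merged stream is to bypass stdout entirely and hand the result to the calling bash script through a file. This is a hedged sketch; the {{COUNT_OUT}} environment variable and the default path are illustrative assumptions, not Spark API:

{code:python}
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
count = spark.range(10).count()

# Hypothetical convention: the calling script chooses the output path via
# an environment variable, so the value never touches the merged stdout.
count_path = os.environ.get("COUNT_OUT", "count.out")
with open(count_path, "w") as f:
    f.write(str(count) + "\n")
{code}

The calling script would then run, for example:

{code:bash}
COUNT_OUT=count.txt spark2-submit --master yarn --deploy-mode client test_print.py 2>stderr.log 1>stdout.log
count=$(cat count.txt)
{code}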