[ https://issues.apache.org/jira/browse/SPARK-7898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847651#comment-16847651 ]
Kyle Brooks edited comment on SPARK-7898 at 5/24/19 3:44 PM:
-------------------------------------------------------------

I have a use case for printing an aggregation in Spark to stdout for use in a bash script:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
count = df.count()

import sys
sys.stderr.write('Why does this get redirected to stdout?\n')

# This is the only thing I want in stdout:
print(count)
{code}

In client mode on a YARN cluster:

{code:bash}
spark2-submit --master yarn --deploy-mode client test_print.py 2>stderr.log 1>stdout.log
{code}

When I look in stdout.log:

{code}
Why does this get redirected to stdout?
10
{code}

The printing to stderr is done by a third-party library I don't have control over. Is this behavior by design? I understand that in some deploy modes the driver will not run on the machine the job is submitted from, but for the modes where it does, this seems broken.

> pyspark merges stderr into stdout
> ---------------------------------
>
>                 Key: SPARK-7898
>                 URL: https://issues.apache.org/jira/browse/SPARK-7898
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.3.0
>            Reporter: Sam Steingold
>            Priority: Major
>
> When I type
> {code}
> hadoop fs -text /foo/bar/baz.bz2 2>err 1>out
> {code}
> I get two non-empty files: {{err}} with
> {code}
> 2015-05-26 15:33:49,786 INFO [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized native-bzip2 library system-native
> 2015-05-26 15:33:49,789 INFO [main] compress.CodecPool (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]
> {code}
> and {{out}} with the content of the file (as expected).
> When I call the same command from Python (2.6):
> {code}
> from subprocess import Popen
> with open("out", "w") as out:
>     with open("err", "w") as err:
>         p = Popen(['hadoop', 'fs', '-text', "/foo/bar/baz.bz2"],
>                   stdin=None, stdout=out, stderr=err)
>         print p.wait()
> {code}
> I get the exact same (correct) behavior.
> *However*, when I run the same code under *PySpark* (or using {{spark-submit}}), I get an *empty* {{err}} file and the {{out}} file starts with the log messages above (and then it contains the actual data).
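The merging described above can be reproduced in plain Python by pointing a child process's stderr at its stdout, which is the {{subprocess}} analogue of Java's {{ProcessBuilder.redirectErrorStream(true)}}. This is a minimal sketch of the symptom only, not Spark's actual launcher code; it reuses the hadoop command from the report:

{code:python}
from subprocess import Popen, STDOUT

# Merge the child's stderr into its stdout, producing the same symptom
# the reporter observes under spark-submit.
with open("out", "w") as out:
    p = Popen(["hadoop", "fs", "-text", "/foo/bar/baz.bz2"],
              stdout=out, stderr=STDOUT)
    print(p.wait())

# "out" now begins with the codec log lines and then the file contents,
# and no separate stderr stream exists -- matching the empty "err" file
# and merged "out" file seen under PySpark.
{code}

Until the behavior changes, one way to keep a single value out of the merged stream is to bypass stdout entirely and hand the result to the calling bash script through a file. This is a hedged sketch; the {{COUNT_OUT}} environment variable and the default path are illustrative assumptions, not Spark API:

{code:python}
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
count = spark.range(10).count()

# Hypothetical convention: the calling script chooses the output path via
# an environment variable, so the value never touches the merged stdout.
count_path = os.environ.get("COUNT_OUT", "count.out")
with open(count_path, "w") as f:
    f.write(str(count) + "\n")
{code}

The calling script would then run, for example:

{code:bash}
COUNT_OUT=count.txt spark2-submit --master yarn --deploy-mode client test_print.py 2>stderr.log 1>stdout.log
count=$(cat count.txt)
{code}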