Re: [build system] bumped pull request builder job timeout to 400mins

2018-08-07 Thread Hyukjin Kwon
Thanks, Shane.

On Wed, Aug 8, 2018 at 1:05 AM, shane knapp wrote:

> i hate doing this, because our tests and builds take WAY too long,
> but this should help get PRs through before the code freeze.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
Hi Sean,

On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen  wrote:
> Ah, python.  How about SparkContext._active_spark_context then?

Ah yes, that looks like the right member, but I'm a bit wary about
depending on functionality of objects with leading underscores. I
assumed that was "private" and subject to change. Is that something I
should be unconcerned about?

The other thought is that the accesses within SparkContext are protected
by "SparkContext._lock" -- should I also use that lock?

Thanks for your help!
Andrew

>
> On Tue, Aug 7, 2018 at 5:34 PM Andrew Melo  wrote:
>>
>> Hi Sean,
>>
>> On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
>> > Is SparkSession.getActiveSession what you're looking for?
>>
>> Perhaps -- though there's not a corresponding python function, and I'm
>> not exactly sure how to call the scala getActiveSession without first
>> instantiating the python version and causing a JVM to start.
>>
>> Is there an easy way to call getActiveSession that doesn't start a JVM?
>>
>> Cheers
>> Andrew
>>
>> >
>> > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo 
>> > wrote:
>> >>
>> >> Hello,
>> >>
>> >> One pain point with various Jupyter extensions [1][2] that provide
>> >> visual feedback about running spark processes is the lack of a public
>> >> API to introspect the web URL. The notebook server needs to know the
>> >> URL to find information about the current SparkContext.
>> >>
>> >> Simply looking for "localhost:4040" works most of the time, but fails
>> >> if multiple spark notebooks are being run on the same host -- spark
>> >> increments the port for each new context, leading to confusion when
>> >> the notebooks are trying to probe the web interface for information.
>> >>
>> >> I'd like to implement an analog to SparkContext.getOrCreate(), perhaps
>> >> called "getIfExists()" that returns the current singleton if it
>> >> exists, or None otherwise. The Jupyter code would then be able to use
>> >> this entrypoint to query Spark for an active Spark context, which it
>> >> could use to probe the web URL.
>> >>
>> >> It's a minor change, but this would be my first contribution to Spark,
>> >> and I want to make sure my plan was kosher before I implemented it.
>> >>
>> >> Thanks!
>> >> Andrew
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> [1] https://krishnan-r.github.io/sparkmonitor/
>> >>
>> >> [2] https://github.com/mozilla/jupyter-spark
>> >>
>> >




Re: SparkContext singleton get w/o create?

2018-08-07 Thread Sean Owen
Ah, python.  How about SparkContext._active_spark_context then?

On Tue, Aug 7, 2018 at 5:34 PM Andrew Melo  wrote:

> Hi Sean,
>
> On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
> > Is SparkSession.getActiveSession what you're looking for?
>
> Perhaps -- though there's not a corresponding python function, and I'm
> not exactly sure how to call the scala getActiveSession without first
> instantiating the python version and causing a JVM to start.
>
> Is there an easy way to call getActiveSession that doesn't start a JVM?
>
> Cheers
> Andrew
>
> >
> > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo 
> wrote:
> >>
> >> Hello,
> >>
> >> One pain point with various Jupyter extensions [1][2] that provide
> >> visual feedback about running spark processes is the lack of a public
> >> API to introspect the web URL. The notebook server needs to know the
> >> URL to find information about the current SparkContext.
> >>
> >> Simply looking for "localhost:4040" works most of the time, but fails
> >> if multiple spark notebooks are being run on the same host -- spark
> >> increments the port for each new context, leading to confusion when
> >> the notebooks are trying to probe the web interface for information.
> >>
> >> I'd like to implement an analog to SparkContext.getOrCreate(), perhaps
> >> called "getIfExists()" that returns the current singleton if it
> >> exists, or None otherwise. The Jupyter code would then be able to use
> >> this entrypoint to query Spark for an active Spark context, which it
> >> could use to probe the web URL.
> >>
> >> It's a minor change, but this would be my first contribution to Spark,
> >> and I want to make sure my plan was kosher before I implemented it.
> >>
> >> Thanks!
> >> Andrew
> >>
> >>
> >>
> >>
> >>
> >> [1] https://krishnan-r.github.io/sparkmonitor/
> >>
> >> [2] https://github.com/mozilla/jupyter-spark
> >>
> >
>


Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
Hi Sean,

On Tue, Aug 7, 2018 at 5:16 PM, Sean Owen  wrote:
> Is SparkSession.getActiveSession what you're looking for?

Perhaps -- though there's not a corresponding python function, and I'm
not exactly sure how to call the scala getActiveSession without first
instantiating the python version and causing a JVM to start.

Is there an easy way to call getActiveSession that doesn't start a JVM?

Cheers
Andrew

>
> On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo  wrote:
>>
>> Hello,
>>
>> One pain point with various Jupyter extensions [1][2] that provide
>> visual feedback about running spark processes is the lack of a public
>> API to introspect the web URL. The notebook server needs to know the
>> URL to find information about the current SparkContext.
>>
>> Simply looking for "localhost:4040" works most of the time, but fails
>> if multiple spark notebooks are being run on the same host -- spark
>> increments the port for each new context, leading to confusion when
>> the notebooks are trying to probe the web interface for information.
>>
>> I'd like to implement an analog to SparkContext.getOrCreate(), perhaps
>> called "getIfExists()" that returns the current singleton if it
>> exists, or None otherwise. The Jupyter code would then be able to use
>> this entrypoint to query Spark for an active Spark context, which it
>> could use to probe the web URL.
>>
>> It's a minor change, but this would be my first contribution to Spark,
>> and I want to make sure my plan was kosher before I implemented it.
>>
>> Thanks!
>> Andrew
>>
>>
>>
>>
>>
>> [1] https://krishnan-r.github.io/sparkmonitor/
>>
>> [2] https://github.com/mozilla/jupyter-spark
>>
>




Re: SparkContext singleton get w/o create?

2018-08-07 Thread Sean Owen
Is SparkSession.getActiveSession what you're looking for?

On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo  wrote:

> Hello,
>
> One pain point with various Jupyter extensions [1][2] that provide
> visual feedback about running spark processes is the lack of a public
> API to introspect the web URL. The notebook server needs to know the
> URL to find information about the current SparkContext.
>
> Simply looking for "localhost:4040" works most of the time, but fails
> if multiple spark notebooks are being run on the same host -- spark
> increments the port for each new context, leading to confusion when
> the notebooks are trying to probe the web interface for information.
>
> I'd like to implement an analog to SparkContext.getOrCreate(), perhaps
> called "getIfExists()" that returns the current singleton if it
> exists, or None otherwise. The Jupyter code would then be able to use
> this entrypoint to query Spark for an active Spark context, which it
> could use to probe the web URL.
>
> It's a minor change, but this would be my first contribution to Spark,
> and I want to make sure my plan was kosher before I implemented it.
>
> Thanks!
> Andrew
>
>
>
>
>
> [1] https://krishnan-r.github.io/sparkmonitor/
>
> [2] https://github.com/mozilla/jupyter-spark
>
>
>


SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
Hello,

One pain point with various Jupyter extensions [1][2] that provide
visual feedback about running spark processes is the lack of a public
API to introspect the web URL. The notebook server needs to know the
URL to find information about the current SparkContext.

Simply looking for "localhost:4040" works most of the time, but fails
if multiple spark notebooks are being run on the same host -- spark
increments the port for each new context, leading to confusion when
the notebooks are trying to probe the web interface for information.
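
(For illustration, roughly the kind of probing the extensions resort to today --
the port range and error handling here are just a sketch:)

```
import requests

def find_spark_ui(host="localhost", ports=range(4040, 4050)):
    """Return the first UI base URL that answers the Spark REST API.
    Brittle: with several notebooks on one host there is no way to tell
    which context a responding port actually belongs to."""
    for port in ports:
        url = "http://{}:{}".format(host, port)
        try:
            if requests.get(url + "/api/v1/applications", timeout=0.5).ok:
                return url
        except requests.RequestException:
            continue
    return None
```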

I'd like to implement an analog to SparkContext.getOrCreate(), perhaps
called "getIfExists()" that returns the current singleton if it
exists, or None otherwise. The Jupyter code would then be able to use
this entrypoint to query Spark for an active Spark context, which it
could use to probe the web URL.
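
To make that concrete, here is a rough sketch of the behaviour I'm after on
the Python side. The helper below stands in for the proposed method; the name
and exact placement are very much open to discussion:

```
from pyspark import SparkContext

def get_if_exists():
    """Stand-in for the proposed SparkContext.getIfExists(): return the
    active SparkContext singleton, or None if none has been started.
    Unlike getOrCreate(), it never creates a context, so it never
    launches a JVM as a side effect."""
    with SparkContext._lock:
        return SparkContext._active_spark_context

# What the notebook extension would then do with it:
sc = get_if_exists()
if sc is not None:
    print(sc.uiWebUrl)   # the web URL we currently have to guess at
```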

It's a minor change, but this would be my first contribution to Spark,
and I want to make sure my plan was kosher before I implemented it.

Thanks!
Andrew





[1] https://krishnan-r.github.io/sparkmonitor/

[2] https://github.com/mozilla/jupyter-spark




[build system] jenkins/github commit access exploit

2018-08-07 Thread shane knapp
TL;DR:  after seeing this pop up in my RSS feed early this morning, i
audited all of the "important" builds on our jenkins instance and
everything i found was properly masked from the outside world.

please take a moment and read this blog post:
https://medium.com/@vesirin/how-i-gained-commit-access-to-homebrew-in-30-minutes-2ae314df03ab

scary, huh?  :)

as stated in the TL;DR, i did two things:

1) using incognito browser windows, i spot checked spark release/publish
builds, as well as builds from our lab that i know have authenticated calls
to dockerhub and aws.

2) double-checked our permissions matrix for anonymous visitors to jenkins
and what they can see.

happily, i wasn't able to find any auth tokens or passwords that are
visible.  yay!

however, due to the large number of builds and people with access, i would
like to strongly remind everyone to be VERY VERY careful of how auth tokens
are passed around in builds.  there are masked 'password'-style env vars
for things like that, which are easily located in job configs.
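
as a trivial illustration (the env var name is made up), build-side scripts
should read tokens from those masked vars rather than inlining them:

```
import os

# the job config injects this as a masked env var; the literal value
# should never appear in the build script, pom, or console output.
token = os.environ.get("DOCKERHUB_TOKEN")
if not token:
    raise SystemExit("DOCKERHUB_TOKEN not set -- check the job config")
```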

we are not immune to exploits like this, so please be careful.

:)

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread John Zhuge
+1 on SPARK-25004. We have found it quite useful to diagnose PySpark OOM.

On Tue, Aug 7, 2018 at 1:21 PM Holden Karau  wrote:

> I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon),
> but solving some of the consistent Python memory issues we've had for years
> would be really amazing to get in.
>
> On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves 
> wrote:
>
>> I would like to get clarification on our avro compatibility story before
>> the release.  anyone interested please look at -
>> https://issues.apache.org/jira/browse/SPARK-24924 . I probably should
>> have filed a separate jira and can if we don't resolve via discussion there.
>>
>> Tom
>>
>> On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp <
>> skn...@berkeley.edu> wrote:
>>
>>
>> According to the status, I think we should wait a few more days. Any
>> objections?
>>
>>
>> none here.
>>
>> i'm also pretty certain that waiting until after the code freeze to start
>> testing the GHPRB on ubuntu is the wisest course of action for us.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


-- 
John


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Holden Karau
I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon),
but solving some of the consistent Python memory issues we've had for years
would be really amazing to get in.

On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves 
wrote:

> I would like to get clarification on our avro compatibility story before
> the release.  anyone interested please look at -
> https://issues.apache.org/jira/browse/SPARK-24924 . I probably should
> have filed a separate jira and can if we don't resolve via discussion there.
>
> Tom
>
> On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp <
> skn...@berkeley.edu> wrote:
>
>
> According to the status, I think we should wait a few more days. Any
> objections?
>
>
> none here.
>
> i'm also pretty certain that waiting until after the code freeze to start
> testing the GHPRB on ubuntu is the wisest course of action for us.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Tom Graves
I would like to get clarification on our avro compatibility story before the
release. anyone interested please look at
https://issues.apache.org/jira/browse/SPARK-24924 . I probably should have
filed a separate jira, and can if we don't resolve it via discussion there.

Tom

On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp wrote:

> According to the status, I think we should wait a few more days. Any
> objections?

none here.

i'm also pretty certain that waiting until after the code freeze to start
testing the GHPRB on ubuntu is the wisest course of action for us.

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-07 Thread Steve Loughran
CSV with schema inference is a full read of the data, so that could be one of
the problems. Do it at most once, print out the schema and use it from then on
during ingress, and use something else for persistence.
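
For instance, a rough pyspark sketch of that pattern (paths are made up):
infer once, persist the schema, and pass it explicitly on every later read.

```
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# One-off: pay for schema inference once, then keep the schema around.
inferred = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("data/sample.csv"))
with open("sample.schema.json", "w") as f:
    f.write(inferred.schema.json())

# Every later read: no inference pass over the data.
with open("sample.schema.json") as f:
    schema = StructType.fromJson(json.load(f))
df = spark.read.option("header", "true").schema(schema).csv("data/full/*.csv")
```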

On 6 Aug 2018, at 05:44, makatun <d.i.maka...@gmail.com> wrote:

> a. csv and parquet formats (parquet created from the same csv): .format()
> b. schema-on-read on/off: .option(inferSchema=)



[build system] bumped pull request builder job timeout to 400mins

2018-08-07 Thread shane knapp
i hate doing this, because our tests and builds take WAY too long,
but this should help get PRs through before the code freeze.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread shane knapp
>
> According to the status, I think we should wait a few more days. Any
> objections?

none here.

i'm also pretty certain that waiting until after the code freeze to start
testing the GHPRB on ubuntu is the wisest course of action for us.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Handle BlockMissingException in pyspark

2018-08-07 Thread Divay Jindal
Hey John,

Spark version : 2.3
Hadoop version : Hadoop 2.6.0-cdh5.14.2

Is there any way I can handle such an exception in the spark code itself (or,
for that matter, any other kind of exception)?
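
(For context, the kind of wrapping I have in mind is roughly this -- the
helper name is just illustrative:)

```
from py4j.protocol import Py4JJavaError

def count_with_retry(rdd, attempts=3):
    """Retry a Spark action that may hit transient HDFS read errors such
    as BlockMissingException; note this only fires after Spark has already
    exhausted its own task-level retries."""
    for i in range(attempts):
        try:
            return rdd.count()
        except Py4JJavaError as e:
            if "BlockMissingException" in str(e.java_exception) and i < attempts - 1:
                continue
            raise
```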

On Aug 7, 2018 1:19 AM, "John Zhuge"  wrote:

BlockMissingException typically indicates the HDFS file is corrupted. Might
be an HDFS issue, Hadoop mailing list is a better bet:
u...@hadoop.apache.org.

Capture the full stack trace in the executor log.
If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693`
to determine whether the block is corrupted.
If not corrupted, could there be excessive (thousands of) concurrent reads on
the block?
Hadoop version? Spark version?



On Mon, Aug 6, 2018 at 2:21 AM Divay Jindal 
wrote:

> Hi ,
>
> I am running pyspark in dockerized jupyter environment , I am constantly
> getting this error :
>
> ```
>
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 33 
> in stage 25.0 failed 1 times, most recent failure: Lost task 33.0 in stage 
> 25.0 (TID 35067, localhost, executor driver)
> : org.apache.hadoop.hdfs.BlockMissingException
> : Could not obtain block: 
> BP-1742911633-10.225.201.50-1479296658503:blk_1233169822_159765693
>
> ```
>
> Please can anyone help me with how to handle such exception in pyspark.
>
> --
> Best Regards
> *Divay Jindal*
>
>
>

-- 
John


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Wenchen Fan
Some updates for the JIRA tickets that we want to resolve before Spark 2.4.

green: merged
orange: in progress
red: likely to miss

SPARK-24374: Support Barrier Execution Mode in Apache Spark
The core functionality is finished, but we still need to add the Python API.
Tracked by SPARK-24822.

SPARK-23899: Built-in SQL Function Improvement
I think it's ready to go. Although there are still some functions in
progress, the common ones are all merged.

SPARK-14220: Build and test Spark against Scala 2.12
It's close, just one last piece. Tracked by SPARK-25029.

SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
Being reviewed.

SPARK-24882: data source v2 API improvement
PR is out, being reviewed.

SPARK-24252: Add catalog support in Data Source V2
Being reviewed.

SPARK-24768: Have a built-in AVRO data source implementation
It's close, just one last piece: the decimal type support.

SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
It turns out to be a very complicated issue; there is no consensus yet about
what the right fix is. Likely to miss Spark 2.4 because it's a long-standing
issue, not a regression.

SPARK-24598: Datatype overflow conditions gives incorrect result
We decided to keep the current behavior in Spark 2.4 and add some
documentation (already done). We will re-consider this change in Spark 3.0.

SPARK-24020: Sort-merge join inner range optimization
There are some discussions about the design; I don't think we can get to a
consensus within Spark 2.4.

SPARK-24296: replicating large blocks over 2GB
Being reviewed.

SPARK-23874: upgrade to Apache Arrow 0.10.0
Apache Arrow 0.10.0 has some critical bug fixes and is being voted on, so we
should wait a few days.


According to the status, I think we should wait a few more days. Any
objections?

Thanks,
Wenchen


On Tue, Aug 7, 2018 at 3:39 AM Sean Owen  wrote:

> ... and we still have a few snags with Scala 2.12 support at
> https://issues.apache.org/jira/browse/SPARK-25029
>
> There is some hope of resolving it on the order of a week, so for the
> moment, seems worth holding 2.4 for.
>
> On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:
>
>> Hi All,
>>
>> I'd like to request a few days extension to the code freeze to complete
>> the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes
>> several key improvements and bug fixes.  The RC vote just passed this
>> morning and code changes are complete in
>> https://github.com/apache/spark/pull/21939. We just need some time for
>> the release artifacts to be available. Thoughts?
>>
>> Thanks,
>> Bryan
>>
>


Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-07 Thread 0xF0F0F0
This (and related JIRA tickets) might shed some light on the problem

http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html


Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On August 6, 2018 2:44 PM, makatun  wrote:

> It is well known that wide tables are not the most efficient way to organize
> data. However, sometimes we have to deal with extremely wide tables
> featuring thousands of columns. For example, loading data from legacy
> systems.
>
> We have performed an investigation of how the number of columns affects the
> duration of Spark jobs.
>
> Two basic Spark (2.3.1) jobs are used for testing. The two jobs use distinct
> approaches to instantiate a DataFrame. Each reads a .csv file into a
> DataFrame and performs count. Each job is repeated with input files having
> different number of columns and the execution time is measured. 16 files
> with 100 - 20,000 columns are used. The files are generated in such a way
> that their size (rows * columns) is constant (200,000 cells, approx. 2 MB).
> This means the files with more columns have fewer rows. Each job is repeated
> 7 times for each file, in order to accumulate better statistics.
>
> The results of the measurements are shown in the figure
> job_duration_VS_number_of_columns.jpg
> http://apache-spark-developers-list.1001551.n3.nabble.com/file/t3091/job_duration_VS_number_of_columns.jpg
> Significantly different complexity of DataFrame construction is observed for
> the two approaches:
>
> 1. spark.read.format(): similar results for
>    a. csv and parquet formats (parquet created from the same csv): .format()
>    b. schema-on-read on/off: .option(inferSchema=)
>    c. provided schema loaded from file (stored schema from a previous run): .schema()
> Polynomial complexity on the number of columns is observed.
>
> // Get SparkSession
> val spark = SparkSession
>   .builder
>   .appName(s"TestSparkReadFormat${runNo}")
>   .master("local[*]")
>   .config("spark.sql.warehouse.dir", "file:///C:/temp") // on Windows
>   .config("spark.debug.maxToStringFields", 2)
>   .getOrCreate()
>
> // Read data
> val df = spark.read.format("csv")
>   .option("sep", ",")
>   .option("inferSchema", "false")
>   .option("header", "true")
>   .load(inputPath)
>
> // Count rows and columns
> val nRows = df.count()
> val nColumns = df.columns.length
> spark.stop()
> 2. spark.createDataFrame(rows, schema): where rows and schema are
>    constructed by splitting the lines of a text file.
> Linear complexity on the number of columns is observed.
>
> // Get SparkSession
> val spark = SparkSession
>   .builder
>   .appName(s"TestSparkCreateDataFrame${runNo}")
>   .master("local[*]")
>   .config("spark.sql.warehouse.dir", "file:///C:/temp") // on Windows
>   .config("spark.debug.maxToStringFields", 2)
>   .getOrCreate()
>
> // load file
> val sc = spark.sparkContext
> val lines = sc.textFile(inputPath)
>
> // create schema from headers
> val headers = lines.first
> val fs = headers.split(",").map(f => StructField(f, StringType))
> val schema = StructType(fs)
>
> // read data
> val noheaders = lines.filter(_ != headers)
> val rows = noheaders.map(_.split(",")).map(a => Row.fromSeq(a))
>
> // create DataFrame
> val df: DataFrame = spark.createDataFrame(rows, schema)
>
> // count rows and columns
> val nRows = df.count()
> val nColumns = df.columns.length
> spark.stop()
>
> A similar polynomial complexity in the total number of columns in a
> DataFrame is also observed in more complex testing jobs. Those jobs perform
> the following transformations on a fixed number of columns:
> • Filter
> • GroupBy
> • Sum
> • withColumn
>
> What could be the reason for the polynomial dependence of the job duration
> on the number of columns? What is an efficient way to address wide data
> using Spark?
>
>
> ---
>
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>


