Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread lukas nalezenec
Hi Koert,
There is no such JIRA yet. We need SPARK-23889 first. You can find some
mentions in the design document inside SPARK-23889.
Best regards
Lukas

2018-08-06 18:34 GMT+02:00 Koert Kuipers :

> i went through the jiras targeting 2.4.0 trying to find a feature where
> spark would coalesce/repartition by size (so merge small files
> automatically), but didn't find it.
> can someone point me to it?
> thank you.
> best,
> koert
>
> On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers  wrote:
>
>> lukas,
>> what is the jira ticket for this? i would like to follow its activity.
>> thanks!
>> koert
>>
>> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec 
>> wrote:
>>
>>> Hi,
>>> Yes, This feature is planned - Spark should be soon able to repartition
>>> output by size.
>>> Lukas
>>>
>>>
>>> On Wed, Jul 25, 2018, 23:26, Forest Fang wrote:
>>>
 Has there been any discussion to simply support Hive's merge small
 files configuration? It simply adds one additional stage to inspect size of
 each output file, recompute the desired parallelism to reach a target size,
 and runs a map-only coalesce before committing the final files. Since AFAIK
 SparkSQL already stages the final output commit, it seems feasible to
 respect this Hive config.

 https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
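
 As a rough sketch of that approach at the application level (the paths, the
 parquet format, the 128 MB target, and the extra read/write pass are
 assumptions here; the Hive option does this inside the engine):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Rough sketch only: paths, format, and the 128 MB target are assumptions.
val spark = SparkSession.builder.appName("SizeBasedCoalesceSketch").getOrCreate()

// 1. Inspect the size of the staged output (flat directory assumed).
val stagedPath = "/tmp/staged_output"
val fs = new Path(stagedPath).getFileSystem(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.listStatus(new Path(stagedPath)).map(_.getLen).sum

// 2. Recompute the parallelism needed to reach the target file size.
val targetFileBytes = 128L * 1024 * 1024
val numFiles = math.max(1, math.ceil(totalBytes.toDouble / targetFileBytes).toInt)

// 3. Map-only coalesce before committing the final files.
spark.read.parquet(stagedPath).coalesce(numFiles).write.parquet("/tmp/final_output")

spark.stop()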


 On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
 wrote:

> See some of the related discussion under
> https://github.com/apache/spark/pull/21589
>
> It feels to me like we need some kind of user code mechanism to signal
> policy preferences to Spark. This could also include ways to signal
> scheduling policy, which could include things like scheduling pool and/or
> barrier scheduling. Some of those scheduling policies operate at 
> inherently
> different levels currently -- e.g. scheduling pools at the Job level
> (really, the thread local level in the current implementation) and barrier
> scheduling at the Stage level -- so it is not completely obvious how to
> unify all of these policy options/preferences/mechanism, or whether it is
> possible, but I think it is worth considering such things at a fairly high
> level of abstraction and try to unify and simplify before making things
> more complex with multiple policy mechanisms.
>
> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin 
> wrote:
>
>> Seems like a good idea in general. Do other systems have similar
>> concepts? In general it'd be easier if we can follow existing convention 
>> if
>> there is any.
>>
>>
>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge 
>> wrote:
>>
>>> Hi all,
>>>
>>> Many Spark users in my company are asking for a way to control the
>>> number of output files in Spark SQL. There are use cases to either 
>>> reduce
>>> or increase the number. The users prefer not to use function
>>> *repartition*(n) or *coalesce*(n, shuffle) that require them to
>>> write and deploy Scala/Java/Python code.
>>>
>>> Could we introduce a query hint for this purpose (similar to
>>> Broadcast Join Hints)?
>>>
>>> /*+ *COALESCE*(n, shuffle) */
>>>
>>> In general, is a query hint the best way to bring DF functionality
>>> to SQL without extending SQL syntax? Any suggestion is highly 
>>> appreciated.
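
To make the comparison concrete, a small sketch with made-up table and path
names (facts, dims, /tmp/out); the broadcast hint shown for contrast exists
today, while the COALESCE hint line is only the proposal above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("CoalesceHintSketch").getOrCreate()

// The existing broadcast join hint that the proposal is modeled on:
spark.sql("SELECT /*+ BROADCAST(d) */ * FROM facts f JOIN dims d ON f.k = d.k")

// Hypothetical, if the proposed hint were added (it does not exist today):
// spark.sql("SELECT /*+ COALESCE(10) */ * FROM facts")

// What SQL-only users must do today: drop to the DataFrame API and redeploy code.
spark.table("facts").coalesce(10).write.mode("overwrite").parquet("/tmp/out")        // fewer output files
spark.table("facts").repartition(200).write.mode("overwrite").parquet("/tmp/out2")   // more output files

spark.stop()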
>>>
>>> This requirement is not the same as SPARK-6221 that asked for
>>> auto-merging output files.
>>>
>>> Thanks,
>>> John Zhuge
>>>
>>
>>
>


Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread shane knapp
job configured, build running:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/3/

on the bright(er) side, since i tested the crap out of this build on the
new ubuntu nodes, i've set this new job to run there.  :)

shane

On Mon, Aug 6, 2018 at 12:46 PM, shane knapp  wrote:

> i'll get something set up quickly by hand today, and make a TODO to get
> the job config checked in to the jenkins job builder configs later this
> week.
>
> shane
>
> On Sun, Aug 5, 2018 at 7:10 AM, Sean Owen  wrote:
>
>> Shane et al - could we get a test job in Jenkins to test the Scala 2.12
>> build? I don't think I have the access or expertise for it, though I could
>> probably copy and paste a job. I think we just need to clone the, say,
>> master Maven Hadoop 2.7 job, and add two steps: run
>> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
>> profiles that are enabled.
>>
>> I can already see two test failures for the 2.12 build right now and will
>> try to fix those, but this should help verify whether the failures are
>> 'real' and detect them going forward.
>>
>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Handle BlockMissingException in pyspark

2018-08-06 Thread John Zhuge
BlockMissingException typically indicates the HDFS file is corrupted. It might
be an HDFS issue, so the Hadoop mailing list is a better bet:
u...@hadoop.apache.org.

Capture the full stack trace in the executor log.
If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693`
to determine whether it is corrupted.
If it is not corrupted, could there be excessive (thousands of) concurrent
reads on the block?
Hadoop version? Spark version?



On Mon, Aug 6, 2018 at 2:21 AM Divay Jindal 
wrote:

> Hi ,
>
> I am running pyspark in dockerized jupyter environment , I am constantly
> getting this error :
>
> ```
>
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 33 
> in stage 25.0 failed 1 times, most recent failure: Lost task 33.0 in stage 
> 25.0 (TID 35067, localhost, executor driver)
> : org.apache.hadoop.hdfs.BlockMissingException
> : Could not obtain block: 
> BP-1742911633-10.225.201.50-1479296658503:blk_1233169822_159765693
>
> ```
>
> Please can anyone help me with how to handle such exception in pyspark.
>
> --
> Best Regards
> *Divay Jindal*
>
>
>

-- 
John


Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread shane knapp
i'll get something set up quickly by hand today, and make a TODO to get the
job config checked in to the jenkins job builder configs later this week.

shane

On Sun, Aug 5, 2018 at 7:10 AM, Sean Owen  wrote:

> Shane et al - could we get a test job in Jenkins to test the Scala 2.12
> build? I don't think I have the access or expertise for it, though I could
> probably copy and paste a job. I think we just need to clone the, say,
> master Maven Hadoop 2.7 job, and add two steps: run
> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
> profiles that are enabled.
>
> I can already see two test failures for the 2.12 build right now and will
> try to fix those, but this should help verify whether the failures are
> 'real' and detect them going forward.
>
>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Sean Owen
... and we still have a few snags with Scala 2.12 support at
https://issues.apache.org/jira/browse/SPARK-25029

There is some hope of resolving them within a week or so, so for the
moment it seems worth holding 2.4 for.

On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:

> Hi All,
>
> I'd like to request a few days extension to the code freeze to complete
> the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes
> several key improvements and bug fixes.  The RC vote just passed this
> morning and code changes are complete in
> https://github.com/apache/spark/pull/21939. We just need some time for
> the release artifacts to be available. Thoughts?
>
> Thanks,
> Bryan
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Bryan Cutler
Hi All,

I'd like to request a few days extension to the code freeze to complete the
upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several
key improvements and bug fixes.  The RC vote just passed this morning and
code changes are complete in https://github.com/apache/spark/pull/21939. We
just need some time for the release artifacts to be available. Thoughts?

Thanks,
Bryan

On Wed, Aug 1, 2018, 5:34 PM shane knapp  wrote:

> ++ssuchter (who kindly set up the initial k8s builds while i hammered on
> the backend)
>
> while i'm pretty confident (read: 99%) that the pull request builds will
> work on the new ubuntu workers:
>
> 1) i'd like to do more stress testing of other spark builds (in progress)
> 2) i'd like to reimage more centos workers before moving the PRB due to
> potential executor starvation, and my lead sysadmin is out until next monday
> 3) we will need to get rid of the ubuntu-specific k8s builds and merge
> that functionality in to the existing PRB job.  after that:  testing and
> babysitting
>
> regarding (1):  if these damn builds didn't take 4+ hours, it would be
> going a lot quicker.  ;)
> regarding (2):  adding two more ubuntu workers would make me comfortable
> WRT number of available executors, and i will guarantee that can happen by
> EOD on the 7th.
> regarding (3):  this should take about a day, and realistically the
> earliest we can get this started is the 8th.  i haven't even had a chance
> to start looking at this stuff yet, either.
>
> if we push release by a week, i think i can get things sorted w/o
> impacting the release schedule.  there will still be a bunch of stuff to
> clean up from the old centos builds (specifically docs, packaging and
> release), but i'll leave the existing and working infrastructure in place
> for now.
>
> shane
>
> On Wed, Aug 1, 2018 at 4:39 PM, Erik Erlandson 
> wrote:
>
>> The PR for SparkR support on the kube back-end is completed, but waiting
>> for Shane to make some tweaks to the CI machinery for full testing support.
>> If the code freeze is being delayed, this PR could be merged as well.
>>
>> On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:
>>
>>> FYI 6 mo is coming up soon since the last release. We will cut the
>>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>>
>>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [DISCUSS][SQL] Control the number of output files

2018-08-06 Thread Koert Kuipers
i went through the jiras targeting 2.4.0 trying to find a feature where
spark would coalesce/repartition by size (so merge small files
automatically), but didn't find it.
can someone point me to it?
thank you.
best,
koert

On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers  wrote:

> lukas,
> what is the jira ticket for this? i would like to follow its activity.
> thanks!
> koert
>
> On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec  wrote:
>
>> Hi,
>> Yes, This feature is planned - Spark should be soon able to repartition
>> output by size.
>> Lukas
>>
>>
>> On Wed, Jul 25, 2018, 23:26, Forest Fang wrote:
>>
>>> Has there been any discussion to simply support Hive's merge small files
>>> configuration? It simply adds one additional stage to inspect size of each
>>> output file, recompute the desired parallelism to reach a target size, and
>>> runs a map-only coalesce before committing the final files. Since AFAIK
>>> SparkSQL already stages the final output commit, it seems feasible to
>>> respect this Hive config.
>>>
>>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
>>>
>>>
>>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra 
>>> wrote:
>>>
 See some of the related discussion under
 https://github.com/apache/spark/pull/21589

 It feels to me like we need some kind of user code mechanism to signal
 policy preferences to Spark. This could also include ways to signal
 scheduling policy, which could include things like scheduling pool and/or
 barrier scheduling. Some of those scheduling policies operate at inherently
 different levels currently -- e.g. scheduling pools at the Job level
 (really, the thread local level in the current implementation) and barrier
 scheduling at the Stage level -- so it is not completely obvious how to
 unify all of these policy options/preferences/mechanism, or whether it is
 possible, but I think it is worth considering such things at a fairly high
 level of abstraction and try to unify and simplify before making things
 more complex with multiple policy mechanisms.

 On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin 
 wrote:

> Seems like a good idea in general. Do other systems have similar
> concepts? In general it'd be easier if we can follow existing convention 
> if
> there is any.
>
>
> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge  wrote:
>
>> Hi all,
>>
>> Many Spark users in my company are asking for a way to control the
>> number of output files in Spark SQL. There are use cases to either reduce
>> or increase the number. The users prefer not to use function
>> *repartition*(n) or *coalesce*(n, shuffle) that require them to
>> write and deploy Scala/Java/Python code.
>>
>> Could we introduce a query hint for this purpose (similar to
>> Broadcast Join Hints)?
>>
>> /*+ *COALESCE*(n, shuffle) */
>>
>> In general, is a query hint the best way to bring DF functionality
>> to SQL without extending SQL syntax? Any suggestion is highly 
>> appreciated.
>>
>> This requirement is not the same as SPARK-6221 that asked for
>> auto-merging output files.
>>
>> Thanks,
>> John Zhuge
>>
>
>


Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread antonkulaga
I have the same problem with gene expression data (the GTEx file
GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz from
gtex_analysis_v7/rna_seq_data), where I have tens of thousands of genes as
columns. No idea why Spark is so slow there.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-06 Thread makatun
It is well known that wide tables are not the most efficient way to organize
data. However, sometimes we have to deal with extremely wide tables
featuring thousands of columns. For example, loading data from legacy
systems.

*We have performed an investigation of how the number of columns affects the
duration of Spark jobs. *

Two basic Spark (2.3.1) jobs are used for testing. The two jobs use distinct
approaches to instantiate a DataFrame. Each reads a .csv file into a
DataFrame and performs a count. Each job is repeated with input files having
different numbers of columns, and the execution time is measured. 16 files
with 100 to 20,000 columns are used. The files are generated in such a way
that their size (rows * columns) is constant (200,000 cells, approx. 2 MB).
This means the files with more columns have fewer rows. Each job is repeated
7 times per file in order to accumulate better statistics.

The results of the measurements are shown in the attached figure
(job_duration_VS_number_of_columns.jpg).
Significantly different complexity of DataFrame construction is observed for
the two approaches:

*1. spark.read.format()*: similar results for
  a. csv and parquet formats (parquet created from the same csv): .format()
  b. schema-on-read on/off: .option(inferSchema=)
  c. provided schema loaded from file (stored schema from a previous run): .schema()
Polynomial complexity in the number of columns is observed.

import org.apache.spark.sql.SparkSession

// Get SparkSession
val spark = SparkSession
  .builder
  .appName(s"TestSparkReadFormat${runNo}")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp") // on Windows.
  .config("spark.debug.maxToStringFields", 2)
  .getOrCreate()

// Read data  
val df = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "false")
  .option("header", "true")
  .load(inputPath)

// Count rows and columns
val nRows = df.count()
val nColumns = df.columns.length
spark.stop()


*2. spark.createDataFrame(rows, schema)*: rows and schema are constructed by
splitting the lines of a text file.
Linear complexity in the number of columns is observed.

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Get SparkSession
val spark = SparkSession
  .builder
  .appName(s"TestSparkCreateDataFrame${runNo}")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp") // on Windows.
  .config("spark.debug.maxToStringFields", 2)
  .getOrCreate()

// load file
val sc = spark.sparkContext
val lines = sc.textFile(inputPath)

//create schema from headers
val headers = lines.first
val fs = headers.split(",").map(f => StructField(f, StringType))
val schema = StructType(fs)

// read data
val noheaders = lines.filter(_ != headers)
val rows = noheaders.map(_.split(",")).map(a => Row.fromSeq(a))

// create Data Frame
val df: DataFrame = spark.createDataFrame(rows, schema)

// count rows and columns
val nRows = df.count()
val nColumns = df.columns.length
spark.stop()

A similar polynomial dependence on the total number of columns in a
DataFrame is also observed in more complex test jobs. Those jobs perform
the following transformations on a fixed number of columns:
•   Filter
•   GroupBy
•   Sum
•   withColumn
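
For illustration, a minimal sketch of a job combining these transformations
(the column names c0, c1, c2 and the input path are assumptions, not the exact
jobs used for the measurements):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Illustrative sketch only -- column names and path are assumptions.
val spark = SparkSession.builder
  .appName("TestWideTransformsSketch")
  .master("local[*]")
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .csv("/tmp/wide_input.csv")

val result = df
  .filter(col("c0") =!= "0")                    // Filter
  .withColumn("c1c2", col("c1") + col("c2"))    // withColumn
  .groupBy(col("c0"))                           // GroupBy
  .agg(sum(col("c1c2")))                        // Sum

result.count()
spark.stop()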

What could be the reason for the polynomial dependence of the job duration
on the number of columns? *What is an efficient way to address wide data
using Spark?*



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.3.2 (RC3)

2018-08-06 Thread Saisai Shao
Yes, there'll be an RC4; we are still waiting for the fix of one issue.

Yuval Itzchakov wrote on Mon, Aug 6, 2018, at 6:10 PM:

> Are there any plans to create an RC4? There's an important Kafka Source
> leak
> fix I've merged back to the 2.3 branch.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Stavros Kontopoulos
The root cause for one case where the closure cleaner is involved is described
here: https://github.com/apache/spark/pull/22004/files#r207753682, but I am
also waiting for feedback from Lukas Rytz on why this even worked in 2.11.
If it is something that needs a fix and can be fixed, we will fix it and add
test cases for sure. I do understand the UX issue, and that is why I mentioned
it in the first place. It is my concern too. Meanwhile, adoption sometimes
requires changes: best case, only the implementation changes; worst case, the
way you use something changes as well. Not to mention that this is not the
common scenario that fails, and the user has options. I wouldn't say it is
detrimental, but anyway.
I propose we move the discussion to
https://issues.apache.org/jira/browse/SPARK-25029 as this is an umbrella
jira for this and others.
Anyway we are looking into this and also the janino thing.

Stavros

On Mon, Aug 6, 2018 at 1:18 PM, Mridul Muralidharan 
wrote:

>
> A spark user’s expectation would be that any closure which worked in 2.11
> will continue to work in 2.12 (exhibiting same behavior wrt functionality,
> serializability, etc).
> If there are behavioral changes, we will need to understand what they are
> - but the expectation would be that they are minimal (if any) source changes for
> users/libraries - requiring otherwise would be very detrimental to adoption.
>
> Do we know the root cause here ? I am not sure how well we test the
> corner cases in the cleaner; if this was not caught by the suite, perhaps we should
> augment it ...
>
> Regards
> Mridul
>
> On Mon, Aug 6, 2018 at 1:08 AM Stavros Kontopoulos  lightbend.com> wrote:
>
>> Closure cleaner's initial purpose AFAIK is to clean the dependencies
>> brought in with outer pointers (compiler's side effect). With LMFs in
>> Scala 2.12 there are no outer pointers, that is why in the new design
>> document we kept the implementation minimal focusing on the return
>> statements (it was intentional). Also the majority of the generated
>> closures AFAIK are of type LMF.
>> Regarding references in the LMF body that was not part of the doc since
>> we expect the user not to point to non-serializable objects etc.
>> In all these cases you know you are adding references you shouldn't.
>> If users were used to another UX we can try to fix it; not sure how well
>> this worked in the past though, or whether it covered all cases.
>>
>> Regards,
>> Stavros
>>
>> On Mon, Aug 6, 2018 at 8:36 AM, Mridul Muralidharan 
>> wrote:
>>
>>> I agree, we should not work around the testcase but rather understand
>>> and fix the root cause.
>>> Closure cleaner should have null'ed out the references and allowed it
>>> to be serialized.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sun, Aug 5, 2018 at 8:38 PM Wenchen Fan  wrote:
>>> >
>>> > It seems to me that the closure cleaner fails to clean up something.
>>> The failed test case defines a serializable class inside the test case, and
>>> the class doesn't refer to anything in the outer class. Ideally it can be
>>> serialized after cleaning up the closure.
>>> >
>>> > This is somehow a very weird way to define a class, so I'm not sure
>>> how serious the problem is.
>>> >
>>> > On Mon, Aug 6, 2018 at 3:41 AM Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com> wrote:
>>> >>
>>> >> Makes sense, not sure if closure cleaning is related to the last one
>>> for example or others. The last one is a bit weird, unless I am missing
>>> something about the LegacyAccumulatorWrapper logic.
>>> >>
>>> >> Stavros
>>> >>
>>> >> On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:
>>> >>>
>>> >>> Yep that's what I did. There are more failures with different
>>> resolutions. I'll open a JIRA and PR and ping you, to make sure that the
>>> changes are all reasonable, and not an artifact of missing something about
>>> closure cleaning in 2.12.
>>> >>>
>>> >>> In the meantime having a 2.12 build up and running for master will
>>> just help catch these things.
>>> >>>
>>> >>> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos <
>>> stavros.kontopou...@lightbend.com> wrote:
>>> 
>>>  Hi Sean,
>>> 
>>>  I run a quick build so the failing tests seem to be:
>>> 
>>>  - SPARK-17644: After one stage is aborted for too many failed
>>> attempts, subsequent stages still behave correctly on fetch failures ***
>>> FAILED ***
>>>    A job with one fetch failure should eventually succeed
>>> (DAGSchedulerSuite.scala:2422)
>>> 
>>> 
>>>  - LegacyAccumulatorWrapper with AccumulatorParam that has no
>>> equals/hashCode *** FAILED ***
>>>    java.io.NotSerializableException: org.scalatest.Assertions$
>>> AssertionsHelper
>>>  Serialization stack:
>>>  - object not serializable (class: 
>>>  org.scalatest.Assertions$AssertionsHelper,
>>> value: org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
>>> 
>>> 
>>>  The last one can be fixed easily if you set class `MyData(val i:
>>> Int) extends Serializable `outside of the test suite. Fo

Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Mridul Muralidharan
A Spark user's expectation would be that any closure which worked in 2.11
will continue to work in 2.12 (exhibiting the same behavior wrt functionality,
serializability, etc.).
If there are behavioral changes, we will need to understand what they are -
but the expectation would be that they require minimal (if any) source changes
for users/libraries; requiring otherwise would be very detrimental to adoption.

Do we know the root cause here? I am not sure how well we test the corner
cases in the cleaner; if this was not caught by the suite, perhaps we should
augment it ...

Regards
Mridul

On Mon, Aug 6, 2018 at 1:08 AM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Closure cleaner's initial purpose AFAIK is to clean the dependencies
> brought in with outer pointers (compiler's side effect). With LMFs in
> Scala 2.12 there are no outer pointers, that is why in the new design
> document we kept the implementation minimal focusing on the return
> statements (it was intentional). Also the majority of the generated
> closures AFAIK are of type LMF.
> Regarding references in the LMF body that was not part of the doc since we
> expect the user not to point to non-serializable objects etc.
> In all these cases you know you are adding references you shouldn't.
> If users were used to another UX we can try to fix it; not sure how well this
> worked in the past though, or whether it covered all cases.
>
> Regards,
> Stavros
>
> On Mon, Aug 6, 2018 at 8:36 AM, Mridul Muralidharan 
> wrote:
>
>> I agree, we should not work around the testcase but rather understand
>> and fix the root cause.
>> Closure cleaner should have null'ed out the references and allowed it
>> to be serialized.
>>
>> Regards,
>> Mridul
>>
>> On Sun, Aug 5, 2018 at 8:38 PM Wenchen Fan  wrote:
>> >
>> > It seems to me that the closure cleaner fails to clean up something.
>> The failed test case defines a serializable class inside the test case, and
>> the class doesn't refer to anything in the outer class. Ideally it can be
>> serialized after cleaning up the closure.
>> >
>> > This is somehow a very weird way to define a class, so I'm not sure how
>> serious the problem is.
>> >
>> > On Mon, Aug 6, 2018 at 3:41 AM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>> >>
>> >> Makes sense, not sure if closure cleaning is related to the last one
>> for example or others. The last one is a bit weird, unless I am missing
>> something about the LegacyAccumulatorWrapper logic.
>> >>
>> >> Stavros
>> >>
>> >> On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:
>> >>>
>> >>> Yep that's what I did. There are more failures with different
>> resolutions. I'll open a JIRA and PR and ping you, to make sure that the
>> changes are all reasonable, and not an artifact of missing something about
>> closure cleaning in 2.12.
>> >>>
>> >>> In the meantime having a 2.12 build up and running for master will
>> just help catch these things.
>> >>>
>> >>> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>> 
>>  Hi Sean,
>> 
>>  I run a quick build so the failing tests seem to be:
>> 
>>  - SPARK-17644: After one stage is aborted for too many failed
>> attempts, subsequent stages still behave correctly on fetch failures ***
>> FAILED ***
>>    A job with one fetch failure should eventually succeed
>> (DAGSchedulerSuite.scala:2422)
>> 
>> 
>>  - LegacyAccumulatorWrapper with AccumulatorParam that has no
>> equals/hashCode *** FAILED ***
>>    java.io.NotSerializableException:
>> org.scalatest.Assertions$AssertionsHelper
>>  Serialization stack:
>>  - object not serializable (class:
>> org.scalatest.Assertions$AssertionsHelper, value:
>> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
>> 
>> 
>>  The last one can be fixed easily if you set class `MyData(val i:
>> Int) extends Serializable `outside of the test suite. For some reason
>> outers (not removed) are capturing
>>  the Scalatest stuff in 2.12.
>> 
>>  Let me know if we see the same failures.
>> 
>>  Stavros
>> 
>>  On Sun, Aug 5, 2018 at 5:10 PM, Sean Owen  wrote:
>> >
>> > Shane et al - could we get a test job in Jenkins to test the Scala
>> 2.12 build? I don't think I have the access or expertise for it, though I
>> could probably copy and paste a job. I think we just need to clone the,
>> say, master Maven Hadoop 2.7 job, and add two steps: run
>> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
>> profiles that are enabled.
>> >
>> > I can already see two test failures for the 2.12 build right now
>> and will try to fix those, but this should help verify whether the failures
>> are 'real' and detect them going forward.
>> >
>> >
>> 
>> >>
>> >>
>> >>
>>
>
>
>
>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-08-06 Thread Yuval Itzchakov
Are there any plans to create an RC4? There's an important Kafka Source leak
fix I've merged back to the 2.3 branch.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Why is SQLImplicits an abstract class rather than a trait?

2018-08-06 Thread assaf.mendelson
The import will work for the trait but not for anyone implementing the trait. 
As for not having a master, it was just an example, the full example
contains some configurations.


Thanks, 
Assaf





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Handle BlockMissingException in pyspark

2018-08-06 Thread Divay Jindal
Hi,

I am running pyspark in a dockerized Jupyter environment, and I am constantly
getting this error:

```

Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 33 in stage 25.0 failed 1 times, most recent failure: Lost task
33.0 in stage 25.0 (TID 35067, localhost, executor driver)
: org.apache.hadoop.hdfs.BlockMissingException
: Could not obtain block:
BP-1742911633-10.225.201.50-1479296658503:blk_1233169822_159765693

```

Can anyone please help me with how to handle such an exception in pyspark?

-- 
Best Regards
*Divay Jindal*


Re: Set up Scala 2.12 test build in Jenkins

2018-08-06 Thread Stavros Kontopoulos
The closure cleaner's initial purpose AFAIK is to clean up the dependencies
brought in with outer pointers (a compiler side effect). With LMFs in Scala
2.12 there are no outer pointers, which is why in the new design document we
kept the implementation minimal, focusing on the return statements (it was
intentional). Also, the majority of the generated closures AFAIK are of type
LMF.
Regarding references in the LMF body: that was not part of the doc, since we
expect the user not to point to non-serializable objects, etc.
In all these cases you know you are adding references you shouldn't.
If users were used to another UX we can try to fix it; I am not sure how well
this worked in the past, though, or whether it covered all cases.

Regards,
Stavros
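
For concreteness, a minimal sketch of the pattern under discussion, with
made-up suite and class names; whether the local class actually retains an
outer reference to the suite depends on the compiler version, which is exactly
the 2.11 vs 2.12 difference being investigated:

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

// Sketch only: illustrates the pattern, not the actual Spark test suite.
class LocalClassCaptureSketch extends FunSuite {   // the suite itself is not Serializable

  test("local Serializable class defined inside a test body") {
    // Defined inside the test method, like the `MyData` class mentioned in this thread.
    class MyData(val i: Int) extends Serializable

    val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
    try {
      // If MyData keeps an outer pointer to the suite, serializing it for the
      // task closure drags the non-serializable suite along and this throws
      // java.io.NotSerializableException.
      val total = spark.sparkContext
        .parallelize(1 to 10)
        .map(i => new MyData(i))
        .map(_.i)
        .reduce(_ + _)
      assert(total == 55)
    } finally {
      spark.stop()
    }
  }
}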

On Mon, Aug 6, 2018 at 8:36 AM, Mridul Muralidharan 
wrote:

> I agree, we should not work around the testcase but rather understand
> and fix the root cause.
> Closure cleaner should have null'ed out the references and allowed it
> to be serialized.
>
> Regards,
> Mridul
>
> On Sun, Aug 5, 2018 at 8:38 PM Wenchen Fan  wrote:
> >
> > It seems to me that the closure cleaner fails to clean up something. The
> failed test case defines a serializable class inside the test case, and the
> class doesn't refer to anything in the outer class. Ideally it can be
> serialized after cleaning up the closure.
> >
> > This is somehow a very weird way to define a class, so I'm not sure how
> serious the problem is.
> >
> > On Mon, Aug 6, 2018 at 3:41 AM Stavros Kontopoulos  lightbend.com> wrote:
> >>
> >> Makes sense, not sure if closure cleaning is related to the last one
> for example or others. The last one is a bit weird, unless I am missing
> something about the LegacyAccumulatorWrapper logic.
> >>
> >> Stavros
> >>
> >> On Sun, Aug 5, 2018 at 10:23 PM, Sean Owen  wrote:
> >>>
> >>> Yep that's what I did. There are more failures with different
> resolutions. I'll open a JIRA and PR and ping you, to make sure that the
> changes are all reasonable, and not an artifact of missing something about
> closure cleaning in 2.12.
> >>>
> >>> In the meantime having a 2.12 build up and running for master will
> just help catch these things.
> >>>
> >>> On Sun, Aug 5, 2018 at 2:16 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
> 
>  Hi Sean,
> 
>  I run a quick build so the failing tests seem to be:
> 
>  - SPARK-17644: After one stage is aborted for too many failed
> attempts, subsequent stages still behave correctly on fetch failures ***
> FAILED ***
>    A job with one fetch failure should eventually succeed
> (DAGSchedulerSuite.scala:2422)
> 
> 
>  - LegacyAccumulatorWrapper with AccumulatorParam that has no
> equals/hashCode *** FAILED ***
>    java.io.NotSerializableException: org.scalatest.Assertions$
> AssertionsHelper
>  Serialization stack:
>  - object not serializable (class: 
>  org.scalatest.Assertions$AssertionsHelper,
> value: org.scalatest.Assertions$AssertionsHelper@3bc5fc8f)
> 
> 
>  The last one can be fixed easily if you set class `MyData(val i: Int)
> extends Serializable `outside of the test suite. For some reason outers
> (not removed) are capturing
>  the Scalatest stuff in 2.12.
> 
>  Let me know if we see the same failures.
> 
>  Stavros
> 
>  On Sun, Aug 5, 2018 at 5:10 PM, Sean Owen  wrote:
> >
> > Shane et al - could we get a test job in Jenkins to test the Scala
> 2.12 build? I don't think I have the access or expertise for it, though I
> could probably copy and paste a job. I think we just need to clone the,
> say, master Maven Hadoop 2.7 job, and add two steps: run
> "./dev/change-scala-version.sh 2.12" first, then add "-Pscala-2.12" to the
> profiles that are enabled.
> >
> > I can already see two test failures for the 2.12 build right now and
> will try to fix those, but this should help verify whether the failures are
> 'real' and detect them going forward.
> >
> >
> 
> >>
> >>
> >>
>