Re: Output Committers for S3

2017-03-28 Thread Steve Loughran

> On 28 Mar 2017, at 05:20, sririshindra  wrote:
> 
> Hi 
> 
> I have a job which saves a dataframe as parquet file to s3.
> 
> I built a jar using your repository https://github.com/rdblue/s3committer.
> 
> I added the following config to the Spark Session:
> config("spark.hadoop.spark.sql.parquet.output.committer.class",
> "com.netflix.bdp.s3.S3PartitionedOutputCommitter")
> 
> 
> I submitted the job to spark 2.0.2 as follows 
> 
> ./bin/spark-submit --master local[*] --driver-memory 4G --jars
> /home/rishi/Downloads/hadoop-aws-2.7.3.jar,/home/rishi/Downloads/aws-java-sdk-1.7.4.jar,/home/user/Documents/s3committer/build/libs/s3committer-0.5.5.jar
> --driver-library-path
> /home/user/Downloads/hadoop-aws-2.7.3.jar,/home/user/Downloads/aws-java-sdk-1.7.4.jar,/home/user/Documents/s3committer/build/libs/s3committer-0.5.5.jar
>  
> --class main.streaming.scala.backupdatatos3.backupdatatos3Processorr
> --packages
> joda-time:joda-time:2.9.7,org.mongodb.mongo-hadoop:mongo-hadoop-core:1.5.2,org.mongodb:mongo-java-driver:3.3.0
> /home/user/projects/backupjob/target/Backup-1.0-SNAPSHOT.jar


The miracle of OSS is that you have the right to fix things; the curse is that 
only you get to fix your problems, on a timescale that suits you.


> 
> 
> I am getting the following runtime exception.
> Exception in thread "main" java.lang.RuntimeException:
> java.lang.RuntimeException: class
> com.netflix.bdp.s3.S3PartitionedOutputCommitter not
> org.apache.parquet.hadoop.ParquetOutputCommitter
>at
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2227)
>at
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:81)
>at


here:

    val committerClass =
      conf.getClass(
        SQLConf.PARQUET_OUTPUT_COMMITTER_CLASS.key,
        classOf[ParquetOutputCommitter],
        classOf[ParquetOutputCommitter])


At a guess, Ryan's committer isn't a ParquetOutputCommitter. 

Workarounds:

1. Subclass ParquetOutputCommitter
2. Modify ParquetFileFormat to only look for a classOf[FileOutputCommitter]; the 
ParquetOutputCommitter doesn't do anything other than optionally add a metadata 
summary file. As that is a performance killer on S3, you should have disabled 
that option already (see the config sketch below).

#2 is the easiest; the time to rebuild Spark is the only overhead.
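
For reference, the summary file is controlled by parquet-mr's job-summary 
switch; a minimal sketch of turning it off from the Spark side, assuming the 
parquet-mr in use still honours the parquet.enable.summary-metadata key:

    // Sketch only: the "spark.hadoop." prefix pushes the key down into the
    // Hadoop Configuration that the Parquet writer/committer reads.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-no-summaries")
      .config("spark.hadoop.parquet.enable.summary-metadata", "false")
      .getOrCreate()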

HADOOP-13786 is sneaking in Ryan's work underneath things, but even there the 
ParquetFileFormat is going to have trouble. Which is odd, given that my 
integration tests did appear to be writing things. I'll take that as a sign of 
coverage problems.




> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:108)
>at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
>at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
>at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
>at
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
>at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
>at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
>at
> main.streaming.scala.backupdatatos3.backupdatatos3Processorr$.main(backupdatatos3Processorr.scala:229)
>at
> main.streaming.scala.backupdatatos3.backupdatatos3Processorr.main(backupdatatos3Processorr.scala)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>at java.lang.reflect.Method.invoke(Method.java:498)
>at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>at
> 

Re: Fwd: [SparkSQL] Project using NamedExpression

2017-03-28 Thread Liang-Chi Hsieh

I am not sure why you want to transform rows in the dataframe using
mapPartitions like that.

If you want to project the rows with some expressions, you can use an API
like selectExpr and let Spark SQL resolve the expressions. To resolve the
expressions manually, you need to (at least) deal with a resolver and
transform the expressions recursively with the LogicalPlan.resolve API.
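
As a rough sketch of the selectExpr route (the column names below are made up),
letting the analyzer resolve the expressions instead of building an
UnsafeProjection by hand:

    // Sketch only: project with SQL expression strings and let the analyzer
    // resolve them against the DataFrame's schema.
    import org.apache.spark.sql.DataFrame

    def project(df: DataFrame, selectExpressions: Seq[String]): DataFrame =
      df.selectExpr(selectExpressions: _*)

    // e.g. project(df, Seq("a", "b + 1 AS b1")); if an RDD of internal rows is
    // really needed, the resolved Dataset's queryExecution.toRdd exposes it.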


Aviral Agarwal wrote
> Hi ,
> Can you please point me to how to resolve the expression?
> I was looking into LogicalPlan.resolveExpressions(), which takes a partial
> function, but I am not sure how to use that.
> 
> Thanks,
> Aviral Agarwal
> 
> On Mar 24, 2017 09:20, "Liang-Chi Hsieh" wrote:
> 
> 
> Hi,
> 
> You need to resolve the expressions before passing into creating
> UnsafeProjection.
> 
> 
> 
> Aviral Agarwal wrote
>> Hi guys,
>>
>> I want transform Row using NamedExpression.
>>
>> Below is the code snipped that I am using :
>>
>>
>> def apply(dataFrame: DataFrame, selectExpressions:
>> java.util.List[String]): RDD[UnsafeRow] = {
>>
>> val exprArray = selectExpressions.map(s =>
>>   Column(SqlParser.parseExpression(s)).named
>> )
>>
>> val inputSchema = dataFrame.logicalPlan.output
>>
>> val transformedRDD = dataFrame.mapPartitions(
>>   iter => {
>> val project = UnsafeProjection.create(exprArray,inputSchema)
>> iter.map{
>>   row =>
>> project(InternalRow.fromSeq(row.toSeq))
>> }
>> })
>>
>> transformedRDD
>>   }
>>
>>
>> The problem is that expression becomes unevaluable :
>>
>> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate
>> expression: 'a
>> at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.
>> genCode(Expression.scala:233)
>> at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.g
>> enCode(unresolved.scala:53)
>> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfu
>> n$gen$2.apply(Expression.scala:106)
>> at org.apache.spark.sql.catalyst.expressions.Expression$$anonfu
>> n$gen$2.apply(Expression.scala:102)
>> at scala.Option.getOrElse(Option.scala:120)
>> at org.apache.spark.sql.catalyst.expressions.Expression.gen(Exp
>> ression.scala:102)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
>> text$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
>> text$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>> at scala.collection.mutable.ResizableArray$class.foreach(Resiza
>> bleArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.
>> scala:47)
>> at scala.collection.TraversableLike$class.map(TraversableLike.
>> scala:244)
>> at
>> scala.collection.AbstractTraversable.map(Traversable.scala:105)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenCon
>> text.generateExpressions(CodeGenerator.scala:464)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.createCode(GenerateUnsafeProjection.scala:281)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.create(GenerateUnsafeProjection.scala:324)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.create(GenerateUnsafeProjection.scala:317)
>> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUn
>> safeProjection$.create(GenerateUnsafeProjection.scala:32)
>> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenera
>> tor.generate(CodeGenerator.scala:635)
>> at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.
>> create(Projection.scala:125)
>> at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.
>> create(Projection.scala:135)
>> at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(
>> ScalaTransform.scala:31)
>> at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(
>> ScalaTransform.scala:30)
>> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$
>> apply$20.apply(RDD.scala:710)
>> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$
>> apply$20.apply(RDD.scala:710)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.sca
>> la:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.
>> scala:214)
>> 

Re: Outstanding Spark 2.1.1 issues

2017-03-28 Thread Asher Krim
Hey Michael,
any update on this? We're itching for a 2.1.1 release (specifically
SPARK-14804, which is currently blocking us).

Thanks,
Asher Krim
Senior Software Engineer

On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust 
wrote:

> An update: I cut the tag for RC1 last night.  Currently fighting with the
> release process.  Will post RC1 once I get it working.
>
> On Tue, Mar 21, 2017 at 2:16 PM, Nick Pentreath 
> wrote:
>
>> As for SPARK-19759 ,
>> I don't think that needs to be targeted for 2.1.1 so we don't need to worry
>> about it
>>
>>
>> On Tue, 21 Mar 2017 at 13:49 Holden Karau  wrote:
>>
>>> I agree with Michael, I think we've got some outstanding issues but none
>>> of them seem like regression from 2.1 so we should be good to start the RC
>>> process.
>>>
>>> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>> Please speak up if I'm wrong, but none of these seem like critical
>>> regressions from 2.1.  As such I'll start the RC process later today.
>>>
>>> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
>>> wrote:
>>>
>>> I'm not super sure it should be a blocker for 2.1.1 -- is it a
>>> regression? Maybe we can get TDs input on it?
>>>
>>> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:
>>>
>>> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
>>> blocker
>>>
>>> Best,
>>>
>>> Nan
>>>
>>> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung wrote:
>>>
>>> I've been scrubbing R and think we are tracking 2 issues
>>>
>>> https://issues.apache.org/jira/browse/SPARK-19237
>>>
>>> https://issues.apache.org/jira/browse/SPARK-19925
>>>
>>>
>>>
>>>
>>> --
>>> *From:* holden.ka...@gmail.com  on behalf of
>>> Holden Karau 
>>> *Sent:* Monday, March 20, 2017 3:12:35 PM
>>> *To:* dev@spark.apache.org
>>> *Subject:* Outstanding Spark 2.1.1 issues
>>>
>>> Hi Spark Developers!
>>>
>>> As we start working on the Spark 2.1.1 release I've been looking at our
>>> outstanding issues still targeted for it. I've tried to break it down by
>>> component so that people in charge of each component can take a quick look
>>> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
>>> the overall list is pretty short (only 9 items - 5 if we only look at
>>> explicitly tagged) :)
>>>
>>> If you're working on something for Spark 2.1.1 and it doesn't show up in
>>> this list please speak up now :) We have a lot of issues (including "in
>>> progress") that are listed as impacting 2.1.0, but they aren't targeted for
>>> 2.1.1 - if there is something you are working on there which should be
>>> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>>>
>>> The query string I used for looking at the 2.1.1 open issues is:
>>>
>>> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion =
>>> 2.1.1 OR cf[12310320] = "2.1.1") AND project = spark AND resolution =
>>> Unresolved ORDER BY priority DESC
>>>
>>> None of the open issues appear to be a regression from 2.1.0, but those
>>> seem more likely to show up during the RC process (thanks in advance to
>>> everyone testing their workloads :)) & generally none of them seem to be
>>>
>>> (Note: the cfs are for Target Version/s field)
>>>
>>> Critical Issues:
>>>  SQL:
>>>   SPARK-19690  - Join
>>> a streaming DataFrame with a batch DataFrame may not work - PR
>>> https://github.com/apache/spark/pull/17052 (review in progress by
>>> zsxwing, currently failing Jenkins)*
>>>
>>> Major Issues:
>>>  SQL:
>>>   SPARK-19035  - rand()
>>> function in case when cause failed - no outstanding PR (consensus on JIRA
>>> seems to be leaning towards it being a real issue but not necessarily
>>> everyone agrees just yet - maybe we should slip this?)*
>>>  Deploy:
>>>   SPARK-19522 
>>>  - --executor-memory flag doesn't work in local-cluster mode -
>>> https://github.com/apache/spark/pull/16975 (review in progress by
>>> vanzin, but PR currently stalled waiting on response) *
>>>  Core:
>>>   SPARK-20025  - Driver
>>> fail over will not work, if SPARK_LOCAL* env is set. -
>>> https://github.com/apache/spark/pull/17357 (waiting on review) *
>>>  PySpark:
>>>  SPARK-19955  -
>>> Update run-tests to support conda [ Part of Dropping 2.6 support -- which
>>> we shouldn't do in a minor release -- but also fixes pip installability
>>> tests to run in Jenkins ]-  PR failing Jenkins (I need to poke this some
>>> more, but seems like 2.7 support works but some other issues. Maybe slip to

Re: Outstanding Spark 2.1.1 issues

2017-03-28 Thread Michael Armbrust
We just fixed the build yesterday.  I'll kick off a new RC today.

On Tue, Mar 28, 2017 at 8:04 AM, Asher Krim  wrote:

> Hey Michael,
> any update on this? We're itching for a 2.1.1 release (specifically
> SPARK-14804, which is currently blocking us).
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>
> On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust 
> wrote:
>
>> An update: I cut the tag for RC1 last night.  Currently fighting with the
>> release process.  Will post RC1 once I get it working.
>>
>> On Tue, Mar 21, 2017 at 2:16 PM, Nick Pentreath wrote:
>>
>>> As for SPARK-19759 ,
>>> I don't think that needs to be targeted for 2.1.1 so we don't need to worry
>>> about it
>>>
>>>
>>> On Tue, 21 Mar 2017 at 13:49 Holden Karau  wrote:
>>>
 I agree with Michael, I think we've got some outstanding issues but
 none of them seem like regression from 2.1 so we should be good to start
 the RC process.

 On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

 Please speak up if I'm wrong, but none of these seem like critical
 regressions from 2.1.  As such I'll start the RC process later today.

 On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
 wrote:

 I'm not super sure it should be a blocker for 2.1.1 -- is it a
 regression? Maybe we can get TDs input on it?

 On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:

 I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
 blocker

 Best,

 Nan

 On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung <
 felixcheun...@hotmail.com> wrote:

 I've been scrubbing R and think we are tracking 2 issues

 https://issues.apache.org/jira/browse/SPARK-19237

 https://issues.apache.org/jira/browse/SPARK-19925




 --
 *From:* holden.ka...@gmail.com  on behalf of
 Holden Karau 
 *Sent:* Monday, March 20, 2017 3:12:35 PM
 *To:* dev@spark.apache.org
 *Subject:* Outstanding Spark 2.1.1 issues

 Hi Spark Developers!

 As we start working on the Spark 2.1.1 release I've been looking at our
 outstanding issues still targeted for it. I've tried to break it down by
 component so that people in charge of each component can take a quick look
 and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
 the overall list is pretty short (only 9 items - 5 if we only look at
 explicitly tagged) :)

 If you're working on something for Spark 2.1.1 and it doesn't show up in
 this list please speak up now :) We have a lot of issues (including "in
 progress") that are listed as impacting 2.1.0, but they aren't targeted for
 2.1.1 - if there is something you are working on there which should be
 targeted for 2.1.1 please let us know so it doesn't slip through the cracks.

 The query string I used for looking at the 2.1.1 open issues is:

 ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion =
 2.1.1 OR cf[12310320] = "2.1.1") AND project = spark AND resolution =
 Unresolved ORDER BY priority DESC

 None of the open issues appear to be a regression from 2.1.0, but those
 seem more likely to show up during the RC process (thanks in advance to
 everyone testing their workloads :)) & generally none of them seem to be

 (Note: the cfs are for Target Version/s field)

 Critical Issues:
  SQL:
   SPARK-19690  - Join
 a streaming DataFrame with a batch DataFrame may not work - PR
 https://github.com/apache/spark/pull/17052 (review in progress by
 zsxwing, currently failing Jenkins)*

 Major Issues:
  SQL:
   SPARK-19035  - rand()
 function in case when cause failed - no outstanding PR (consensus on JIRA
 seems to be leaning towards it being a real issue but not necessarily
 everyone agrees just yet - maybe we should slip this?)*
  Deploy:
   SPARK-19522 
  - --executor-memory flag doesn't work in local-cluster mode -
 https://github.com/apache/spark/pull/16975 (review in progress by
 vanzin, but PR currently stalled waiting on response) *
  Core:
   SPARK-20025  - Driver
 fail over will not work, if SPARK_LOCAL* env is set. -
 https://github.com/apache/spark/pull/17357 (waiting on review) *
  PySpark:
  SPARK-19955  -
 Update run-tests to support 

Re: Output Committers for S3

2017-03-28 Thread Ryan Blue
Steve is right that the S3 committer isn't a ParquetOutputCommitter. I
think that the reason that check exists is to make sure Parquet writes
_metadata summary files to an output directory. But I think the **summary
files are a bad idea**, so we bypass that logic and use the committer
directly if the output path is in S3.

Why are summary files a bad idea? Because they can easily get out of sync
with the real data files and cause correctness problems. There are two
reasons for using them, both optimizations to avoid reading all of the file
footers in a table. First, _metadata can be used to plan a job because it
has the row group offsets. But planning no longer reads all of the footers;
it uses regular Hadoop file splits instead. The second use is to get the
schema of a table more quickly, but this should be handled by a metastore
that tracks the latest schema. A metastore provides even faster access, a
more reliable schema, and can support schema evolution.

Even with the _metadata files, Spark has had to parallelize building a
table from Parquet files in S3 without a metastore, so I think this
requirement should be removed. In the meantime, you can probably just
build a version of the S3 committer that inherits from
ParquetOutputCommitter instead of FileOutputCommitter. That's probably the
easiest solution. Be sure you run the tests!
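
A minimal sketch of that re-parenting (the class name and constructor
pass-through below are illustrative, not the actual s3committer layout):

    // Sketch only: ParquetOutputCommitter itself extends FileOutputCommitter,
    // so extending ParquetOutputCommitter keeps the same base behaviour while
    // satisfying the getClass(..., classOf[ParquetOutputCommitter]) check in
    // ParquetFileFormat.
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.TaskAttemptContext
    import org.apache.parquet.hadoop.ParquetOutputCommitter

    class S3ParquetShimCommitter(outputPath: Path, context: TaskAttemptContext)
      extends ParquetOutputCommitter(outputPath, context) {
      // The S3 multipart-upload commit logic from the s3committer classes would
      // still need to be folded in (or the s3committer base class changed to
      // extend ParquetOutputCommitter); this subclass only satisfies the type
      // check.
    }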

rb

On Tue, Mar 28, 2017 at 3:17 AM, Steve Loughran 
wrote:

>
> > On 28 Mar 2017, at 05:20, sririshindra  wrote:
> >
> > Hi
> >
> > I have a job which saves a dataframe as parquet file to s3.
> >
> > I built a jar using your repository https://github.com/rdblue/
> s3committer.
> >
> > I added the following config to the Spark Session:
> > config("spark.hadoop.spark.sql.parquet.output.committer.class",
> > "com.netflix.bdp.s3.S3PartitionedOutputCommitter")
> >
> >
> > I submitted the job to spark 2.0.2 as follows
> >
> > ./bin/spark-submit --master local[*] --driver-memory 4G --jars
> > /home/rishi/Downloads/hadoop-aws-2.7.3.jar,/home/rishi/
> Downloads/aws-java-sdk-1.7.4.jar,/home/user/Documents/
> s3committer/build/libs/s3committer-0.5.5.jar
> > --driver-library-path
> > /home/user/Downloads/hadoop-aws-2.7.3.jar,/home/user/
> Downloads/aws-java-sdk-1.7.4.jar,/home/user/Documents/
> s3committer/build/libs/s3committer-0.5.5.jar
> > --class main.streaming.scala.backupdatatos3.backupdatatos3Processorr
> > --packages
> > joda-time:joda-time:2.9.7,org.mongodb.mongo-hadoop:mongo-
> hadoop-core:1.5.2,org.mongodb:mongo-java-driver:3.3.0
> > /home/user/projects/backupjob/target/Backup-1.0-SNAPSHOT.jar
>
>
> The miracle of OSS is that you have the right to fix things; the curse is
> that only you get to fix your problems, on a timescale that suits you.
>
>
> >
> >
> > I am getting the following runtime exception.
> > Exception in thread "main" java.lang.RuntimeException:
> > java.lang.RuntimeException: class
> > com.netflix.bdp.s3.S3PartitionedOutputCommitter not
> > org.apache.parquet.hadoop.ParquetOutputCommitter
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2227)
> >at
> > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.
> prepareWrite(ParquetFileFormat.scala:81)
> >at
>
>
> here:
>
>     val committerClass =
>       conf.getClass(
>         SQLConf.PARQUET_OUTPUT_COMMITTER_CLASS.key,
>         classOf[ParquetOutputCommitter],
>         classOf[ParquetOutputCommitter])
>
>
> At a guess, Ryan's committer isn't a ParquetOutputCommitter.
>
> Workarounds:
>
> 1. Subclass ParquetOutputCommitter
> 2. Modify ParquetFileFormat to only look for a classOf[FileOutputCommitter];
> the ParquetOutputCommitter doesn't do anything other than optionally add a
> metadata summary file. As that is a performance killer on S3, you should
> have disabled that option already.
>
> #2 is the easiest; the time to rebuild Spark is the only overhead.
>
> HADOOP-13786 is sneaking in Ryan's work underneath things, but even there
> the ParquetFileFormat is going to have trouble. Which is odd, given that my
> integration tests did appear to be writing things. I'll take that as a sign
> of coverage problems.
>
>
>
>
> > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(
> FileFormatWriter.scala:108)
> >at
> > org.apache.spark.sql.execution.datasources.
> InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationComm
> and.scala:101)
> >at
> > org.apache.spark.sql.execution.command.ExecutedCommandExec.
> sideEffectResult$lzycompute(commands.scala:58)
> >at
> > org.apache.spark.sql.execution.command.ExecutedCommandExec.
> sideEffectResult(commands.scala:56)
> >at
> > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(
> commands.scala:74)
> >at
> > org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$1.apply(SparkPlan.scala:114)
> >at
> > org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$1.apply(SparkPlan.scala:114)

Re: File JIRAs for all flaky test failures

2017-03-28 Thread Saikat Kanjilal
I'm happy to help out in this effort and will look at that label and see what 
tests I can look into and/or fix.



From: Kay Ousterhout 
Sent: Monday, March 27, 2017 9:47 PM
To: Reynold Xin
Cc: Saikat Kanjilal; Sean Owen; dev@spark.apache.org
Subject: Re: File JIRAs for all flaky test failures

Following up on this with a renewed plea to file JIRAs when you see flaky 
tests.  I did a quick skim of the PR builder, and there were 17 times in the 
last week when a flaky test led to a Jenkins failure, and someone re-ran the 
tests without filing (or updating) a JIRA (my apologies to anyone who was 
incorrectly added here):

cloud-fan (4)
gatorsmile (4)
wzhfy (4)
holdenk (1)
kunalkhamar (1)
hyukjinkwon (1)
scrapcodes (1)
srowen (1)

Are you on this list?  It's not too late to look at the test that failed and 
file (or update) the appropriate JIRA.

If you weren't convinced by my last email, here are some reasons to file a JIRA:

(0) Flaky tests are not always broken tests -- sometimes the underlying code is 
broken (e.g., SPARK-19803, SPARK-19988, SPARK-19072).

(1) Before a flaky test gets fixed, some human needs to file a JIRA.  The 
person who sees the flaky test via the PR builder is best suited to do this, 
because you already had to look at which test failed (to make sure it wasn't 
related to your change) and you know the test is flaky (because you're 
re-running it assuming it will succeed), at which point it takes <1 minute to 
file a JIRA.

(2) Related to the above, existing automation is not sufficient.  Josh's tool 
is very useful for debugging flaky tests [1] but it does not yet automatically 
file JIRAs.  Many recent flaky tests might have shown up on the nifty 
"interesting recent failures" dashboard, but they didn't get noticed until they 
were failing for more than a week and dropped from that dashboard.  One recent 
test, SPARK-19990, was causing *every* Maven build to fail for over a week, but 
was only noticed when someone filed a JIRA as a result of a flaky PR test.

(3) JIRAs result in helpful people fixing the tests!  If you're interested in 
doing this, the flaky-test label is a good place to start.  Many thanks to 
folks who have recently helped with filing and fixing flaky tests: Sital Kedia, 
Song Jun, Xiao Li, Shubham Chopra, Shixiong Zhu, Genmao Yu, Imran Rashid, and 
I'm sure many more (this list is based on a quick JIRA skim).

Thanks!!

Kay



[1] You can create a URL using the "suite_name" and optionally "test_name" GET 
parameters in Josh's app to investigate a flaky test; e.g., to see how often 
the "hive bucketing is not supported" test in ShowCreateTableSuite has been 
failing: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.ShowCreateTableSuite&test_name=hive+bucketing+is+not+supported
 (be patient -- it takes a minute and sometimes a re-load to work).


On Thu, Feb 16, 2017 at 9:22 AM, Reynold Xin wrote:
Josh's tool should give enough signal there already. I don't think we need some 
manual process to document them. If you want to work on those that'd be great. 
I bet you will get a lot of love because all developers hate flaky tests.


On Thu, Feb 16, 2017 at 6:19 PM, Saikat Kanjilal wrote:

I am specifically suggesting documenting a list of the flaky tests and fixing 
them, that's all.  To organize the effort I suggested tackling this by module.  
Your second sentence is what I was trying to gauge from the community before 
putting any more effort into this.



From: Sean Owen
Sent: Thursday, February 16, 2017 8:45 AM
To: Saikat Kanjilal; dev@spark.apache.org

Subject: Re: File JIRAs for all flaky test failures

I'm not sure what you're specifically suggesting. Of course flaky tests are bad 
and they should be fixed, and people do. Yes, some are pretty hard to fix 
because they are rarely reproducible if at all. If you want to fix, fix; 
there's nothing more to it.

I don't perceive flaky tests to be a significant problem. It has gone from bad 
to occasional over the past year in my anecdotal experience.

On Thu, Feb 16, 2017 at 4:26 PM Saikat Kanjilal wrote:

I'd just like to follow up again on this thread: should we devote some energy 
to fixing unit tests by module? There wasn't much interest in this last time, 
but given the nature of this thread I'd be