[jira] [Commented] (SPARK-16182) Utils.scala -- terminateProcess() should call Process.destroyForcibly() if and only if Process.destroy() fails
[ https://issues.apache.org/jira/browse/SPARK-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347869#comment-15347869 ] Sean Owen commented on SPARK-16182: --- OK that sounds more reasonable, go ahead. > Utils.scala -- terminateProcess() should call Process.destroyForcibly() if > and only if Process.destroy() fails > -- > > Key: SPARK-16182 > URL: https://issues.apache.org/jira/browse/SPARK-16182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: OSX El Capitan (java "1.8.0_65"), Oracle Linux 6 (java > 1.8.0_92-b14) >Reporter: Christian Chua >Priority: Critical > > Spark streaming documentation recommends application developers create static > connection pools. To clean up this pool, we add a shutdown hook. > The problem is that in spark 1.6.1, the shutdown hook for an executor will be > called only for the first submitted job. (on the second and subsequent job > submissions, the shutdown hook for the executor will NOT be invoked) > problem not seen when using java 1.7 > problem not seen when using spark 1.6.0 > looks like this bug is caused by this modification from 1.6.0 to 1.6.1: > https://issues.apache.org/jira/browse/SPARK-12486 > steps to reproduce the problem : > 1.) install spark 1.6.1 > 2.) submit this basic spark application > import org.apache.spark.{ SparkContext, SparkConf } > object MyPool { > def printToFile( f : java.io.File )( op : java.io.PrintWriter => Unit ) { > val p = new java.io.PrintWriter(f) > try { > op(p) > } > finally { > p.close() > } > } > def myfunc( ) = { > "a" > } > def createEvidence( ) = { > printToFile(new java.io.File("/var/tmp/evidence.txt")) { p => > p.println("the evidence") > } > } > sys.addShutdownHook { > createEvidence() > } > } > object BasicSpark { > def main( args : Array[String] ) = { > val sparkConf = new SparkConf().setAppName("BasicPi") > val sc = new SparkContext(sparkConf) > sc.parallelize(1 to 2).foreach { i => println("f : " + > MyPool.myfunc()) > } > sc.stop() > } > } > 3.) you will see that /var/tmp/evidence.txt is created > 4.) now delete this file > 5.) submit a second job > 6.) you will see that /var/tmp/evidence.txt is no longer created on the > second submission > 7.) if you use java 7 or spark 1.6.0, the evidence file will be created on > the second and subsequent submits -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16182) Utils.scala -- terminateProcess() should call Process.destroyForcibly() if and only if Process.destroy() fails
[ https://issues.apache.org/jira/browse/SPARK-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Chua updated SPARK-16182: --- Summary: Utils.scala -- terminateProcess() should call Process.destroyForcibly() if and only if Process.destroy() fails (was: Utils.scala -- terminateProcess() should not call Process.destroyForcibly() before Process.destroy()) > Utils.scala -- terminateProcess() should call Process.destroyForcibly() if > and only if Process.destroy() fails > -- > > Key: SPARK-16182 > URL: https://issues.apache.org/jira/browse/SPARK-16182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: OSX El Capitan (java "1.8.0_65"), Oracle Linux 6 (java > 1.8.0_92-b14) >Reporter: Christian Chua >Priority: Critical > > Spark streaming documentation recommends application developers create static > connection pools. To clean up this pool, we add a shutdown hook. > The problem is that in spark 1.6.1, the shutdown hook for an executor will be > called only for the first submitted job. (on the second and subsequent job > submissions, the shutdown hook for the executor will NOT be invoked) > problem not seen when using java 1.7 > problem not seen when using spark 1.6.0 > looks like this bug is caused by this modification from 1.6.0 to 1.6.1: > https://issues.apache.org/jira/browse/SPARK-12486 > steps to reproduce the problem : > 1.) install spark 1.6.1 > 2.) submit this basic spark application > import org.apache.spark.{ SparkContext, SparkConf } > object MyPool { > def printToFile( f : java.io.File )( op : java.io.PrintWriter => Unit ) { > val p = new java.io.PrintWriter(f) > try { > op(p) > } > finally { > p.close() > } > } > def myfunc( ) = { > "a" > } > def createEvidence( ) = { > printToFile(new java.io.File("/var/tmp/evidence.txt")) { p => > p.println("the evidence") > } > } > sys.addShutdownHook { > createEvidence() > } > } > object BasicSpark { > def main( args : Array[String] ) = { > val sparkConf = new SparkConf().setAppName("BasicPi") > val sc = new SparkContext(sparkConf) > sc.parallelize(1 to 2).foreach { i => println("f : " + > MyPool.myfunc()) > } > sc.stop() > } > } > 3.) you will see that /var/tmp/evidence.txt is created > 4.) now delete this file > 5.) submit a second job > 6.) you will see that /var/tmp/evidence.txt is no longer created on the > second submission > 7.) if you use java 7 or spark 1.6.0, the evidence file will be created on > the second and subsequent submits -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16182) Utils.scala -- terminateProcess() should not call Process.destroyForcibly() before Process.destroy()
[ https://issues.apache.org/jira/browse/SPARK-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Chua updated SPARK-16182: --- Summary: Utils.scala -- terminateProcess() should not call Process.destroyForcibly() before Process.destroy() (was: companion object's shutdownhook not being invoked) > Utils.scala -- terminateProcess() should not call Process.destroyForcibly() > before Process.destroy() > > > Key: SPARK-16182 > URL: https://issues.apache.org/jira/browse/SPARK-16182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: OSX El Capitan (java "1.8.0_65"), Oracle Linux 6 (java > 1.8.0_92-b14) >Reporter: Christian Chua >Priority: Critical > > Spark streaming documentation recommends application developers create static > connection pools. To clean up this pool, we add a shutdown hook. > The problem is that in spark 1.6.1, the shutdown hook for an executor will be > called only for the first submitted job. (on the second and subsequent job > submissions, the shutdown hook for the executor will NOT be invoked) > problem not seen when using java 1.7 > problem not seen when using spark 1.6.0 > looks like this bug is caused by this modification from 1.6.0 to 1.6.1: > https://issues.apache.org/jira/browse/SPARK-12486 > steps to reproduce the problem : > 1.) install spark 1.6.1 > 2.) submit this basic spark application > import org.apache.spark.{ SparkContext, SparkConf } > object MyPool { > def printToFile( f : java.io.File )( op : java.io.PrintWriter => Unit ) { > val p = new java.io.PrintWriter(f) > try { > op(p) > } > finally { > p.close() > } > } > def myfunc( ) = { > "a" > } > def createEvidence( ) = { > printToFile(new java.io.File("/var/tmp/evidence.txt")) { p => > p.println("the evidence") > } > } > sys.addShutdownHook { > createEvidence() > } > } > object BasicSpark { > def main( args : Array[String] ) = { > val sparkConf = new SparkConf().setAppName("BasicPi") > val sc = new SparkContext(sparkConf) > sc.parallelize(1 to 2).foreach { i => println("f : " + > MyPool.myfunc()) > } > sc.stop() > } > } > 3.) you will see that /var/tmp/evidence.txt is created > 4.) now delete this file > 5.) submit a second job > 6.) you will see that /var/tmp/evidence.txt is no longer created on the > second submission > 7.) if you use java 7 or spark 1.6.0, the evidence file will be created on > the second and subsequent submits -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16182) companion object's shutdownhook not being invoked
[ https://issues.apache.org/jira/browse/SPARK-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347856#comment-15347856 ] Christian Chua commented on SPARK-16182: The 1.6.1 code uses Process.destroyForcibly when available: https://github.com/apache/spark/blob/v1.6.1/core/src/main/scala/org/apache/spark/util/Utils.scala#L1702 What I believe should happen is that terminateProcess should call Process.destroy first, then after some timeout check whether the process is still alive. If it is still alive, then use Process.destroyForcibly. It appears that destroyForcibly is effectively calling "kill -9", so the shutdown hook is not guaranteed to run. But to answer your question, I believe this is a Spark issue -- Spark should properly manage the JVMs it spawns. > companion object's shutdownhook not being invoked > - > > Key: SPARK-16182 > URL: https://issues.apache.org/jira/browse/SPARK-16182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: OSX El Capitan (java "1.8.0_65"), Oracle Linux 6 (java > 1.8.0_92-b14) >Reporter: Christian Chua >Priority: Critical > > Spark streaming documentation recommends application developers create static > connection pools. To clean up this pool, we add a shutdown hook. > The problem is that in spark 1.6.1, the shutdown hook for an executor will be > called only for the first submitted job. (on the second and subsequent job > submissions, the shutdown hook for the executor will NOT be invoked) > problem not seen when using java 1.7 > problem not seen when using spark 1.6.0 > looks like this bug is caused by this modification from 1.6.0 to 1.6.1: > https://issues.apache.org/jira/browse/SPARK-12486 > steps to reproduce the problem : > 1.) install spark 1.6.1 > 2.) submit this basic spark application > import org.apache.spark.{ SparkContext, SparkConf } > object MyPool { > def printToFile( f : java.io.File )( op : java.io.PrintWriter => Unit ) { > val p = new java.io.PrintWriter(f) > try { > op(p) > } > finally { > p.close() > } > } > def myfunc( ) = { > "a" > } > def createEvidence( ) = { > printToFile(new java.io.File("/var/tmp/evidence.txt")) { p => > p.println("the evidence") > } > } > sys.addShutdownHook { > createEvidence() > } > } > object BasicSpark { > def main( args : Array[String] ) = { > val sparkConf = new SparkConf().setAppName("BasicPi") > val sc = new SparkContext(sparkConf) > sc.parallelize(1 to 2).foreach { i => println("f : " + > MyPool.myfunc()) > } > sc.stop() > } > } > 3.) you will see that /var/tmp/evidence.txt is created > 4.) now delete this file > 5.) submit a second job > 6.) you will see that /var/tmp/evidence.txt is no longer created on the > second submission > 7.) if you use java 7 or spark 1.6.0, the evidence file will be created on > the second and subsequent submits -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
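For illustration, a minimal Scala sketch of the destroy-then-escalate ordering proposed in the comment above. This is not Spark's actual Utils.terminateProcess, and the 10-second grace period is an assumed value:
{code}
import java.util.concurrent.TimeUnit

// Sketch: ask the child JVM to exit politely first so its shutdown hooks can
// run, and only escalate to a forcible kill (equivalent to "kill -9", which
// skips shutdown hooks) if it is still alive after a grace period.
def terminateProcess(process: Process, gracePeriodMs: Long = 10000L): Unit = {
  process.destroy()
  val exited = process.waitFor(gracePeriodMs, TimeUnit.MILLISECONDS)
  if (!exited) {
    process.destroyForcibly()
    process.waitFor()
  }
}
{code}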
[jira] [Commented] (SPARK-16184) Support SparkSession.conf API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347834#comment-15347834 ] Apache Spark commented on SPARK-16184: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/13885 > Support SparkSession.conf API in SparkR > --- > > Key: SPARK-16184 > URL: https://issues.apache.org/jira/browse/SPARK-16184 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16184) Support SparkSession.conf API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16184: Assignee: Apache Spark > Support SparkSession.conf API in SparkR > --- > > Key: SPARK-16184 > URL: https://issues.apache.org/jira/browse/SPARK-16184 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16184) Support SparkSession.conf API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16184: Assignee: (was: Apache Spark) > Support SparkSession.conf API in SparkR > --- > > Key: SPARK-16184 > URL: https://issues.apache.org/jira/browse/SPARK-16184 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16184) Support SparkSession.conf API in SparkR
Felix Cheung created SPARK-16184: Summary: Support SparkSession.conf API in SparkR Key: SPARK-16184 URL: https://issues.apache.org/jira/browse/SPARK-16184 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.0.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13709) Spark unable to decode Avro when partitioned
[ https://issues.apache.org/jira/browse/SPARK-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-13709. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13865 [https://github.com/apache/spark/pull/13865] > Spark unable to decode Avro when partitioned > > > Key: SPARK-13709 > URL: https://issues.apache.org/jira/browse/SPARK-13709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Chris Miller >Assignee: Cheng Lian > Fix For: 2.0.0 > > > There is a problem decoding Avro data with SparkSQL when partitioned. The > schema and encoded data are valid -- I'm able to decode the data with the > avro-tools CLI utility. I'm also able to decode the data with non-partitioned > SparkSQL tables, Hive, other tools as well... except partitioned SparkSQL > schemas. > For a simple example, I took the example schema and data found in the Oracle > documentation here: > *Schema* > {code:javascript} > { > "type": "record", > "name": "MemberInfo", > "namespace": "avro", > "fields": [ > {"name": "name", "type": { > "type": "record", > "name": "FullName", > "fields": [ > {"name": "first", "type": "string"}, > {"name": "last", "type": "string"} > ] > }}, > {"name": "age", "type": "int"}, > {"name": "address", "type": { > "type": "record", > "name": "Address", > "fields": [ > {"name": "street", "type": "string"}, > {"name": "city", "type": "string"}, > {"name": "state", "type": "string"}, > {"name": "zip", "type": "int"} > ] > }} > ] > } > {code} > *Data* > {code:javascript} > { > "name": { > "first": "Percival", > "last": "Lowell" > }, > "age": 156, > "address": { > "street": "Mars Hill Rd", > "city": "Flagstaff", > "state": "AZ", > "zip": 86001 > } > } > {code} > *Create* (no partitions - works) > If I create with no partitions, I'm able to query the data just fine. > {code:sql} > CREATE EXTERNAL TABLE IF NOT EXISTS foo > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > LOCATION '/path/to/data/dir' > TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc'); > {code} > *Create* (partitions -- does NOT work) > If I create with no partitions, and then manually add a partition, all of my > queries return an error. (I need to manually add partitions because I cannot > control the structure of the data directories, so dynamic partitioning is not > an option.) 
> {code:sql} > CREATE EXTERNAL TABLE IF NOT EXISTS foo > PARTITIONED BY (ds STRING) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' > TBLPROPERTIES ('avro.schema.url'='/path/to/schema.avsc'); > ALTER TABLE foo ADD PARTITION (ds='1') LOCATION '/path/to/data/dir'; > {code} > The error: > {code} > spark-sql> SELECT * FROM foo WHERE ds = '1'; > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at org.
[jira] [Reopened] (SPARK-16146) Spark application failed by Yarn preempting
[ https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-16146: --- This shouldn't be called "Fixed" vis a vis Spark > Spark application failed by Yarn preempting > --- > > Key: SPARK-16146 > URL: https://issues.apache.org/jira/browse/SPARK-16146 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 > Environment: Amazon EC2, centos 6.6, > Spark-1.6.1-bin-hadoop-2.6(binary from spark official web), Hadoop 2.7.2, > preemption and dynamic allocation enabled. >Reporter: Cong Feng > > Hi, > We are setting up our Spark cluster on amz ec2. We are using Spark Yarn > client mode, which is Spark-1.6.1-bin-hadoop-2.6(binary from spark official > web) and Hadoop 2.7.2. We also enable preemption, dynamic allocation and > spark.shuffle.service.enabled. > During our test we found our Spark application frequently get killed when the > preemption happened. Mostly seems driver trying to send rpc to executor which > has been preempted before, also there are some connect rest by peer > exceptions which also cause job failed Below are the typical exceptions we > found: > 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49 > java.io.IOException: Failed to send RPC 5721681506291542850 to > nodexx.xx..ddns.xx.com/xx.xx.xx.xx:42857: > java.nio.channels.ClosedChannelException > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239) > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567) > at > io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedChannelException > And > 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122 > 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in > connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618 > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at 
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 > requests outstanding when connection f
[jira] [Resolved] (SPARK-16146) Spark application failed by Yarn preempting
[ https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16146. --- Resolution: Not A Problem > Spark application failed by Yarn preempting > --- > > Key: SPARK-16146 > URL: https://issues.apache.org/jira/browse/SPARK-16146 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 > Environment: Amazon EC2, centos 6.6, > Spark-1.6.1-bin-hadoop-2.6(binary from spark official web), Hadoop 2.7.2, > preemption and dynamic allocation enabled. >Reporter: Cong Feng > > Hi, > We are setting up our Spark cluster on amz ec2. We are using Spark Yarn > client mode, which is Spark-1.6.1-bin-hadoop-2.6(binary from spark official > web) and Hadoop 2.7.2. We also enable preemption, dynamic allocation and > spark.shuffle.service.enabled. > During our test we found our Spark application frequently get killed when the > preemption happened. Mostly seems driver trying to send rpc to executor which > has been preempted before, also there are some connect rest by peer > exceptions which also cause job failed Below are the typical exceptions we > found: > 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49 > java.io.IOException: Failed to send RPC 5721681506291542850 to > nodexx.xx..ddns.xx.com/xx.xx.xx.xx:42857: > java.nio.channels.ClosedChannelException > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239) > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567) > at > io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedChannelException > And > 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122 > 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in > connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618 > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at 
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 > requests outstanding when connection from > nodexx-xx-xx.
[jira] [Commented] (SPARK-16182) companion object's shutdownhook not being invoked
[ https://issues.apache.org/jira/browse/SPARK-16182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347760#comment-15347760 ] Sean Owen commented on SPARK-16182: --- This isn't a Spark issue though right? Whatever is going on concerns the JVM's shutdown hook or a process not dying. > companion object's shutdownhook not being invoked > - > > Key: SPARK-16182 > URL: https://issues.apache.org/jira/browse/SPARK-16182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: OSX El Capitan (java "1.8.0_65"), Oracle Linux 6 (java > 1.8.0_92-b14) >Reporter: Christian Chua >Priority: Critical > > Spark streaming documentation recommends application developers create static > connection pools. To clean up this pool, we add a shutdown hook. > The problem is that in spark 1.6.1, the shutdown hook for an executor will be > called only for the first submitted job. (on the second and subsequent job > submissions, the shutdown hook for the executor will NOT be invoked) > problem not seen when using java 1.7 > problem not seen when using spark 1.6.0 > looks like this bug is caused by this modification from 1.6.0 to 1.6.1: > https://issues.apache.org/jira/browse/SPARK-12486 > steps to reproduce the problem : > 1.) install spark 1.6.1 > 2.) submit this basic spark application > import org.apache.spark.{ SparkContext, SparkConf } > object MyPool { > def printToFile( f : java.io.File )( op : java.io.PrintWriter => Unit ) { > val p = new java.io.PrintWriter(f) > try { > op(p) > } > finally { > p.close() > } > } > def myfunc( ) = { > "a" > } > def createEvidence( ) = { > printToFile(new java.io.File("/var/tmp/evidence.txt")) { p => > p.println("the evidence") > } > } > sys.addShutdownHook { > createEvidence() > } > } > object BasicSpark { > def main( args : Array[String] ) = { > val sparkConf = new SparkConf().setAppName("BasicPi") > val sc = new SparkContext(sparkConf) > sc.parallelize(1 to 2).foreach { i => println("f : " + > MyPool.myfunc()) > } > sc.stop() > } > } > 3.) you will see that /var/tmp/evidence.txt is created > 4.) now delete this file > 5.) submit a second job > 6.) you will see that /var/tmp/evidence.txt is no longer created on the > second submission > 7.) if you use java 7 or spark 1.6.0, the evidence file will be created on > the second and subsequent submits -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347750#comment-15347750 ] huangning commented on SPARK-5770: -- Because deleteJar and updateJar are not supported, I can't update a class without restarting the Thrift server. > Use addJar() to upload a new jar file to executor, it can't be added to > classloader > --- > > Key: SPARK-5770 > URL: https://issues.apache.org/jira/browse/SPARK-5770 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: meiyoula >Priority: Minor > > First use addJar() to upload a jar to the executor, then change the jar > content and upload it again. We can see that the jar file on local disk has > been updated, but the classloader still loads the old one. The executor log > has no error or exception pointing to it. > I used spark-shell to test it, with "spark.files.overwrite" set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
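To make the reported behavior concrete, a spark-shell sketch (the jar path is hypothetical):
{code}
// First upload: classes from the jar are loaded by the executor classloader.
sc.addJar("/tmp/myudfs.jar")

// ...rebuild /tmp/myudfs.jar with changed class definitions, then re-add...
sc.addJar("/tmp/myudfs.jar")

// With spark.files.overwrite=true the jar file on each executor's disk is
// replaced, but classes already loaded by the executor's classloader are not
// reloaded, so the old definitions keep being used until the JVM restarts.
{code}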
[jira] [Assigned] (SPARK-16179) UDF explosion yielding empty dataframe fails
[ https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-16179: -- Assignee: Davies Liu > UDF explosion yielding empty dataframe fails > > > Key: SPARK-16179 > URL: https://issues.apache.org/jira/browse/SPARK-16179 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg >Assignee: Davies Liu > > Command to replicate > https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0 > Resulting failure > https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347705#comment-15347705 ] Xiangrui Meng commented on SPARK-16000: --- We can distribute the work easily if you can help create sub-task JIRAs and list the models that require automatic conversion. Having convertMatrixColumnsToML/FromML sounds good to me. > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: yuhao yang > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
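A hedged sketch of how the conversion could look during model loading, assuming the proposed matrix helpers mirror the existing MLUtils.convertVectorColumnsToML/FromML API (the path and column names here are illustrative):
{code}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Convert old spark.mllib vector columns in a Spark 1.6 model's saved data
// to the new ml.linalg vector type while loading.
val data = spark.read.parquet("/path/to/saved/model/data")
val converted = MLUtils.convertVectorColumnsToML(data, "features")

// Proposed counterpart for matrix-typed columns (commented out because it is
// only being proposed in this thread):
// val withMatrices = MLUtils.convertMatrixColumnsToML(converted, "coefficientMatrix")
{code}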
[jira] [Updated] (SPARK-16179) UDF explosion yielding empty dataframe fails
[ https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16179: --- Affects Version/s: 2.0.0 > UDF explosion yielding empty dataframe fails > > > Key: SPARK-16179 > URL: https://issues.apache.org/jira/browse/SPARK-16179 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Vladimir Feinberg > > Command to replicate > https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0 > Resulting failure > https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16181) Incorrect behavior for isNull filter
[ https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16181: Assignee: (was: Apache Spark) > Incorrect behavior for isNull filter > > > Key: SPARK-16181 > URL: https://issues.apache.org/jira/browse/SPARK-16181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kevin Chen > > Repro: > JavaRDD<Row> leftRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x"))); > JavaRDD<Row> rightRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y"))); > StructType schema = DataTypes.createStructType(ImmutableList.of( > DataTypes.createStructField("col", DataTypes.StringType, > true))); > Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema); > Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema); > // add a column to the right > Dataset<Row> withConstantColumn = right.withColumn("new", > functions.lit(true)); > // do a left join. Nothing matches; expect Dataset joined to have a > single row ['x', null, null] > Column joinCondition = left.col("col").equalTo(right.col("col")); > Dataset<Row> joined = left.join(withConstantColumn, joinCondition, > LeftOuter.toString()); > // filter for nulls, still expect the single row ['x', null, null] > Dataset<Row> filtered = joined.filter(functions.col("new").isNull()); > // This fails with 1 != 0 > Assert.assertEquals(1, filtered.count()); > [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16181) Incorrect behavior for isNull filter
[ https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16181: Assignee: Apache Spark > Incorrect behavior for isNull filter > > > Key: SPARK-16181 > URL: https://issues.apache.org/jira/browse/SPARK-16181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kevin Chen >Assignee: Apache Spark > > Repro: > JavaRDD<Row> leftRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x"))); > JavaRDD<Row> rightRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y"))); > StructType schema = DataTypes.createStructType(ImmutableList.of( > DataTypes.createStructField("col", DataTypes.StringType, > true))); > Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema); > Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema); > // add a column to the right > Dataset<Row> withConstantColumn = right.withColumn("new", > functions.lit(true)); > // do a left join. Nothing matches; expect Dataset joined to have a > single row ['x', null, null] > Column joinCondition = left.col("col").equalTo(right.col("col")); > Dataset<Row> joined = left.join(withConstantColumn, joinCondition, > LeftOuter.toString()); > // filter for nulls, still expect the single row ['x', null, null] > Dataset<Row> filtered = joined.filter(functions.col("new").isNull()); > // This fails with 1 != 0 > Assert.assertEquals(1, filtered.count()); > [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16179) UDF explosion yielding empty dataframe fails
[ https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16179: Assignee: Apache Spark > UDF explosion yielding empty dataframe fails > > > Key: SPARK-16179 > URL: https://issues.apache.org/jira/browse/SPARK-16179 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Vladimir Feinberg >Assignee: Apache Spark > > Command to replicate > https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0 > Resulting failure > https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16181) Incorrect behavior for isNull filter
[ https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347704#comment-15347704 ] Apache Spark commented on SPARK-16181: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13884 > Incorrect behavior for isNull filter > > > Key: SPARK-16181 > URL: https://issues.apache.org/jira/browse/SPARK-16181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kevin Chen > > Repro: > JavaRDD<Row> leftRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x"))); > JavaRDD<Row> rightRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y"))); > StructType schema = DataTypes.createStructType(ImmutableList.of( > DataTypes.createStructField("col", DataTypes.StringType, > true))); > Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema); > Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema); > // add a column to the right > Dataset<Row> withConstantColumn = right.withColumn("new", > functions.lit(true)); > // do a left join. Nothing matches; expect Dataset joined to have a > single row ['x', null, null] > Column joinCondition = left.col("col").equalTo(right.col("col")); > Dataset<Row> joined = left.join(withConstantColumn, joinCondition, > LeftOuter.toString()); > // filter for nulls, still expect the single row ['x', null, null] > Dataset<Row> filtered = joined.filter(functions.col("new").isNull()); > // This fails with 1 != 0 > Assert.assertEquals(1, filtered.count()); > [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16164) CombineFilters should keep the ordering in the logical plan
[ https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347701#comment-15347701 ] Cheng Lian commented on SPARK-16164: I'm posting a summary of our offline and GitHub discussion about this issue here for future reference. The case described in this ticket can be generalized as the following query plan fragment: {noformat} Filter p1 Project ... Filter p2 {noformat} In this fragment, predicate {{p1}} is actually a partial function, which means it's not defined for all possible input values (for example, {{scala.math.log}} is only defined for positive numbers). Thus this query plan relies on the fact that {{p1}} is evaluated after {{p2}}, so that {{p2}} can reduce the range of input values for {{p1}}. However, filter push-down and {{CombineFilters}} optimize this fragment into: {noformat} Project ... Filter p1 && p2 {noformat} which forces {{p1}} to be evaluated before {{p2}}. This causes the exception because now {{p1}} may receive invalid input values that are supposed to be filtered out by {{p2}}. The problem here is that the SQL optimizer should have the freedom to rearrange filter predicate evaluation order. For example, we may want to evaluate cheap predicates first to shortcut expensive predicates. However, to enable this kind of optimization, essentially we require all filter predicates to be deterministic *total* functions, which is violated in the above case ({{p1}} is not a total function). [PR #13872|https://github.com/apache/spark/pull/13872] "fixes" the specific case mentioned in this ticket by adjusting optimization rule {{CombineFilters}}, which is safe. But in general, user applications should NOT make the assumption that filter predicates are always evaluated in the order they appear in the original query. > CombineFilters should keep the ordering in the logical plan > --- > > Key: SPARK-16164 > URL: https://issues.apache.org/jira/browse/SPARK-16164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Dongjoon Hyun > Fix For: 2.0.1, 2.1.0 > > > [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with > additional filters. It seems that during filter pushdown, we changed the > ordering in the logical plan. I'm not sure whether we should treat this as a > bug. > {code} > val df1 = (0 until 3).map(_.toString).toDF > val indexer = new StringIndexer() > .setInputCol("value") > .setOutputCol("idx") > .setHandleInvalid("skip") > .fit(df1) > val df2 = (0 until 5).map(_.toString).toDF > val predictions = indexer.transform(df2) > predictions.show() // this is okay > predictions.where('idx > 2).show() // this will throw an exception > {code} > Please see the notebook at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html > for error messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
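To make the hazard concrete, here is a small self-contained Scala sketch (data and names invented for illustration; it mirrors the {{scala.math.log}} example from the summary above) in which the second filter's predicate is partial and a reordered evaluation can surface the invalid input:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A partial predicate: only defined for positive inputs.
val positiveLog = udf { x: Double =>
  require(x > 0, s"log is undefined for $x")
  math.log(x)
}

val df = Seq(-1.0, 0.5, 2.0).toDF("x")

// The author intends p2 (x > 0) to guard p1 (positiveLog(x) > 0). If the
// optimizer combines the two filters and evaluates p1 first, p1 can see
// x = -1.0 and the require above throws at runtime.
df.filter($"x" > 0)              // p2: restricts the domain
  .filter(positiveLog($"x") > 0) // p1: relies on p2 having run first
  .show()
{code}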
[jira] [Updated] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15784: -- Target Version/s: 2.1.0 > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16133) model loading backward compatibility for ml.feature
[ https://issues.apache.org/jira/browse/SPARK-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16133: -- Assignee: yuhao yang > model loading backward compatibility for ml.feature > --- > > Key: SPARK-16133 > URL: https://issues.apache.org/jira/browse/SPARK-16133 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: yuhao yang > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16133) model loading backward compatibility for ml.feature
[ https://issues.apache.org/jira/browse/SPARK-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16133. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13844 [https://github.com/apache/spark/pull/13844] > model loading backward compatibility for ml.feature > --- > > Key: SPARK-16133 > URL: https://issues.apache.org/jira/browse/SPARK-16133 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16179) UDF explosion yielding empty dataframe fails
[ https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16179: Assignee: (was: Apache Spark) > UDF explosion yielding empty dataframe fails > > > Key: SPARK-16179 > URL: https://issues.apache.org/jira/browse/SPARK-16179 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Vladimir Feinberg > > Command to replicate > https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0 > Resulting failure > https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16179) UDF explosion yielding empty dataframe fails
[ https://issues.apache.org/jira/browse/SPARK-16179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347692#comment-15347692 ] Apache Spark commented on SPARK-16179: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/13883 > UDF explosion yielding empty dataframe fails > > > Key: SPARK-16179 > URL: https://issues.apache.org/jira/browse/SPARK-16179 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Vladimir Feinberg > > Command to replicate > https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0 > Resulting failure > https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16142) Group naive Bayes methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16142. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13877 [https://github.com/apache/spark/pull/13877] > Group naive Bayes methods in generated doc > -- > > Key: SPARK-16142 > URL: https://issues.apache.org/jira/browse/SPARK-16142 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Labels: starter > Fix For: 2.0.1, 2.1.0 > > > Follow SPARK-16107 and group the doc of spark.naiveBayes: spark.naiveBayes, > predict(NB), summary(NB), read/write.ml(NB) under Rd spark.naiveBayes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16177) model loading backward compatibility for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16177: -- Assignee: yuhao yang > model loading backward compatibility for ml.regression > -- > > Key: SPARK-16177 > URL: https://issues.apache.org/jira/browse/SPARK-16177 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16177) model loading backward compatibility for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16177. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13879 [https://github.com/apache/spark/pull/13879] > model loading backward compatibility for ml.regression > -- > > Key: SPARK-16177 > URL: https://issues.apache.org/jira/browse/SPARK-16177 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql
Matthew Porter created SPARK-16183: -- Summary: Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql Key: SPARK-16183 URL: https://issues.apache.org/jira/browse/SPARK-16183 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.6.1 Environment: Running on AWS EMR Reporter: Matthew Porter Hi, I have created a PySpark SQL-based tool which auto-generates a complex SQL command to be run via sqlContext.sql(cmd) based on a large number of parameters. As the number of input files to be filtered and joined in this query grows, so does the length of the SQL query. The tool runs fine up until about 200+ files are included in the join, at which point the SQL command becomes very long (~100K characters). It is only on these longer queries that Spark fails, throwing an exception due to what seems to be too much recursion occurring within the SparkSQL parser: {code} Traceback (most recent call last): ... merged_df = sqlsc.sql(cmd) File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql. : java.lang.StackOverflowError at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Par
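The failure mode is easy to picture: Spark 1.6's SQL parser is built on Scala's recursive parser combinators (visible in the stack trace above), so parse depth grows with query length. A hypothetical Scala sketch of a generator that provokes similar recursion -- not taken from the ticket; the clause count is an assumed value, the real threshold depends on the driver's thread stack size, and {{sqlContext}} is as provided by a 1.6 spark-shell:
{code}
// Hypothetical repro sketch: build one very long generated query, much as
// the reporter's tool does when joining 200+ files.
val generated = (1 to 5000)
  .map(i => s"SELECT $i AS id")
  .mkString(" UNION ALL ")

// Parsing this in Spark 1.6's combinator-based parser recurses once per
// clause and can throw java.lang.StackOverflowError.
sqlContext.sql(generated)
{code}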
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347576#comment-15347576 ] Yin Huai commented on SPARK-15345: -- [~m1lan] Do you use hive-site.xml to store the url to your metastore? > SparkSession's conf doesn't take effect when there's already an existing > SparkContext > - > > Key: SPARK-15345 > URL: https://issues.apache.org/jira/browse/SPARK-15345 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Piotr Milanowski >Assignee: Reynold Xin >Priority: Blocker > Fix For: 2.0.0 > > > I am working with branch-2.0, spark is compiled with hive support (-Phive and > -Phvie-thriftserver). > I am trying to access databases using this snippet: > {code} > from pyspark.sql import HiveContext > hc = HiveContext(sc) > hc.sql("show databases").collect() > [Row(result='default')] > {code} > This means that spark doesn't find any databases specified in configuration. > Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark > 1.6, and launching above snippet, I can print out existing databases. > When run in DEBUG mode this is what spark (2.0) prints out: > {code} > 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases > 16/05/16 12:17:47 DEBUG SimpleAnalyzer: > === Result of Batch Resolution === > !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, > string])) null else input[0, string].toString, > StructField(result,StringType,false)), result#2) AS #3] Project > [createexternalrow(if (isnull(result#2)) null else result#2.toString, > StructField(result,StringType,false)) AS #3] > +- LocalRelation [result#2] > > +- LocalRelation [result#2] > > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.Dataset$$anonfun$53) +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: private final > org.apache.spark.sql.types.StructType > org.apache.spark.sql.Dataset$$anonfun$53.structType$1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) > +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.execution.python.EvaluateP
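A quick way to check which catalog and metastore settings the session actually picked up, as a minimal Scala sketch. It assumes the Spark 2.0 SparkSession API and that hive-site.xml sits on the driver classpath; both are assumptions, not details confirmed in the report.
{code}
import org.apache.spark.sql.SparkSession

// build (or reuse) a Hive-enabled session and inspect the settings it resolved
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("show databases").show()

// warehouse- and catalog-related settings reveal whether hive-site.xml was honored
spark.conf.getAll
  .filter { case (k, _) => k.contains("warehouse") || k.contains("catalog") }
  .foreach(println)
{code}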
[jira] [Commented] (SPARK-16168) Spark sql can not read ORC table
[ https://issues.apache.org/jira/browse/SPARK-16168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347566#comment-15347566 ] AnfengYuan commented on SPARK-16168: Ok, to make it clear, I can reproduce it with following steps: 1. use hive(version: 1.2.1) to create an orc table {code}create table testorc(id bigint, name string) stored as orc;{code} 2. insert one line to this table {code}insert into testorc values(1, '1');{code} 3. start spark-sql shell, and desc this table, I got {code} spark-sql> desc testorc; 16/06/24 10:20:03 INFO execution.SparkSqlParser: Parsing command: desc testorc 16/06/24 10:20:03 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376 16/06/24 10:20:03 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 1 output partitions 16/06/24 10:20:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376) 16/06/24 10:20:03 INFO scheduler.DAGScheduler: Parents of final stage: List() 16/06/24 10:20:03 INFO scheduler.DAGScheduler: Missing parents: List() 16/06/24 10:20:03 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at processCmd at CliDriver.java:376), which has no missing parents 16/06/24 10:20:03 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 912.3 MB) 16/06/24 10:20:03 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 912.3 MB) 16/06/24 10:20:03 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.178.37:2874 (size: 2.5 KB, free: 912.3 MB) 16/06/24 10:20:03 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996 16/06/24 10:20:03 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at processCmd at CliDriver.java:376) 16/06/24 10:20:03 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks 16/06/24 10:20:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0, PROCESS_LOCAL, 5571 bytes) 16/06/24 10:20:03 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1) 16/06/24 10:20:03 INFO codegen.CodeGenerator: Code generated in 23.231239 ms 16/06/24 10:20:03 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1046 bytes result sent to driver 16/06/24 10:20:03 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 78 ms on localhost (1/1) 16/06/24 10:20:03 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 16/06/24 10:20:03 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 0.079 s 16/06/24 10:20:03 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 0.097624 s id bigint NULL namestring NULL Time taken: 0.28 seconds, Fetched 2 row(s) 16/06/24 10:20:03 INFO CliDriver: Time taken: 0.28 seconds, Fetched 2 row(s) {code} 4. 
query this table {code}select * from testorc;{code} then I got error {code} Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1638) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1597) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1586) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1885) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1898) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1912) at org.apache.spark.rd
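One way to narrow down whether the failure comes from the metastore schema handling or from the ORC files themselves is to bypass the Hive table path and read the files directly. A hedged sketch from a 2.0 spark-shell (where {{spark}} is predefined); the warehouse path is hypothetical and the outcome is untested here.
{code}
// read the ORC data files underneath the Hive table location directly
// (path assumes the default Hive warehouse layout; adjust to your cluster)
val direct = spark.read.orc("/user/hive/warehouse/testorc")
direct.printSchema()   // compare against what `desc testorc` reported
direct.show()
{code}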
[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347556#comment-15347556 ] Pedro Osorio edited comment on SPARK-14172 at 6/24/16 2:10 AM: --- [~lucasmf] I have been able to reproduce the issue by running the following code in a spark shell: {code} sqlContext.sql("drop table test_partition_predicate") sqlContext.sql("create table test_partition_predicate (col1 string) partitioned by (partition_col string)") sqlContext.sql("explain extended select * from test_partition_predicate where partition_col = '1' and rand() < 0.9").collect().foreach(println) {code} I have verified using community.cloud.databricks.com that in spark1.4 the physical plan uses the partition_col predicate in the HiveTableScan, but in versions 1.6 and 2.0, this only shows up in the filters section, which means the whole table is scanned. As [~jiangxb1987] pointed out, the issue is in collectProjectsAndFilters and was introduced in this patch, I believe: https://github.com/apache/spark/pull/8486 was (Author: pedrosorio): [~lucasmf] I have been able to reproduce the issue by running the following code in a spark shell: {code} sqlContext.sql("drop table test_partition_predicate") sqlContext.sql("create table test_partition_predicate (col1 string) partitioned by (partition_col string)") sqlContext.sql("explain extended select * from test_partition_predicate where partition_col = '1' and rand() < 0.9").collect().foreach(println) {code} I have verified using community.cloud.databricks.com that in spark1.4 the physical plan uses the partition_col predicate in the HiveTableScan, but in versions 1.6 and 2.0, this shows up in the filters section, which means the whole table is scanned. As [~jiangxb1987] pointed out, the issue is in collectProjectsAndFilters and was introduced in this patch, I believe: https://github.com/apache/spark/pull/8486 > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the hive sql contains nondeterministic fields, spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider following query which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The spark plan will not push down the partition predicate to HiveTableScan > which ends up scanning all partitions data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly
[ https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347556#comment-15347556 ] Pedro Osorio commented on SPARK-14172: -- [~lucasmf] I have been able to reproduce the issue by running the following code in a spark shell: {code} sqlContext.sql("drop table test_partition_predicate") sqlContext.sql("create table test_partition_predicate (col1 string) partitioned by (partition_col string)") sqlContext.sql("explain extended select * from test_partition_predicate where partition_col = '1' and rand() < 0.9").collect().foreach(println) {code} I have verified using community.cloud.databricks.com that in spark1.4 the physical plan uses the partition_col predicate in the HiveTableScan, but in versions 1.6 and 2.0, this shows up in the filters section, which means the whole table is scanned. As [~jiangxb1987] pointed out, the issue is in collectProjectsAndFilters and was introduced in this patch, I believe: https://github.com/apache/spark/pull/8486 > Hive table partition predicate not passed down correctly > > > Key: SPARK-14172 > URL: https://issues.apache.org/jira/browse/SPARK-14172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Yingji Zhang >Priority: Critical > > When the hive sql contains nondeterministic fields, spark plan will not push > down the partition predicate to the HiveTableScan. For example: > {code} > -- consider following query which uses a random function to sample rows > SELECT * > FROM table_a > WHERE partition_col = 'some_value' > AND rand() < 0.01; > {code} > The spark plan will not push down the partition predicate to HiveTableScan > which ends up scanning all partitions data from the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
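Until the predicate-splitting is fixed, one possible workaround is to express the sampling with the dedicated sample operator instead of a rand() filter, so the remaining filter on partition_col is purely deterministic and stays eligible for pruning. A hedged sketch against the 1.6 API, untested against the repro above; verify with explain that HiveTableScan actually picks the predicate up.
{code}
// deterministic partition predicate only; no nondeterministic expression in the filter
val pruned = sqlContext.sql(
  "select * from test_partition_predicate where partition_col = '1'")

// sample(withReplacement = false, fraction = 0.9) replaces the rand() < 0.9 filter
val sampled = pruned.sample(withReplacement = false, fraction = 0.9)
sampled.explain(true)
{code}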
[jira] [Resolved] (SPARK-16123) Avoid NegativeArraySizeException while reserving additional capacity in VectorizedColumnReader
[ https://issues.apache.org/jira/browse/SPARK-16123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-16123. --- Resolution: Resolved Assignee: Sameer Agarwal > Avoid NegativeArraySizeException while reserving additional capacity in > VectorizedColumnReader > -- > > Key: SPARK-16123 > URL: https://issues.apache.org/jira/browse/SPARK-16123 > Project: Spark > Issue Type: Bug >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > > Both off-heap and on-heap variants of ColumnVector.reserve() can > unfortunately overflow while reserving additional capacity during reads. > {code} > Caused by: java.lang.NegativeArraySizeException > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserveInternal(OnHeapColumnVector.java:461) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.reserve(OnHeapColumnVector.java:397) > at > org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:675) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:389) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:372) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:194) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:169) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
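The NegativeArraySizeException is the classic signature of a signed 32-bit capacity computation overflowing during growth. A minimal, self-contained sketch of the failure mode and of one guarded growth computation; this is an illustration, not Spark's actual reserve() code.
{code}
object ReserveOverflowSketch {
  def main(args: Array[String]): Unit = {
    val capacity = 1500000000          // ~1.5 billion elements already reserved
    val doubled  = capacity * 2        // Int overflow: prints -1294967296
    println(doubled)
    // new Array[Byte](doubled) here would throw java.lang.NegativeArraySizeException

    // a guarded version computes the new size in Long and clamps at Int.MaxValue
    val required = 1600000000
    val safe = math.min(Int.MaxValue.toLong,
                        math.max(required.toLong, capacity.toLong * 2)).toInt
    println(safe)                      // 2147483647
  }
}
{code}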
[jira] [Created] (SPARK-16182) companion object's shutdownhook not being invoked
Christian Chua created SPARK-16182: -- Summary: companion object's shutdownhook not being invoked Key: SPARK-16182 URL: https://issues.apache.org/jira/browse/SPARK-16182 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1 Environment: OSX El Capitan (java "1.8.0_65"), Oracle Linux 6 (java 1.8.0_92-b14) Reporter: Christian Chua Priority: Critical Spark streaming documentation recommends application developers create static connection pools. To clean up this pool, we add a shutdown hook. The problem is that in spark 1.6.1, the shutdown hook for an executor will be called only for the first submitted job. (on the second and subsequent job submissions, the shutdown hook for the executor will NOT be invoked) problem not seen when using java 1.7 problem not seen when using spark 1.6.0 looks like this bug is caused by this modification from 1.6.0 to 1.6.1: https://issues.apache.org/jira/browse/SPARK-12486 steps to reproduce the problem : 1.) install spark 1.6.1 2.) submit this basic spark application import org.apache.spark.{ SparkContext, SparkConf } object MyPool { def printToFile( f : java.io.File )( op : java.io.PrintWriter => Unit ) { val p = new java.io.PrintWriter(f) try { op(p) } finally { p.close() } } def myfunc( ) = { "a" } def createEvidence( ) = { printToFile(new java.io.File("/var/tmp/evidence.txt")) { p => p.println("the evidence") } } sys.addShutdownHook { createEvidence() } } object BasicSpark { def main( args : Array[String] ) = { val sparkConf = new SparkConf().setAppName("BasicPi") val sc = new SparkContext(sparkConf) sc.parallelize(1 to 2).foreach { i => println("f : " + MyPool.myfunc()) } sc.stop() } } 3.) you will see that /var/tmp/evidence.txt is created 4.) now delete this file 5.) submit a second job 6.) you will see that /var/tmp/evidence.txt is no longer created on the second submission 7.) if you use java 7 or spark 1.6.0, the evidence file will be created on the second and subsequent submits -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
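For reference, the behavior the updated summary asks for can be sketched with the Java 8 Process API: attempt a graceful destroy() first, which lets JVM shutdown hooks such as createEvidence() run, and escalate to destroyForcibly() only if the child is still alive after a grace period. A hedged sketch, not the actual Utils.scala code.
{code}
import java.util.concurrent.TimeUnit

// destroy() sends a polite termination request so shutdown hooks can run;
// destroyForcibly() (SIGKILL-like) is used only if the process outlives the grace period
def terminateProcess(process: Process, graceMillis: Long): Unit = {
  process.destroy()
  if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
    process.destroyForcibly()
  }
}
{code}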
[jira] [Commented] (SPARK-16181) Incorrect behavior for isNull filter
[ https://issues.apache.org/jira/browse/SPARK-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347509#comment-15347509 ] Reynold Xin commented on SPARK-16181: - cc [~cloud_fan] > Incorrect behavior for isNull filter > > > Key: SPARK-16181 > URL: https://issues.apache.org/jira/browse/SPARK-16181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Kevin Chen > > Repro: > JavaRDD<Row> leftRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x"))); > JavaRDD<Row> rightRdd = > javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y"))); > StructType schema = DataTypes.createStructType(ImmutableList.of( > DataTypes.createStructField("col", DataTypes.StringType, > true))); > Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema); > Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema); > // add a column to the right > Dataset<Row> withConstantColumn = right.withColumn("new", > functions.lit(true)); > // do a left join. Nothing matches; expect Dataset joined to have a > single row ['x', null, null] > Column joinCondition = left.col("col").equalTo(right.col("col")); > Dataset<Row> joined = left.join(withConstantColumn, joinCondition, > LeftOuter.toString()); > // filter for nulls, still expect the single row ['x', null, null] > Dataset<Row> filtered = joined.filter(functions.col("new").isNull()); > // This fails with 1 != 0 > Assert.assertEquals(1, filtered.count()); > [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347497#comment-15347497 ] Wenchen Fan commented on SPARK-15345: - hi, can you open a new JIRA for the remaining problem? thanks! > SparkSession's conf doesn't take effect when there's already an existing > SparkContext > - > > Key: SPARK-15345 > URL: https://issues.apache.org/jira/browse/SPARK-15345 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Piotr Milanowski >Assignee: Reynold Xin >Priority: Blocker > Fix For: 2.0.0 > > > I am working with branch-2.0, spark is compiled with hive support (-Phive and > -Phvie-thriftserver). > I am trying to access databases using this snippet: > {code} > from pyspark.sql import HiveContext > hc = HiveContext(sc) > hc.sql("show databases").collect() > [Row(result='default')] > {code} > This means that spark doesn't find any databases specified in configuration. > Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark > 1.6, and launching above snippet, I can print out existing databases. > When run in DEBUG mode this is what spark (2.0) prints out: > {code} > 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases > 16/05/16 12:17:47 DEBUG SimpleAnalyzer: > === Result of Batch Resolution === > !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, > string])) null else input[0, string].toString, > StructField(result,StringType,false)), result#2) AS #3] Project > [createexternalrow(if (isnull(result#2)) null else result#2.toString, > StructField(result,StringType,false)) AS #3] > +- LocalRelation [result#2] > > +- LocalRelation [result#2] > > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.Dataset$$anonfun$53) +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: private final > org.apache.spark.sql.types.StructType > org.apache.spark.sql.Dataset$$anonfun$53.structType$1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) > +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.execution.python.EvaluatePy
[jira] [Assigned] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
[ https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3723: --- Assignee: (was: Apache Spark) > DecisionTree, RandomForest: Add more instrumentation > > > Key: SPARK-3723 > URL: https://issues.apache.org/jira/browse/SPARK-3723 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Some simple instrumentation would help advanced users understand performance, > and to check whether parameters (such as maxMemoryInMB) need to be tuned. > Most important instrumentation (simple): > * min, avg, max nodes per group > * number of groups (passes over data) > More advanced instrumentation: > * For each tree (or averaged over trees), training set accuracy after > training each level. This would be useful for visualizing learning behavior > (to convince oneself that model selection was being done correctly). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
[ https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3723: --- Assignee: Apache Spark > DecisionTree, RandomForest: Add more instrumentation > > > Key: SPARK-3723 > URL: https://issues.apache.org/jira/browse/SPARK-3723 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > Some simple instrumentation would help advanced users understand performance, > and to check whether parameters (such as maxMemoryInMB) need to be tuned. > Most important instrumentation (simple): > * min, avg, max nodes per group > * number of groups (passes over data) > More advanced instrumentation: > * For each tree (or averaged over trees), training set accuracy after > training each level. This would be useful for visualizing learning behavior > (to convince oneself that model selection was being done correctly). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
[ https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347489#comment-15347489 ] Apache Spark commented on SPARK-3723: - User 'smurching' has created a pull request for this issue: https://github.com/apache/spark/pull/13881 > DecisionTree, RandomForest: Add more instrumentation > > > Key: SPARK-3723 > URL: https://issues.apache.org/jira/browse/SPARK-3723 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Some simple instrumentation would help advanced users understand performance, > and to check whether parameters (such as maxMemoryInMB) need to be tuned. > Most important instrumentation (simple): > * min, avg, max nodes per group > * number of groups (passes over data) > More advanced instrumentation: > * For each tree (or averaged over trees), training set accuracy after > training each level. This would be useful for visualizing learning behavior > (to convince oneself that model selection was being done correctly). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
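A hypothetical sketch of the simplest instrumentation requested above, assuming the node counts per group are collected into a sequence during training (the collection point itself is not shown here).
{code}
// summarize min/avg/max nodes per group and the number of passes over the data
val nodesPerGroup: Seq[Int] = Seq(12, 48, 31)   // placeholder values for illustration
if (nodesPerGroup.nonEmpty) {
  val avg = nodesPerGroup.sum.toDouble / nodesPerGroup.size
  println(s"nodes per group: min=${nodesPerGroup.min}, avg=$avg, max=${nodesPerGroup.max}")
  println(s"number of groups (passes over data): ${nodesPerGroup.size}")
}
{code}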
[jira] [Created] (SPARK-16181) Incorrect behavior for isNull filter
Kevin Chen created SPARK-16181: -- Summary: Incorrect behavior for isNull filter Key: SPARK-16181 URL: https://issues.apache.org/jira/browse/SPARK-16181 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Kevin Chen Repro: JavaRDD<Row> leftRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("x"))); JavaRDD<Row> rightRdd = javaSparkContext.parallelize(ImmutableList.of(RowFactory.create("y"))); StructType schema = DataTypes.createStructType(ImmutableList.of( DataTypes.createStructField("col", DataTypes.StringType, true))); Dataset<Row> left = sparkSession.createDataFrame(leftRdd, schema); Dataset<Row> right = sparkSession.createDataFrame(rightRdd, schema); // add a column to the right Dataset<Row> withConstantColumn = right.withColumn("new", functions.lit(true)); // do a left join. Nothing matches; expect Dataset joined to have a single row ['x', null, null] Column joinCondition = left.col("col").equalTo(right.col("col")); Dataset<Row> joined = left.join(withConstantColumn, joinCondition, LeftOuter.toString()); // filter for nulls, still expect the single row ['x', null, null] Dataset<Row> filtered = joined.filter(functions.col("new").isNull()); // This fails with 1 != 0 Assert.assertEquals(1, filtered.count()); [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
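The same expectation in a compact Scala form, assuming a SparkSession named spark with implicits imported; on a correct build the unmatched left row should survive the isNull filter.
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val left  = Seq("x").toDF("col")
val right = Seq("y").toDF("col").withColumn("new", lit(true))

// left outer join with no matches: expect one row ['x', null, null]
val joined = left.join(right, left("col") === right("col"), "left_outer")

// filtering for null should keep that single row; the report says it returns 0
assert(joined.filter(col("new").isNull).count() == 1)
{code}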
[jira] [Commented] (SPARK-16168) Spark sql can not read ORC table
[ https://issues.apache.org/jira/browse/SPARK-16168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347481#comment-15347481 ] Hyukjin Kwon commented on SPARK-16168: -- I think it would be nicer if there were some code to reproduce this. > Spark sql can not read ORC table > > > Key: SPARK-16168 > URL: https://issues.apache.org/jira/browse/SPARK-16168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: AnfengYuan > > When using spark-sql shell to query orc table, exceptions are thrown: > My table was generated by the tool in > https://github.com/hortonworks/hive-testbench > {code} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1638) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1597) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1586) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1885) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1898) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:310) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:131) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$3.apply(QueryExecution.scala:130) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:130) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:323) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:239) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.IllegalArgumentException: Field "i_item_sk" does not > exist. > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:254) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:254) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at > org.apache.spark.sql.types.StructTy
[jira] [Closed] (SPARK-16146) Spark application failed by Yarn preempting
[ https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cong Feng closed SPARK-16146. - Resolution: Fixed > Spark application failed by Yarn preempting > --- > > Key: SPARK-16146 > URL: https://issues.apache.org/jira/browse/SPARK-16146 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 > Environment: Amazon EC2, centos 6.6, > Spark-1.6.1-bin-hadoop-2.6(binary from spark official web), Hadoop 2.7.2, > preemption and dynamic allocation enabled. >Reporter: Cong Feng > > Hi, > We are setting up our Spark cluster on amz ec2. We are using Spark Yarn > client mode, which is Spark-1.6.1-bin-hadoop-2.6(binary from spark official > web) and Hadoop 2.7.2. We also enable preemption, dynamic allocation and > spark.shuffle.service.enabled. > During our test we found our Spark application frequently got killed when > preemption happened. Mostly it seems the driver was trying to send RPCs to > executors which had been preempted; there were also some connection-reset-by-peer > exceptions which likewise caused the job to fail. Below are the typical exceptions we > found: > 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49 > java.io.IOException: Failed to send RPC 5721681506291542850 to > nodexx.xx..ddns.xx.com/xx.xx.xx.xx:42857: > java.nio.channels.ClosedChannelException > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239) > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567) > at > io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedChannelException > And > 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122 > 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in > connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618 > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at 
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > 16/06/19 22:33:14 ERROR client.TransportResponseHandler: Still have 2 > requests outstanding when connection from > nodexx-xx-xx..ddns.xx
[jira] [Commented] (SPARK-16146) Spark application failed by Yarn preempting
[ https://issues.apache.org/jira/browse/SPARK-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347479#comment-15347479 ] Cong Feng commented on SPARK-16146: --- Finally it turned out to be a Zeppelin issue: we ran the same job in the Spark shell and saw exactly the same exception, but the shell seems able to handle it and keeps the rest of the tasks running to the end, while Zeppelin considers those exceptions fatal and fails the job (at least at the UI level). Thanks guys for all the help; for now I am resolving the ticket. > Spark application failed by Yarn preempting > --- > > Key: SPARK-16146 > URL: https://issues.apache.org/jira/browse/SPARK-16146 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 > Environment: Amazon EC2, centos 6.6, > Spark-1.6.1-bin-hadoop-2.6(binary from spark official web), Hadoop 2.7.2, > preemption and dynamic allocation enabled. >Reporter: Cong Feng > > Hi, > We are setting up our Spark cluster on amz ec2. We are using Spark Yarn > client mode, which is Spark-1.6.1-bin-hadoop-2.6(binary from spark official > web) and Hadoop 2.7.2. We also enable preemption, dynamic allocation and > spark.shuffle.service.enabled. > During our test we found our Spark application frequently got killed when > preemption happened. Mostly it seems the driver was trying to send RPCs to > executors which had been preempted; there were also some connection-reset-by-peer > exceptions which likewise caused the job to fail. Below are the typical exceptions we > found: > 16/06/22 08:13:30 ERROR spark.ContextCleaner: Error cleaning RDD 49 > java.io.IOException: Failed to send RPC 5721681506291542850 to > nodexx.xx..ddns.xx.com/xx.xx.xx.xx:42857: > java.nio.channels.ClosedChannelException > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239) > at > org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226) > at > io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680) > at > io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567) > at > io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:801) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:699) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1122) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:633) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:32) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:908) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:960) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.nio.channels.ClosedChannelException > And > 16/06/19 22:33:14 INFO storage.BlockManager: Removing RDD 122 > 16/06/19 22:33:14 WARN server.TransportChannelHandler: Exception in 
> connection from nodexx-xx-xx.xx.ddns.xx.com/xx.xx.xx.xx:56618 > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313) > at > io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEv
[jira] [Commented] (SPARK-16172) SQL Context's
[ https://issues.apache.org/jira/browse/SPARK-16172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347477#comment-15347477 ] Hyukjin Kwon commented on SPARK-16172: -- Do you mind updating the title to be more detailed and testing this against newer versions? > SQL Context's > -- > > Key: SPARK-16172 > URL: https://issues.apache.org/jira/browse/SPARK-16172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.2.2 >Reporter: Scott Viteri > Original Estimate: 1h > Remaining Estimate: 1h > > Read operations on unsupported json files hang instead of failing. > The jsonFile and jsonRDD methods take arbitrarily long when trying to read > multiple gzipped json files. If the RDD creation is not going to succeed, it > should fail quickly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347465#comment-15347465 ] MIN-FU YANG commented on SPARK-15516: - OK, I've reproduced the problem with the following code: {code} import sqlContext.implicits._ val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double") df1.write.parquet("data/test_table/key=1") val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).toDF("single", "triple") df2.write.parquet("data/test_table/key=2") val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table") df3.registerTempTable("df3") sqlContext.sql("CREATE TABLE new_key_value_store LOCATION '/Users/lucasmf/data/new_key_value_store' AS select * from df3"); sqlContext.sql("SELECT * FROM new_key_value_store") {code} I'll look into it. > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at 
org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
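A possible workaround while the merge failure stands: write every partition with the wider type up front, so mergeSchema never has to reconcile IntegerType with LongType. A hedged sketch built on the repro above; whether upcasting is acceptable depends on the data.
{code}
// write the "single" column as Long in both partitions (note i.toLong in df1)
val df1 = sc.makeRDD(1 to 5).map(i => (i.toLong, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")

val df2 = sc.makeRDD(6 to 10).map(i => (i * 2147483647000L, i)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")

// now both sides agree on LongType and the merge has nothing to reconcile
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
{code}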
[jira] [Issue Comment Deleted] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MIN-FU YANG updated SPARK-15516: Comment: was deleted (was: I would like to look into this issue) > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
[ https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16178: Assignee: (was: Apache Spark) > SQL - Hive writer should not require partition names to match table partitions > -- > > Key: SPARK-16178 > URL: https://issues.apache.org/jira/browse/SPARK-16178 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ryan Blue > > SPARK-14459 added a check that the {{partition}} metadata on > {{InsertIntoTable}} must match the table's partition column names. But if > {{partitionBy}} is used to set up partition columns, those columns may not be > named or the names may not match. > For example: > {code} > // Tables: > // CREATE TABLE src (id string, date int, hour int, timestamp bigint); > // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) > // PARTITIONED BY (utc_dateint int, utc_hour int); > spark.table("src").write.partitionBy("date", "hour").insertInto("dest") > {code} > The call to partitionBy correctly places the date and hour columns at the end > of the logical plan, but the names don't match the "utc_" prefix and the > write fails. But the analyzer will verify the types and insert an {{Alias}} > so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
[ https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347462#comment-15347462 ] Apache Spark commented on SPARK-16178: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/13880 > SQL - Hive writer should not require partition names to match table partitions > -- > > Key: SPARK-16178 > URL: https://issues.apache.org/jira/browse/SPARK-16178 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ryan Blue > > SPARK-14459 added a check that the {{partition}} metadata on > {{InsertIntoTable}} must match the table's partition column names. But if > {{partitionBy}} is used to set up partition columns, those columns may not be > named or the names may not match. > For example: > {code} > // Tables: > // CREATE TABLE src (id string, date int, hour int, timestamp bigint); > // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) > // PARTITIONED BY (utc_dateint int, utc_hour int); > spark.table("src").write.partitionBy("date", "hour").insertInto("dest") > {code} > The call to partitionBy correctly places the date and hour columns at the end > of the logical plan, but the names don't match the "utc_" prefix and the > write fails. But the analyzer will verify the types and insert an {{Alias}} > so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
[ https://issues.apache.org/jira/browse/SPARK-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16178: Assignee: Apache Spark > SQL - Hive writer should not require partition names to match table partitions > -- > > Key: SPARK-16178 > URL: https://issues.apache.org/jira/browse/SPARK-16178 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Ryan Blue >Assignee: Apache Spark > > SPARK-14459 added a check that the {{partition}} metadata on > {{InsertIntoTable}} must match the table's partition column names. But if > {{partitionBy}} is used to set up partition columns, those columns may not be > named or the names may not match. > For example: > {code} > // Tables: > // CREATE TABLE src (id string, date int, hour int, timestamp bigint); > // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) > // PARTITIONED BY (utc_dateint int, utc_hour int); > spark.table("src").write.partitionBy("date", "hour").insertInto("dest") > {code} > The call to partitionBy correctly places the date and hour columns at the end > of the logical plan, but the names don't match the "utc_" prefix and the > write fails. But the analyzer will verify the types and insert an {{Alias}} > so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
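Until the check is relaxed, a possible workaround is to rename the columns so they match the destination table's partition column names before writing. This is a hedged sketch against the example in the description and has not been verified against the failing check.
{code}
// align the source column names with dest's partition columns (utc_dateint, utc_hour)
spark.table("src")
  .withColumnRenamed("date", "utc_dateint")
  .withColumnRenamed("hour", "utc_hour")
  .write
  .partitionBy("utc_dateint", "utc_hour")
  .insertInto("dest")
{code}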
[jira] [Updated] (SPARK-16180) Task hang on fetching blocks (cached RDD)
[ https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16180: --- Description: Here is the stackdump of executor: {code} sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) scala.concurrent.Await$.result(package.scala:107) org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102) org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588) org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585) org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570) org.apache.spark.storage.BlockManager.get(BlockManager.scala:630) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) org.apache.spark.rdd.RDD.iterator(RDD.scala:268) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46) org.apache.spark.scheduler.Task.run(Task.scala:96) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) {code}
[jira] [Updated] (SPARK-16180) Task hang on fetching blocks (cached RDD)
[ https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16180: --- Affects Version/s: 1.6.1 > Task hang on fetching blocks (cached RDD) > - > > Key: SPARK-16180 > URL: https://issues.apache.org/jira/browse/SPARK-16180 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.1 >Reporter: Davies Liu > > {code} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:107) > org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102) > org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588) > org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585) > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585) > org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570) > org.apache.spark.storage.BlockManager.get(BlockManager.scala:630) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) > org.apache.spark.rdd.RDD.iterator(RDD.scala:268) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46) > org.apache.spark.scheduler.Task.run(Task.scala:96) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16180) Task hang on fetching blocks (cached RDD)
Davies Liu created SPARK-16180: -- Summary: Task hang on fetching blocks (cached RDD) Key: SPARK-16180 URL: https://issues.apache.org/jira/browse/SPARK-16180 Project: Spark Issue Type: Improvement Reporter: Davies Liu {code} sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) scala.concurrent.Await$.result(package.scala:107) org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102) org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588) org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585) org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570) org.apache.spark.storage.BlockManager.get(BlockManager.scala:630) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44) org.apache.spark.rdd.RDD.iterator(RDD.scala:268) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) org.apache.spark.rdd.RDD.iterator(RDD.scala:270) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46) org.apache.spark.scheduler.Task.run(Task.scala:96) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347454#comment-15347454 ] Hossein Falaki commented on SPARK-15516: Here is an example: {code} create table mytable using parquet options (path "/logs/parquet-files", mergeSchema "true") {code} > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
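For anyone trying to reproduce this outside Databricks, here is a minimal spark-shell sketch (the paths are hypothetical and a Spark 2.0 `spark` session is assumed) that writes two Parquet directories whose only column disagrees on integer width, then reads them back with schema merging:
{code}
import org.apache.spark.sql.functions.col

// One directory stores `id` as IntegerType, the other as LongType.
spark.range(10).select(col("id").cast("int").as("id"))
  .write.parquet("/tmp/merge-test/a")
spark.range(10).select(col("id").cast("long").as("id"))
  .write.parquet("/tmp/merge-test/b")

// Merging the two schemas should fail with
// "Failed to merge incompatible data types LongType and IntegerType".
spark.read.option("mergeSchema", "true")
  .parquet("/tmp/merge-test/a", "/tmp/merge-test/b")
{code}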
[jira] [Created] (SPARK-16179) UDF explosion yielding empty dataframe fails
Vladimir Feinberg created SPARK-16179: - Summary: UDF explosion yielding empty dataframe fails Key: SPARK-16179 URL: https://issues.apache.org/jira/browse/SPARK-16179 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Vladimir Feinberg Command to replicate https://gist.github.com/vlad17/cff2bab81929f44556a364ee90981ac0 Resulting failure https://gist.github.com/vlad17/964c0a93510d79cb130c33700f6139b7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335165#comment-15335165 ] Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM: --- OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent of total time spent in impurityCalculator || percent of total time spent in impurityStats || |1 | 1 | 500 | 7.90 | 0.328% | 0.01% |10 | 1 | 500 | 7.67 | 1.3% | 0.12% |100 | 1 | 500 | 18.156 | 5.19% | 0.29% |1 | 500 | 1 | 7.1308 | 0.39% | 0.014% |10 | 500 | 1 | 7.5506 | 1.37% | 0.12% |100 | 500 | 1 | 17.61| 6.18% | 0.349% |1 | 1000 | 1000 | 6.99 | 0.28% | 0.029% |10 | 1000 | 1000 | 7.415 | 1.7% | 0.09% |100 | 1000 | 1000 | 17.89 | 6.1% | 0.3% |500 | 1000 | 1000 | 71.02 | 6.8% | 0.3% was (Author: mechcoder): OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent of total time spent in impurityCalculator || percent of total time spent in impurityStats || |1 | 1 | 500 | 7.90 | 0.328% | 0.01% |10 | 1 | 500 | 7.67 | 1.3% | 0.12% |100 | 1 | 500 | 18.156 | 5.19% | 0.29% 1 | 500 | 1 | 7.1308 | 0.39% | 0.014% |10 | 500 | 1 | 7.5506 | 1.37% | 0.12% |100 | 500 | 1 | 17.61| 6.18% | 0.349% |1 | 1000 | 1000 | 6.99 | 0.28% | 0.029% |10 | 1000 | 1000 | 7.415 | 1.7% | 0.09% |100 | 1000 | 1000 | 17.89 | 6.1% | 0.3% |500 | 1000 | 1000 | 71.02 | 6.8% | 0.3% > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14351) Optimize ImpurityAggregator for decision trees
[ https://issues.apache.org/jira/browse/SPARK-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335165#comment-15335165 ] Manoj Kumar edited comment on SPARK-14351 at 6/23/16 11:43 PM: --- OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent of total time spent in impurityCalculator || percent of total time spent in impurityStats || |1 | 1 | 500 | 7.90 | 0.328% | 0.01% |10 | 1 | 500 | 7.67 | 1.3% | 0.12% |100 | 1 | 500 | 18.156 | 5.19% | 0.29% 1 | 500 | 1 | 7.1308 | 0.39% | 0.014% |10 | 500 | 1 | 7.5506 | 1.37% | 0.12% |100 | 500 | 1 | 17.61| 6.18% | 0.349% |1 | 1000 | 1000 | 6.99 | 0.28% | 0.029% |10 | 1000 | 1000 | 7.415 | 1.7% | 0.09% |100 | 1000 | 1000 | 17.89 | 6.1% | 0.3% |500 | 1000 | 1000 | 71.02 | 6.8% | 0.3% was (Author: mechcoder): OK, so here are some benchmarks that validate your claims partially (All trained to maxDepth=30 and the auto feature selection strategy). The trend is that as the number of trees increase, it seems to have a higher impact. I'll see what I can optimize tomorrow. || n_tree || n_samples || n_features || totalTime || percent in binsToBestSplit || percent in impurityCalculator || percent in impurityStatsTime || |1 | 1 | 500 | 2 | 19.5% | 15% | 0.1% |10 | 1 | 500 | 2.45 | 13% | 8.5%| 0.7% |100 | 1 | 500 | 4.48 | 64.5% | 41.5% | 2.1% |500 | 1 | 500 | 15.2 | 89.6% | 61.1% | 3.4% |1 | 500 | 1 | 2.16 | 18.5% | 16.2% | ~ |10 | 500 | 1 | 2.70 | 14.8% | 11.1%| 0.4% |100 | 500 | 1 | 9.07 | 43.5% | 31.4% | 1.9% |1 | 1000 | 1000 | 2.02 | 24.7% | 14.8% | 0.2% |10 | 1000 | 1000 | 6.2 | 12.8% | 9.6%| 0.1% |50 | 1000 | 1000 | 4.05 | 38.5% | 28.8% | 2.8% |100 | 1000 | 1000 | 10.19 | 45.3% | 30.6% | 3.18% > Optimize ImpurityAggregator for decision trees > -- > > Key: SPARK-14351 > URL: https://issues.apache.org/jira/browse/SPARK-14351 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > {{RandomForest.binsToBestSplit}} currently takes a large amount of time. > Based on some quick profiling, I believe a big chunk of this is spent in > {{ImpurityAggregator.getCalculator}} (which seems to make unnecessary Array > copies) and {{RandomForest.calculateImpurityStats}}. > This JIRA is for: > * Doing more profiling to confirm that unnecessary time is being spent in > some of these methods. > * Optimizing the implementation > * Profiling again to confirm the speedups > Local profiling for large enough examples should suffice, especially since > the optimizations should not need to change the amount of data communicated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16178) SQL - Hive writer should not require partition names to match table partitions
Ryan Blue created SPARK-16178: - Summary: SQL - Hive writer should not require partition names to match table partitions Key: SPARK-16178 URL: https://issues.apache.org/jira/browse/SPARK-16178 Project: Spark Issue Type: Sub-task Reporter: Ryan Blue SPARK-14459 added a check that the {{partition}} metadata on {{InsertIntoTable}} must match the table's partition column names. But if {{partitionBy}} is used to set up partition columns, those columns may not be named or the names may not match. For example: {code} // Tables: // CREATE TABLE src (id string, date int, hour int, timestamp bigint); // CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) // PARTITIONED BY (utc_dateint int, utc_hour int); spark.table("src").write.partitionBy("date", "hour").insertInto("dest") {code} The call to partitionBy correctly places the date and hour columns at the end of the logical plan, but the names don't match the "utc_" prefix and the write fails. But the analyzer will verify the types and insert an {{Alias}} so the query is actually valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
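Until the check is relaxed, a workaround consistent with the description is to make the names line up before the positional insert. A sketch using the tables above (the `lit(null)` placeholders for c1/c2 are illustrative only, since `src` has no such columns; the `$` syntax assumes spark-shell or `import spark.implicits._`):
{code}
import org.apache.spark.sql.functions.lit

spark.table("src")
  .select($"id", $"timestamp",
    lit(null).cast("string").as("c1"),   // placeholder, not in the report
    lit(null).cast("int").as("c2"),      // placeholder, not in the report
    $"date".as("utc_dateint"),
    $"hour".as("utc_hour"))
  .write.insertInto("dest")
{code}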
[jira] [Commented] (SPARK-15516) Schema merging in driver fails for parquet when merging LongType and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-15516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347428#comment-15347428 ] MIN-FU YANG commented on SPARK-15516: - Do you have any sample code? > Schema merging in driver fails for parquet when merging LongType and > IntegerType > > > Key: SPARK-15516 > URL: https://issues.apache.org/jira/browse/SPARK-15516 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Databricks >Reporter: Hossein Falaki > > I tried to create a table from partitioned parquet directories that requires > schema merging. I get following error: > {code} > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:831) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:826) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:826) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24.apply(ParquetRelation.scala:801) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:756) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Failed to merge incompatible data > types LongType and IntegerType > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:462) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:420) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$3.apply(StructType.scala:418) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:418) > at > org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:415) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.sql.types.StructType$.merge(StructType.scala:415) > at org.apache.spark.sql.types.StructType.merge(StructType.scala:333) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$24$$anonfun$apply$9.apply(ParquetRelation.scala:829) > {code} > cc @rxin and [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16165) Fix the update logic for InMemoryTableScanExec.readBatches accumulator
[ https://issues.apache.org/jira/browse/SPARK-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-16165. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13870 [https://github.com/apache/spark/pull/13870] > Fix the update logic for InMemoryTableScanExec.readBatches accumulator > -- > > Key: SPARK-16165 > URL: https://issues.apache.org/jira/browse/SPARK-16165 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated > only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. > Although this metric is used only for testing purposes, we should still > report a correct metric regardless of SQL options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16165) Fix the update logic for InMemoryTableScanExec.readBatches accumulator
[ https://issues.apache.org/jira/browse/SPARK-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16165: --- Assignee: Dongjoon Hyun > Fix the update logic for InMemoryTableScanExec.readBatches accumulator > -- > > Key: SPARK-16165 > URL: https://issues.apache.org/jira/browse/SPARK-16165 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > Currently, the `readBatches` accumulator of `InMemoryTableScanExec` is updated > only when `spark.sql.inMemoryColumnarStorage.partitionPruning` is true. > Although this metric is used only for testing purposes, we should still > report a correct metric regardless of SQL options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16174) Improve OptimizeIn optimizer to remove deterministic repetitions
[ https://issues.apache.org/jira/browse/SPARK-16174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16174: -- Description: This issue improves `OptimizeIn` optimizer to remove the deterministic repetitions from SQL `IN` predicates. This optimizer prevents user mistakes and also can optimize some queries like [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19]. (was: This issue adds an optimizer to remove the duplicated literals from SQL `IN` predicates. This optimizer prevents user mistakes and also can optimize some queries like [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19].) Summary: Improve OptimizeIn optimizer to remove deterministic repetitions (was: Add RemoveLiteralRepetitionFromIn optimizer) > Improve OptimizeIn optimizer to remove deterministic repetitions > > > Key: SPARK-16174 > URL: https://issues.apache.org/jira/browse/SPARK-16174 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Priority: Minor > > This issue improves `OptimizeIn` optimizer to remove the deterministic > repetitions from SQL `IN` predicates. This optimizer prevents user mistakes > and also can optimize some queries like > [TPCDS-36|https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q36.sql#L19]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
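To see the rule in action, a small sketch (the table `t` and column `c` are hypothetical): after the improvement, the optimized plan should no longer contain the repeated literal.
{code}
val df = spark.sql("SELECT * FROM t WHERE c IN (1, 1, 2)")
// The optimized logical plan should show the de-duplicated list, i.e. c IN (1, 2).
df.explain(true)
{code}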
[jira] [Updated] (SPARK-15663) SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
[ https://issues.apache.org/jira/browse/SPARK-15663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15663: - Labels: release_notes releasenotes (was: ) > SparkSession.catalog.listFunctions shouldn't include the list of built-in > functions > --- > > Key: SPARK-15663 > URL: https://issues.apache.org/jira/browse/SPARK-15663 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sandeep Singh > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > SparkSession.catalog.listFunctions currently returns all functions, including > the list of built-in functions. This makes the method not as useful because > anytime it is run the result set contains over 100 built-in functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
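The behavior described is easy to observe in a fresh Spark 2.0 session (sketch):
{code}
// Even with no user-defined functions registered, built-ins dominate the
// listing, so the count is well over 100.
spark.catalog.listFunctions().count()
{code}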
[jira] [Assigned] (SPARK-16177) model loading backward compatibility for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16177: Assignee: Apache Spark > model loading backward compatibility for ml.regression > -- > > Key: SPARK-16177 > URL: https://issues.apache.org/jira/browse/SPARK-16177 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16177) model loading backward compatibility for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347399#comment-15347399 ] Apache Spark commented on SPARK-16177: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/13879 > model loading backward compatibility for ml.regression > -- > > Key: SPARK-16177 > URL: https://issues.apache.org/jira/browse/SPARK-16177 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16177) model loading backward compatibility for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16177: Assignee: (was: Apache Spark) > model loading backward compatibility for ml.regression > -- > > Key: SPARK-16177 > URL: https://issues.apache.org/jira/browse/SPARK-16177 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347172#comment-15347172 ] Ryan Blue edited comment on SPARK-16032 at 6/23/16 10:59 PM: - bq. I am not sure apply by-name resolution just to partition columns is a good idea. I'm not sure about this. After looking into it more, I agree in principle and that in the long term we don't want to mix by-position column matching with by-name partitioning. But I'm less certain about whether or not it's a good idea right now. As I look at it more, I agree with you guys more about what is "right". But, I'm still concerned about how to move forward from where we're at, given the way people are currently using the API. I think we've already established that it isn't clear that the DataFrameWriter API relies on position. I actually think that most people aren't thinking about the choice between by-position or by-name resolution and are using what they get working. My first use of the API was to build a partitioned table from an unpartitioned table, which failed. When I went looking for a solution, {{partitionBy}} was the obvious choice (suggested by my IDE) and, sure enough, it fixed the problem by [moving the partition columns by name|https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L180] to the end. This solution is common because it works and is more obvious than thinking about column order because, as I noted above, it isn't clear that {{insertInto}} is using position. The pattern of using {{partitionBy}} with {{insertInto}} has also become a best practice for maintaining ETL jobs in Spark. Consider this table setup, where data lands in {{src}} in batches and we move it to {{dest}} for long-term storage in Parquet. Here's some example DDL: {code:lang=sql} CREATE TABLE src (id string, timestamp bigint, other_properties map<string, string>); CREATE TABLE dest (id string, timestamp bigint, c1 string, c2 int) PARTITIONED BY (utc_dateint int, utc_hour int); {code} The Spark code for this ETL job should be this: {code:lang=java} spark.table("src") .withColumn("c1", $"other_properties".getItem("c1")) .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType)) .withColumn("date", dateint($"timestamp")) .withColumn("hour", hour($"timestamp")) .drop("other_properties") .write.insertInto("dest") {code} But, users are likely to try this next version instead. That's because it isn't obvious that partition columns go after data columns; they are two separate lists in the DDL. {code:lang=java} spark.table("src") .withColumn("date", dateint($"timestamp")) .withColumn("hour", hour($"timestamp")) .withColumn("c1", $"other_properties".getItem("c1")) .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType)) .drop("other_properties") .write.insertInto("dest") {code} And again, the most obvious fix is to add {{partitionBy}} to specify the partition columns, which appears to users as a match for Hive's {{PARTITION("date", "hour")}} syntax. Users then get the impression that {{partitionBy}} is equivalent to {{PARTITION}}, though in reality Hive operates by position.
{code:lang=java} spark.table("src") .withColumn("date", dateint($"timestamp")) .withColumn("hour", hour($"timestamp")) .withColumn("c1", $"other_properties".getItem("c1")) .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType)) .drop("other_properties") .write.partitionBy("date", "hour").insertInto("dest") {code} Another reason to use {{partitionBy}} is for maintaining ETL over time. When structure changes, so does column order. Say I want to add a dedup step so I get just one row per ID per day. My first attempt, based on getting the column order right to begin with, looks like this: {code:lang=java} // column orders change, causing the query to break spark.table("src") .withColumn("date", dateint($"timestamp")) // moved to before dropDuplicates .dropDuplicates("date", "id") // added to dedup records .withColumn("c1", $"other_properties".getItem("c1")) .withColumn("c2", $"other_properties".getItem("c2").cast(IntegerType)) .withColumn("hour", hour($"timestamp")) .drop("other_properties") .write.insertInto("dest") {code} The result is that I get crazy partitioning because c2 and hour are used. The most obvious symptom of the wrong column order is partitioning, and when I look into it, I find {{partitionBy}} fixes it. In many cases, that's the first method I'll try because I see bad partition values. This solution doesn't always fix the query, but it does solve the partitioning problem I observed. (Also: other order problems are hidden by inserting {{Cast}} instead of the safer {{UpCast}}.) Users will also _choose_ this over the right solution, whi
[jira] [Updated] (SPARK-15443) Properly explain the streaming queries
[ https://issues.apache.org/jira/browse/SPARK-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15443: - Fix Version/s: (was: 2.0.0) 2.1.0 2.0.1 > Properly explain the streaming queries > -- > > Key: SPARK-15443 > URL: https://issues.apache.org/jira/browse/SPARK-15443 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Currently, calling `explain()` on a streaming Dataset only shows the > parsed and analyzed logical plans; the optimized logical plan and > physical plan throw exceptions, like below: > {code} > scala> res0.explain(true) > == Parsed Logical Plan == > FileSource[file:///tmp/input] > == Analyzed Logical Plan == > value: string > FileSource[file:///tmp/input] > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with write.startStream(); > == Physical Plan == > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with write.startStream(); > {code} > The reason is that structured streaming dynamically materializes the plan > at run time. > So here we should figure out a way to properly get the streaming plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15443) Properly explain the streaming queries
[ https://issues.apache.org/jira/browse/SPARK-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15443. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13815 [https://github.com/apache/spark/pull/13815] > Properly explain the streaming queries > -- > > Key: SPARK-15443 > URL: https://issues.apache.org/jira/browse/SPARK-15443 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 2.0.0 > > > Currently, calling `explain()` on a streaming Dataset only shows the > parsed and analyzed logical plans; the optimized logical plan and > physical plan throw exceptions, like below: > {code} > scala> res0.explain(true) > == Parsed Logical Plan == > FileSource[file:///tmp/input] > == Analyzed Logical Plan == > value: string > FileSource[file:///tmp/input] > == Optimized Logical Plan == > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with write.startStream(); > == Physical Plan == > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with write.startStream(); > {code} > The reason is that structured streaming dynamically materializes the plan > at run time. > So here we should figure out a way to properly get the streaming plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
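A minimal way to hit the quoted behavior (sketch; API names shifted during the 2.0 cycle, so this uses the final 2.0 `readStream` form and assumes the /tmp/input path from the description):
{code}
val streaming = spark.readStream.text("/tmp/input")
// Before this fix: the parsed/analyzed plans print, while the optimized and
// physical plan sections surface the AnalysisException quoted above.
streaming.explain(true)
{code}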
[jira] [Created] (SPARK-16177) model loading backward compatibility for ml.regression
yuhao yang created SPARK-16177: -- Summary: model loading backward compatibility for ml.regression Key: SPARK-16177 URL: https://issues.apache.org/jira/browse/SPARK-16177 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15854) Spark History server gets null pointer exception
[ https://issues.apache.org/jira/browse/SPARK-15854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesha Vora resolved SPARK-15854. Resolution: Cannot Reproduce > Spark History server gets null pointer exception > > > Key: SPARK-15854 > URL: https://issues.apache.org/jira/browse/SPARK-15854 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yesha Vora > > In Spark2, Spark-History Server is configured to FSHistoryProvider. > Spark HS does not show any finished/running applications and gets Null > pointer exception. > {code} > 16/06/03 23:06:40 INFO FsHistoryProvider: Replaying log path: > hdfs://xx:8020/spark2-history/application_1464912457462_0002.inprogress > 16/06/03 23:06:50 INFO FsHistoryProvider: Replaying log path: > hdfs://xx:8020/spark2-history/application_1464912457462_0002 > 16/06/03 23:08:27 WARN ServletHandler: Error for /api/v1/applications > java.lang.NoSuchMethodError: > javax.ws.rs.core.Application.getProperties()Ljava/util/Map; > at > org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:331) > at > org.glassfish.jersey.servlet.WebComponent.(WebComponent.java:392) > at > org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:177) > at > org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:369) > at javax.servlet.GenericServlet.init(GenericServlet.java:244) > at > org.spark_project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:616) > at > org.spark_project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:472) > at > org.spark_project.jetty.servlet.ServletHolder.ensureInstance(ServletHolder.java:767) > at > org.spark_project.jetty.servlet.ServletHolder.prepare(ServletHolder.java:752) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > 16/06/03 23:08:33 WARN ServletHandler: /api/v1/applications > java.lang.NullPointerException > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) >
[jira] [Commented] (SPARK-15955) Failed Spark application returns with exitcode equals to zero
[ https://issues.apache.org/jira/browse/SPARK-15955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347360#comment-15347360 ] Yesha Vora commented on SPARK-15955: [~sowen], I'm checking exit code of the process which started the application. [~tgraves]. This issue happens with yarn-client and yarn-cluster mode both. > Failed Spark application returns with exitcode equals to zero > - > > Key: SPARK-15955 > URL: https://issues.apache.org/jira/browse/SPARK-15955 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora > > Scenario: > * Set up cluster with wire-encryption enabled. > * set 'spark.authenticate.enableSaslEncryption' = 'false' and > 'spark.shuffle.service.enabled' :'true' > * run sparkPi application. > {code} > client token: Token { kind: YARN_CLIENT_TOKEN, service: } > diagnostics: Max number of executor failures (3) reached > ApplicationMaster host: xx.xx.xx.xxx > ApplicationMaster RPC port: 0 > queue: default > start time: 1465941051976 > final status: FAILED > tracking URL: https://xx.xx.xx.xxx:8090/proxy/application_1465925772890_0016/ > user: hrt_qa > Exception in thread "main" org.apache.spark.SparkException: Application > application_1465925772890_0016 finished with failed status > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1092) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1139) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > INFO ShutdownHookManager: Shutdown hook called{code} > This spark application exits with exitcode = 0. Failed application should not > return with exitcode = 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347355#comment-15347355 ] roberto hashioka commented on SPARK-13288: -- Do you really need to create multiple streams with DirectStream? I think you can create just one and Spark does the rest. > [1.6.0] Memory leak in Spark streaming > -- > > Key: SPARK-13288 > URL: https://issues.apache.org/jira/browse/SPARK-13288 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 > Environment: Bare metal cluster > RHEL 6.6 >Reporter: JESSE CHEN > Labels: streaming > > Streaming in 1.6 seems to have a memory leak. > Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 > showed a gradual increasing processing time. > The app is simple: 1 Kafka receiver of tweet stream and 20 executors > processing the tweets in 5-second batches. > Spark 1.5.0 handles this smoothly and did not show increasing processing time > in the 40-minute test; but 1.6 showed increasing time about 8 minutes into > the test. Please see chart here: > https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116 > I captured heap dumps in two version and did a comparison. I noticed the Byte > is using 50X more space in 1.5.1. > Here are some top classes in heap histogram and references. > Heap Histogram > > All Classes (excluding platform) > 1.6.0 Streaming 1.5.1 Streaming > Class Instance Count Total Size Class Instance Count Total > Size > class [B 84533,227,649,599 class [B5095 > 62,938,466 > class [C 44682 4,255,502 class [C130482 > 12,844,182 > class java.lang.reflect.Method90591,177,670 class > java.lang.String 130171 1,562,052 > > > References by TypeReferences by Type > > class [B [0x640039e38]class [B [0x6c020bb08] > > > Referrers by Type Referrers by Type > > Class Count Class Count > java.nio.HeapByteBuffer 3239 > sun.security.util.DerInputBuffer1233 > sun.security.util.DerInputBuffer 1233 > sun.security.util.ObjectIdentifier 620 > sun.security.util.ObjectIdentifier620 [[B 397 > [Ljava.lang.Object; 408 java.lang.reflect.Method > 326 > > The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0. > The Java.nio.HeapByteBuffer referencing class did not show up in top in > 1.5.1. > I have also placed jstack output for 1.5.1 and 1.6.0 online..you can get them > here > https://ibm.box.com/sparkstreaming-jstack160 > https://ibm.box.com/sparkstreaming-jstack151 > Jesse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347338#comment-15347338 ] JESSE CHEN commented on SPARK-13288: [~AlexSparkJiang]The code is: val multiTweetStreams=(1 to numStreams).map {i => KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) } // unified stream val tweetStream= ssc.union(multiTweetStreams) > [1.6.0] Memory leak in Spark streaming > -- > > Key: SPARK-13288 > URL: https://issues.apache.org/jira/browse/SPARK-13288 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 > Environment: Bare metal cluster > RHEL 6.6 >Reporter: JESSE CHEN > Labels: streaming > > Streaming in 1.6 seems to have a memory leak. > Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 > showed a gradual increasing processing time. > The app is simple: 1 Kafka receiver of tweet stream and 20 executors > processing the tweets in 5-second batches. > Spark 1.5.0 handles this smoothly and did not show increasing processing time > in the 40-minute test; but 1.6 showed increasing time about 8 minutes into > the test. Please see chart here: > https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116 > I captured heap dumps in two version and did a comparison. I noticed the Byte > is using 50X more space in 1.5.1. > Here are some top classes in heap histogram and references. > Heap Histogram > > All Classes (excluding platform) > 1.6.0 Streaming 1.5.1 Streaming > Class Instance Count Total Size Class Instance Count Total > Size > class [B 84533,227,649,599 class [B5095 > 62,938,466 > class [C 44682 4,255,502 class [C130482 > 12,844,182 > class java.lang.reflect.Method90591,177,670 class > java.lang.String 130171 1,562,052 > > > References by TypeReferences by Type > > class [B [0x640039e38]class [B [0x6c020bb08] > > > Referrers by Type Referrers by Type > > Class Count Class Count > java.nio.HeapByteBuffer 3239 > sun.security.util.DerInputBuffer1233 > sun.security.util.DerInputBuffer 1233 > sun.security.util.ObjectIdentifier 620 > sun.security.util.ObjectIdentifier620 [[B 397 > [Ljava.lang.Object; 408 java.lang.reflect.Method > 326 > > The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0. > The Java.nio.HeapByteBuffer referencing class did not show up in top in > 1.5.1. > I have also placed jstack output for 1.5.1 and 1.6.0 online..you can get them > here > https://ibm.box.com/sparkstreaming-jstack160 > https://ibm.box.com/sparkstreaming-jstack151 > Jesse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
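To illustrate the suggestion in the earlier comment, a sketch reusing the `ssc`, `kafkaParams`, and `topicsSet` from the snippet above: a single direct stream already consumes every partition of the listed topics, so unioning N identical streams may be unnecessary (and can even duplicate records).
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// One direct stream is enough; Spark parallelizes across the Kafka partitions.
val tweetStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
{code}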
[jira] [Updated] (SPARK-16164) CombineFilters should keep the ordering in the logical plan
[ https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16164: -- Summary: CombineFilters should keep the ordering in the logical plan (was: Filter pushdown should keep the ordering in the logical plan) > CombineFilters should keep the ordering in the logical plan > --- > > Key: SPARK-16164 > URL: https://issues.apache.org/jira/browse/SPARK-16164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Dongjoon Hyun > Fix For: 2.0.1, 2.1.0 > > > [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with > additional filters. It seems that during filter pushdown, we changed the > ordering in the logical plan. I'm not sure whether we should treat this as a > bug. > {code} > val df1 = (0 until 3).map(_.toString).toDF > val indexer = new StringIndexer() > .setInputCol("value") > .setOutputCol("idx") > .setHandleInvalid("skip") > .fit(df1) > val df2 = (0 until 5).map(_.toString).toDF > val predictions = indexer.transform(df2) > predictions.show() // this is okay > predictions.where('idx > 2).show() // this will throw an exception > {code} > Please see the notebook at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html > for error messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16176) model loading backward compatibility for ml.recommendation
yuhao yang created SPARK-16176: -- Summary: model loading backward compatibility for ml.recommendation Key: SPARK-16176 URL: https://issues.apache.org/jira/browse/SPARK-16176 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Priority: Minor Check if current ALS can load the models saved by Apache Spark 1.6. If not, we need a fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
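The check this issue asks for is mechanical (sketch; the path is hypothetical): save an ALS model with Spark 1.6, then attempt to load it with the current build.
{code}
import org.apache.spark.ml.recommendation.ALSModel

// Succeeds or fails depending on whether the 1.6 on-disk metadata still
// matches what the current loader expects.
val model = ALSModel.load("/tmp/als-model-saved-by-1.6")
{code}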
[jira] [Resolved] (SPARK-16164) Filter pushdown should keep the ordering in the logical plan
[ https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16164. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13872 [https://github.com/apache/spark/pull/13872] > Filter pushdown should keep the ordering in the logical plan > > > Key: SPARK-16164 > URL: https://issues.apache.org/jira/browse/SPARK-16164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > Fix For: 2.0.1, 2.1.0 > > > [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with > additional filters. It seems that during filter pushdown, we changed the > ordering in the logical plan. I'm not sure whether we should treat this as a > bug. > {code} > val df1 = (0 until 3).map(_.toString).toDF > val indexer = new StringIndexer() > .setInputCol("value") > .setOutputCol("idx") > .setHandleInvalid("skip") > .fit(df1) > val df2 = (0 until 5).map(_.toString).toDF > val predictions = indexer.transform(df2) > predictions.show() // this is okay > predictions.where('idx > 2).show() // this will throw an exception > {code} > Please see the notebook at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html > for error messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16164) Filter pushdown should keep the ordering in the logical plan
[ https://issues.apache.org/jira/browse/SPARK-16164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16164: -- Assignee: Dongjoon Hyun > Filter pushdown should keep the ordering in the logical plan > > > Key: SPARK-16164 > URL: https://issues.apache.org/jira/browse/SPARK-16164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Dongjoon Hyun > Fix For: 2.0.1, 2.1.0 > > > [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with > additional filters. It seems that during filter pushdown, we changed the > ordering in the logical plan. I'm not sure whether we should treat this as a > bug. > {code} > val df1 = (0 until 3).map(_.toString).toDF > val indexer = new StringIndexer() > .setInputCol("value") > .setOutputCol("idx") > .setHandleInvalid("skip") > .fit(df1) > val df2 = (0 until 5).map(_.toString).toDF > val predictions = indexer.transform(df2) > predictions.show() // this is okay > predictions.where('idx > 2).show() // this will throw an exception > {code} > Please see the notebook at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html > for error messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11744) bin/pyspark --version doesn't return version and exit
[ https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347297#comment-15347297 ] Nicholas Chammas commented on SPARK-11744: -- This is not the appropriate place to ask random PySpark questions. Please post a question on Stack Overflow or on the Spark user list. > bin/pyspark --version doesn't return version and exit > - > > Key: SPARK-11744 > URL: https://issues.apache.org/jira/browse/SPARK-11744 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Nicholas Chammas >Assignee: Saisai Shao >Priority: Minor > Fix For: 1.6.0 > > > {{bin/pyspark \-\-help}} offers a {{\-\-version}} option: > {code} > $ ./spark/bin/pyspark --help > Usage: ./bin/pyspark [options] > Options: > ... > --version, Print the version of current Spark > ... > {code} > However, trying to get the version in this way doesn't yield the expected > results. > Instead of printing the version and exiting, we get the version, a stack > trace, and then get dropped into a broken PySpark shell. > {code} > $ ./spark/bin/pyspark --version > Python 2.7.10 (default, Aug 11 2015, 23:39:10) > [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.5.2 > /_/ > > Type --help for more information. > Traceback (most recent call last): > File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in <module> > sc = SparkContext(pyFiles=add_files) > File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in > _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in > launch_gateway > raise Exception("Java gateway process exited before sending the driver > its port number") > Exception: Java gateway process exited before sending the driver its port > number > >>> > >>> sc > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > NameError: name 'sc' is not defined > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16175) Handle None for all Python UDT
[ https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16175: Assignee: (was: Apache Spark) > Handle None for all Python UDT > -- > > Key: SPARK-16175 > URL: https://issues.apache.org/jira/browse/SPARK-16175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Davies Liu > Attachments: nullvector.dbc > > > For Scala UDTs, we do not call serialize()/deserialize() for null values; we > should do the same in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16175) Handle None for all Python UDT
[ https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16175: --- Affects Version/s: 2.0.0 1.6.1 > Handle None for all Python UDT > -- > > Key: SPARK-16175 > URL: https://issues.apache.org/jira/browse/SPARK-16175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Davies Liu > Attachments: nullvector.dbc > > > For Scala UDTs, we do not call serialize()/deserialize() for null values; we > should do the same in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16175) Handle None for all Python UDT
[ https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16175: Assignee: Apache Spark > Handle None for all Python UDT > -- > > Key: SPARK-16175 > URL: https://issues.apache.org/jira/browse/SPARK-16175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Davies Liu >Assignee: Apache Spark > Attachments: nullvector.dbc > > > For Scala UDTs, we do not call serialize()/deserialize() for null values; we > should do the same in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16175) Handle None for all Python UDT
[ https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347287#comment-15347287 ] Apache Spark commented on SPARK-16175: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/13878 > Handle None for all Python UDT > -- > > Key: SPARK-16175 > URL: https://issues.apache.org/jira/browse/SPARK-16175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Davies Liu > Attachments: nullvector.dbc > > > For Scala UDTs, we do not call serialize()/deserialize() for null values; we > should do the same in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
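For readers following the SPARK-16175 thread, the null-skipping contract being requested looks roughly like this in plain Scala; the function names below are invented for this sketch and are not the actual Spark internals:
{code}
// Skip the UDT round-trip entirely for null, mirroring the Scala-side
// behavior the issue describes; UDT authors then never see null/None.
def serializeField(udtSerialize: Any => Any, value: Any): Any =
  if (value == null) null else udtSerialize(value)

def deserializeField(udtDeserialize: Any => Any, datum: Any): Any =
  if (datum == null) null else udtDeserialize(datum)
{code}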
[jira] [Comment Edited] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347255#comment-15347255 ] Bo Meng edited comment on SPARK-16173 at 6/23/16 9:58 PM: -- Using the latest master, I was not able to reproduce; here is my code: {quote} val a = Seq(("Alice", 1)).toDF("name", "age").describe() val b = Seq(("Bob", 2)).toDF("name", "grade").describe() a.show() b.show() a.join(b, Seq("summary")).show() {quote} Anything I am missing? I am using Scala 2.11, does it only happen with Scala 2.10? was (Author: bomeng): Using the latest master, I was not able to reproduce; here is my code: {quote} val a = Seq(("Alice", 1)).toDF("name", "age").describe() val b = Seq(("Bob", 2)).toDF("name", "grade").describe() a.show() b.show() a.join(b, Seq("summary")).show() {quote} Anything I am missing? > Can't join describe() of DataFrame in Scala 2.10 > > > Key: SPARK-16173 > URL: https://issues.apache.org/jira/browse/SPARK-16173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Davies Liu > > describe() of DataFrame uses Seq() (it's actually an Iterator) to create > another DataFrame, which cannot be serialized in Scala 2.10. > {code} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2060) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706) > at > org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 
Caused by: java.io.NotSerializableException: > scala.collection.Iterator$$anon$11 > Serialization stack: > - object not serializable (class: scala.collection.Iterator$$anon$11, > value: empty iterator) > - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: > $outer, type: interface scala.collection.Iterator) > - object (class scala.collection.Iterator$$anonfun$toStream$1, > ) > - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object (class scala.collection.immutable.Stream$Cons, > Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), > WrappedArray(2), WrappedArray(2))) > - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: > $outer, type: class scala.collection.immutable.Stream) > - object (class scala.collection.immutable.Stream$$anonfun$zip$1, > ) > - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object
[jira] [Updated] (SPARK-16004) Improve CatalogTable information
[ https://issues.apache.org/jira/browse/SPARK-16004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bo Meng updated SPARK-16004: Description: A few issues were found when running the "describe extended | formatted [tableName]" command: 1. The last access time is incorrectly displayed as something like "Last Access Time: |Wed Dec 31 15:59:59 PST 1969"; I think we should display it as "UNKNOWN", as Hive does; 2. The comment field displays "null" instead of an empty string when the comment is None; I will make a PR shortly. was: A few issues were found when running the "describe extended | formatted [tableName]" command: 1. The last access time is incorrectly displayed as something like "Last Access Time: |Wed Dec 31 15:59:59 PST 1969"; I think we should display it as "UNKNOWN", as Hive does; 2. The owner is always empty, instead of the current login user who created the table; 3. The comment field displays "null" instead of an empty string when the comment is None; I will make a PR shortly. > Improve CatalogTable information > > > Key: SPARK-16004 > URL: https://issues.apache.org/jira/browse/SPARK-16004 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Bo Meng > > A few issues were found when running the "describe extended | formatted [tableName]" > command: > 1. The last access time is incorrectly displayed as something like "Last Access > Time: |Wed Dec 31 15:59:59 PST 1969"; I think we should display it as > "UNKNOWN", as Hive does; > 2. The comment field displays "null" instead of an empty string when the comment is > None; > I will make a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15606) Driver hang in o.a.s.DistributedSuite on 2 core machine
[ https://issues.apache.org/jira/browse/SPARK-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-15606: - Fix Version/s: (was: 1.6.2) 1.6.3 > Driver hang in o.a.s.DistributedSuite on 2 core machine > --- > > Key: SPARK-15606 > URL: https://issues.apache.org/jira/browse/SPARK-15606 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: AMD64 box with only 2 cores >Reporter: Pete Robbins >Assignee: Pete Robbins > Fix For: 1.6.3, 2.0.0 > > > repeatedly failing task that crashes JVM *** FAILED *** > The code passed to failAfter did not complete within 10 milliseconds. > (DistributedSuite.scala:128) > This test started failing, and DistributedSuite hanging, following > https://github.com/apache/spark/pull/13055 > It looks like the extra message to remove the BlockManager deadlocks, as there > are only 2 message-processing loop threads. Related to > https://issues.apache.org/jira/browse/SPARK-13906 > {code} > /** Thread pool used for dispatching messages. */ > private val threadpool: ThreadPoolExecutor = { > val numThreads = > nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads", > math.max(2, Runtime.getRuntime.availableProcessors())) > val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, > "dispatcher-event-loop") > for (i <- 0 until numThreads) { > pool.execute(new MessageLoop) > } > pool > } > {code} > Setting a minimum of 3 threads alleviates this issue, but I'm not sure there > isn't another underlying problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
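The mitigation mentioned at the end of the report corresponds to raising the floor in the quoted snippet from 2 to 3, roughly as below; whether this is the final fix is left open in the issue:
{code}
// Reporter's workaround: never run the dispatcher with fewer than 3 threads.
val numThreads =
  nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
    math.max(3, Runtime.getRuntime.availableProcessors()))
{code}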
[jira] [Commented] (SPARK-13288) [1.6.0] Memory leak in Spark streaming
[ https://issues.apache.org/jira/browse/SPARK-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347265#comment-15347265 ] roberto hashioka commented on SPARK-13288: -- I'm using the createDirectStream. > [1.6.0] Memory leak in Spark streaming > -- > > Key: SPARK-13288 > URL: https://issues.apache.org/jira/browse/SPARK-13288 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 > Environment: Bare metal cluster > RHEL 6.6 >Reporter: JESSE CHEN > Labels: streaming > > Streaming in 1.6 seems to have a memory leak. > Running the same streaming app in Spark 1.5.1 and 1.6, all things equal, 1.6 > showed a gradually increasing processing time. > The app is simple: 1 Kafka receiver of a tweet stream and 20 executors > processing the tweets in 5-second batches. > Spark 1.5.1 handled this smoothly and did not show increasing processing time > in the 40-minute test; but 1.6 showed increasing time about 8 minutes into > the test. Please see the chart here: > https://ibm.box.com/s/7q4ulik70iwtvyfhoj1dcl4nc469b116 > I captured heap dumps in the two versions and did a comparison. I noticed that > byte arrays (class [B) use roughly 50X more space in 1.6.0. > Here are the top classes in the heap histograms (all classes, excluding platform): > 1.6.0 Streaming: class [B: 8453 instances, 3,227,649,599 bytes; class [C: 44682 instances, 4,255,502 bytes; class java.lang.reflect.Method: 9059 instances, 1,177,670 bytes. > 1.5.1 Streaming: class [B: 5095 instances, 62,938,466 bytes; class [C: 130482 instances, 12,844,182 bytes; class java.lang.String: 130171 instances, 1,562,052 bytes. > References by type: class [B [0x640039e38] in 1.6.0; class [B [0x6c020bb08] in 1.5.1. > Referrers by type (class: count), 1.6.0: java.nio.HeapByteBuffer: 3239; sun.security.util.DerInputBuffer: 1233; sun.security.util.ObjectIdentifier: 620; [Ljava.lang.Object;: 408. 1.5.1: sun.security.util.DerInputBuffer: 1233; sun.security.util.ObjectIdentifier: 620; [[B: 397; java.lang.reflect.Method: 326. > The total size of class [B is 3GB in 1.6.0 and only 60MB in 1.5.1. > The java.nio.HeapByteBuffer referrer did not show up near the top in 1.5.1. > I have also placed jstack output for 1.5.1 and 1.6.0 online; you can get them > here: > https://ibm.box.com/sparkstreaming-jstack160 > https://ibm.box.com/sparkstreaming-jstack151 > Jesse -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347255#comment-15347255 ] Bo Meng edited comment on SPARK-16173 at 6/23/16 9:38 PM: -- Using the latest master, I was not able to reproduce; here is my code: {quote} val a = Seq(("Alice", 1)).toDF("name", "age").describe() val b = Seq(("Bob", 2)).toDF("name", "grade").describe() a.show() b.show() a.join(b, Seq("summary")).show() {quote} Anything I am missing? was (Author: bomeng): Using the latest master, I was not able to reproduce; here is my code: {{ val a = Seq(("Alice", 1)).toDF("name", "age").describe() val b = Seq(("Bob", 2)).toDF("name", "grade").describe() a.show() b.show() a.join(b, Seq("summary")).show() }} Anything I am missing? > Can't join describe() of DataFrame in Scala 2.10 > > > Key: SPARK-16173 > URL: https://issues.apache.org/jira/browse/SPARK-16173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Davies Liu > > describe() of DataFrame uses Seq() (it's actually an Iterator) to create > another DataFrame, which cannot be serialized in Scala 2.10. > {code} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2060) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706) > at > org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.NotSerializableException: > 
scala.collection.Iterator$$anon$11 > Serialization stack: > - object not serializable (class: scala.collection.Iterator$$anon$11, > value: empty iterator) > - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: > $outer, type: interface scala.collection.Iterator) > - object (class scala.collection.Iterator$$anonfun$toStream$1, > ) > - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object (class scala.collection.immutable.Stream$Cons, > Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), > WrappedArray(2), WrappedArray(2))) > - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: > $outer, type: class scala.collection.immutable.Stream) > - object (class scala.collection.immutable.Stream$$anonfun$zip$1, > ) > - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object (class scala.collection.immutable.Stream$Cons, > Stream((WrappedArray
[jira] [Comment Edited] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347255#comment-15347255 ] Bo Meng edited comment on SPARK-16173 at 6/23/16 9:37 PM: -- Using the latest master, I was not able to reproduce; here is my code: {{ val a = Seq(("Alice", 1)).toDF("name", "age").describe() val b = Seq(("Bob", 2)).toDF("name", "grade").describe() a.show() b.show() a.join(b, Seq("summary")).show() }} Anything I am missing? was (Author: bomeng): Using the latest master, I was not able to reproduce; here is my code: val a = Seq(("Alice", 1)).toDF("name", "age").describe() val b = Seq(("Bob", 2)).toDF("name", "grade").describe() a.show() b.show() a.join(b, Seq("summary")).show() Anything I am missing? > Can't join describe() of DataFrame in Scala 2.10 > > > Key: SPARK-16173 > URL: https://issues.apache.org/jira/browse/SPARK-16173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Davies Liu > > describe() of DataFrame uses Seq() (it's actually an Iterator) to create > another DataFrame, which cannot be serialized in Scala 2.10. > {code} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2060) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706) > at > org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.NotSerializableException: > scala.collection.Iterator$$anon$11 > 
Serialization stack: > - object not serializable (class: scala.collection.Iterator$$anon$11, > value: empty iterator) > - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: > $outer, type: interface scala.collection.Iterator) > - object (class scala.collection.Iterator$$anonfun$toStream$1, > ) > - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object (class scala.collection.immutable.Stream$Cons, > Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), > WrappedArray(2), WrappedArray(2))) > - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: > $outer, type: class scala.collection.immutable.Stream) > - object (class scala.collection.immutable.Stream$$anonfun$zip$1, > ) > - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object (class scala.collection.immutable.Stream$Cons, > Stream((WrappedArray(1),(count,)),
[jira] [Commented] (SPARK-16173) Can't join describe() of DataFrame in Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-16173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347255#comment-15347255 ] Bo Meng commented on SPARK-16173: - Using the latest master, I was not able to reproduce; here is my code: val a = Seq(("Alice", 1)).toDF("name", "age").describe() val b = Seq(("Bob", 2)).toDF("name", "grade").describe() a.show() b.show() a.join(b, Seq("summary")).show() Anything I am missing? > Can't join describe() of DataFrame in Scala 2.10 > > > Key: SPARK-16173 > URL: https://issues.apache.org/jira/browse/SPARK-16173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Davies Liu > > describe() of DataFrame uses Seq() (it's actually an Iterator) to create > another DataFrame, which cannot be serialized in Scala 2.10. > {code} > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2060) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706) > at > org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:82) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1$$anonfun$apply$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$broadcastFuture$1.apply(BroadcastHashJoin.scala:79) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.NotSerializableException: > scala.collection.Iterator$$anon$11 > Serialization stack: > - object not serializable (class: scala.collection.Iterator$$anon$11, > value: empty iterator) > - field (class: scala.collection.Iterator$$anonfun$toStream$1, name: > $outer, type: interface scala.collection.Iterator) > - object (class scala.collection.Iterator$$anonfun$toStream$1, > ) > - 
field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object (class scala.collection.immutable.Stream$Cons, > Stream(WrappedArray(1), WrappedArray(2.0), WrappedArray(NaN), > WrappedArray(2), WrappedArray(2))) > - field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: > $outer, type: class scala.collection.immutable.Stream) > - object (class scala.collection.immutable.Stream$$anonfun$zip$1, > ) > - field (class: scala.collection.immutable.Stream$Cons, name: tl, type: > interface scala.Function0) > - object (class scala.collection.immutable.Stream$Cons, > Stream((WrappedArray(1),(count,)), > (WrappedArray(2.0),(mean,)), > (WrappedArray(NaN),(stddev,)), > (WrappedArray(2),(min,)), (WrappedArray(2),(max, > - field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: > $outer, type: class scala.collection.immutable.Stream) > - object (class scala.collection.immutable.Stream$$anonfun$map$1, > ) > - field (c
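For anyone stuck on Scala 2.10 while this is open, one plausible user-side workaround, assuming the failure really is the lazy Stream captured inside describe()'s result, is to materialize each summary into a fully evaluated DataFrame before joining. materialize() below is a helper invented for this sketch, and a and b are the two describe() results from the comment above:
{code}
import scala.collection.JavaConverters._
import org.apache.spark.sql.DataFrame

// Collect the (tiny) summary rows and rebuild the DataFrame from strict data,
// so the result no longer references the original iterator.
def materialize(df: DataFrame): DataFrame =
  df.sqlContext.createDataFrame(df.collect().toList.asJava, df.schema)

materialize(a).join(materialize(b), Seq("summary")).show()
{code}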
[jira] [Commented] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem
[ https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347247#comment-15347247 ] Bikas Saha commented on SPARK-15565: SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am resolving the jira. > The default value of spark.sql.warehouse.dir needs to explicitly point to > local filesystem > -- > > Key: SPARK-15565 > URL: https://issues.apache.org/jira/browse/SPARK-15565 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > The default value of {{spark.sql.warehouse.dir}} is > {{System.getProperty("user.dir")/warehouse}}. Since > {{System.getProperty("user.dir")}} is a local dir, we should explicitly set > the scheme to local filesystem. > This should be a one line change (at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58). > Also see > https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem
[ https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated SPARK-15565: --- Comment: was deleted (was: SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am resolving the jira.) > The default value of spark.sql.warehouse.dir needs to explicitly point to > local filesystem > -- > > Key: SPARK-15565 > URL: https://issues.apache.org/jira/browse/SPARK-15565 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > The default value of {{spark.sql.warehouse.dir}} is > {{System.getProperty("user.dir")/warehouse}}. Since > {{System.getProperty("user.dir")}} is a local dir, we should explicitly set > the scheme to local filesystem. > This should be a one line change (at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58). > Also see > https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15565) The default value of spark.sql.warehouse.dir needs to explicitly point to local filesystem
[ https://issues.apache.org/jira/browse/SPARK-15565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347247#comment-15347247 ] Bikas Saha edited comment on SPARK-15565 at 6/23/16 9:34 PM: - SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am resolving the jira. was (Author: bikassaha): SPARK-15565 is pulled into Erie Spark2 by the last refresh. So I am resolving the jira. > The default value of spark.sql.warehouse.dir needs to explicitly point to > local filesystem > -- > > Key: SPARK-15565 > URL: https://issues.apache.org/jira/browse/SPARK-15565 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > The default value of {{spark.sql.warehouse.dir}} is > {{System.getProperty("user.dir")/warehouse}}. Since > {{System.getProperty("user.dir")}} is a local dir, we should explicitly set > the scheme to local filesystem. > This should be a one line change (at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58). > Also see > https://issues.apache.org/jira/browse/SPARK-15034?focusedCommentId=15301508&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15301508 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
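The "one line change" the description points at boils down to prefixing the default with an explicit scheme, roughly as below (illustrative plain Scala, not the exact SQLConf.scala text):
{code}
// Before: scheme-less, so it is resolved against fs.defaultFS (e.g. HDFS on a cluster).
val oldDefault = System.getProperty("user.dir") + "/warehouse"

// After: pin the default warehouse location to the local filesystem explicitly.
val newDefault = "file:" + System.getProperty("user.dir") + "/warehouse"
{code}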
[jira] [Updated] (SPARK-16175) Handle None for all Python UDT
[ https://issues.apache.org/jira/browse/SPARK-16175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16175: -- Attachment: nullvector.dbc Databricks notebook demonstrating the issue > Handle None for all Python UDT > -- > > Key: SPARK-16175 > URL: https://issues.apache.org/jira/browse/SPARK-16175 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Davies Liu > Attachments: nullvector.dbc > > > For Scala UDTs, we do not call serialize()/deserialize() for null values; we > should do the same in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16175) Handle None for all Python UDT
Davies Liu created SPARK-16175: -- Summary: Handle None for all Python UDT Key: SPARK-16175 URL: https://issues.apache.org/jira/browse/SPARK-16175 Project: Spark Issue Type: Improvement Components: PySpark, SQL Reporter: Davies Liu For Scala UDTs, we do not call serialize()/deserialize() for null values; we should do the same in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16142) Group naive Bayes methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347205#comment-15347205 ] Apache Spark commented on SPARK-16142: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/13877 > Group naive Bayes methods in generated doc > -- > > Key: SPARK-16142 > URL: https://issues.apache.org/jira/browse/SPARK-16142 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Labels: starter > > Follow SPARK-16107 and group the doc of spark.naiveBayes: spark.naiveBayes, > predict(NB), summary(NB), read/write.ml(NB) under Rd spark.naiveBayes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13697) TransformFunctionSerializer.loads doesn't restore the function's module name if it's '__main__'
[ https://issues.apache.org/jira/browse/SPARK-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13697: - Fix Version/s: (was: 1.6.1) 1.6.2 > TransformFunctionSerializer.loads doesn't restore the function's module name > if it's '__main__' > --- > > Key: SPARK-13697 > URL: https://issues.apache.org/jira/browse/SPARK-13697 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 1.4.2, 1.5.3, 1.6.2, 2.0.0 > > > Here is a reproducer: > {code} > >>> from pyspark.streaming import StreamingContext > >>> from pyspark.streaming.util import TransformFunction > >>> ssc = StreamingContext(sc, 1) > >>> func = TransformFunction(sc, lambda x: x, sc.serializer) > >>> func.rdd_wrapper(lambda x: x) > TransformFunction(<function <lambda> at 0x106ac8b18>) > >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, > >>> func.rdd_wrap_func, func.deserializers))) > >>> func2 = ssc._transformerSerializer.loads(bytes) > >>> print(func2.func.__module__) > None > >>> print(func2.rdd_wrap_func.__module__) > None > >>> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13601) Invoke task failure callbacks before calling outputstream.close()
[ https://issues.apache.org/jira/browse/SPARK-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13601: - Fix Version/s: 1.6.2 > Invoke task failure callbacks before calling outputstream.close() > - > > Key: SPARK-13601 > URL: https://issues.apache.org/jira/browse/SPARK-13601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.6.2, 2.0.0 > > > We need to submit another PR against Spark to call the task failure callbacks > before Spark calls the close function on various output streams. > For example, we need to intercept an exception and call > TaskContext.markTaskFailed before calling close in the following code (in > PairRDDFunctions.scala): > {code} > Utils.tryWithSafeFinally { > while (iter.hasNext) { > val record = iter.next() > writer.write(record._1.asInstanceOf[AnyRef], > record._2.asInstanceOf[AnyRef]) > // Update bytes written metric every few records > maybeUpdateOutputMetrics(outputMetricsAndBytesWrittenCallback, > recordsWritten) > recordsWritten += 1 > } > } { > writer.close() > } > {code} > Changes to Spark should include unit tests to make sure this always works in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
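The ordering being requested, sketched against the quoted PairRDDFunctions code: iter, writer and context are assumed from that surrounding code, and markTaskFailed is internal Spark API, so this is an illustration of the intent rather than a drop-in patch:
{code}
try {
  while (iter.hasNext) {
    val record = iter.next()
    writer.write(record._1.asInstanceOf[AnyRef], record._2.asInstanceOf[AnyRef])
  }
} catch {
  case t: Throwable =>
    context.markTaskFailed(t)  // fire failure callbacks while the stream is still open
    throw t
} finally {
  writer.close()               // close() now runs after the failure callbacks
}
{code}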
[jira] [Updated] (SPARK-13465) Add a task failure listener to TaskContext
[ https://issues.apache.org/jira/browse/SPARK-13465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13465: - Fix Version/s: 1.6.2 > Add a task failure listener to TaskContext > -- > > Key: SPARK-13465 > URL: https://issues.apache.org/jira/browse/SPARK-13465 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.6.2, 2.0.0 > > > TaskContext supports task completion callback, which gets called regardless > of task failures. However, there is no way for the listener to know if there > is an error. This ticket proposes adding a new listener that gets called when > a task fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
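The listener added by this ticket is usable from task code roughly like this; rdd is assumed to be any RDD, and the listener body is just for demonstration:
{code}
import org.apache.spark.TaskContext

rdd.foreachPartition { _ =>
  TaskContext.get().addTaskFailureListener { (ctx, error) =>
    // Invoked only when the task fails, with the causing Throwable in hand.
    System.err.println(s"task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
  }
  // ... task body that may throw ...
}
{code}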