[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277716#comment-15277716 ] Hyukjin Kwon commented on SPARK-15245: -- (BTW, as you might already know, the reason I thought the message was wrong is that it should say {{path}}, not {{basePath}}, because the {{path}} option is what is exposed to users through the {{stream()}} API, e.g. {{option("path", path).stream()}}.) > stream API throws an exception with an incorrect message when the path is not > a directory > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. It would be > great if it had a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275238#comment-15275238 ] Weizhong edited comment on SPARK-14261 at 5/10/16 6:52 AM: --- I also face this issue. I found that each session adds one HiveConf on sun.misc.Launcher$AppClassLoader and one on org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1. None of these HiveConf instances can be released, which eventually leads to OOM. From the OOM dump (analyzed with Eclipse Memory Analyzer), all these HiveConf instances are still referenced; the GC root chain looks like below: {noformat} org.apache.hadoop.hive.conf.HiveConf conf org.apache.hadoop.hive.ql.session.SessionState$SessionStates value java.lang.ThreadLocal$ThreadLocalMap$Entry [19] java.lang.ThreadLocal$ThreadLocalMap$Entry[32] table java.lang.ThreadLocal$ThreadLocalMap threadLocals java.lang.Thread referent java.util.WeakHashMap$Entry conf org.apache.hadoop.hive.ql.session.SessionState state org.apache.spark.sql.hive.client.ClientWrapper metaHive, metadataHive, metaHive org.apache.spark.sql.hive.client.HiveContext $outer org.apache.spark.sql.SQLContext$$anon$4 [265] java.lang.Object[267] array java.util.concurrent.CopyOnWriteArrayList listeners org.apache.spark.scheduler.LiveListenerBus $outer org.apache.spark.util.AsynchronousListenerBus$$anon$1 {noformat} was (Author: sephiroth-lin): I also face this issue. I found that every session adds one metastore connection, and when I dumped the heap I found many Hive and HiveConf objects that are never removed > Memory leak in Spark Thrift Server > -- > > Key: SPARK-14261 > URL: https://issues.apache.org/jira/browse/SPARK-14261 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Xiaochun Liang > Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, > MemorySnapshot.PNG > > > I am running Spark Thrift server on Windows Server 2012. The Spark Thrift > server is launched in YARN client mode. Its memory usage increases gradually > as queries come in. I suspect there is a memory leak in the Spark Thrift > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
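For readers less familiar with this failure mode, the GC root chain above boils down to a ThreadLocal held by a long-lived thread. The following is a minimal, hypothetical Scala sketch of that retention pattern only (it is not Spark or Hive source; {{SessionState}} and the 10 MB buffer are stand-ins for HiveConf-sized state):

{code}
import java.util.concurrent.Executors

object ThreadLocalLeakSketch {
  final class SessionState(val conf: Array[Byte])       // stand-in for a HiveConf-sized object

  private val state = new ThreadLocal[SessionState]     // analogous to SessionState$SessionStates
  private val pool  = Executors.newFixedThreadPool(32)  // long-lived handler threads, never shut down

  def handleSession(): Unit = pool.execute(new Runnable {
    override def run(): Unit = {
      state.set(new SessionState(new Array[Byte](10 * 1024 * 1024)))
      // ... run queries ...
      // Without state.remove() here, the entry stays in the worker thread's
      // ThreadLocalMap, so the buffer stays reachable for as long as the thread lives.
    }
  })

  def main(args: Array[String]): Unit = (1 to 1000).foreach(_ => handleSession())
}
{code}

In the real Thrift Server the leak is compounded by per-session classloaders, but the sketch shows why java.lang.Thread shows up as the GC root keeping HiveConf alive.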
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277700#comment-15277700 ] Hyukjin Kwon commented on SPARK-15245: -- Thank you so much. Let me close my PR. > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277697#comment-15277697 ] Sean Owen commented on SPARK-15245: --- The message seems correct. I agree it could possibly be checked earlier though, yes. I also agree it's not clear if it's worth a little extra code and extra calls to check it, since it's a rare failure mode and handled quickly and correctly anyway. It doesn't affect normal usage. > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277696#comment-15277696 ] Hyukjin Kwon commented on SPARK-15245: -- Oh, the main reason is that I thought {{'basePath' must be a directory}} is not the correct message. Also, I realised this could be checked earlier, on the driver side. It seems the exception is not raised on the driver side, so the driver does not catch it; the code above returns 0 but prints the exception message in my local test. So I thought this could anyway be caught earlier, with a better message. But after thinking about it more, I started to worry that opening the given paths might add overhead... especially for S3. I am not sure about this one. I can close it if we think it is not appropriate. > stream API throws an exception with an incorrect message when the path is not > a directory > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. It would be > great if it had a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
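For illustration, here is a rough sketch of the kind of eager, driver-side validation being discussed, assuming the Hadoop FileSystem API; it is not the code from the linked PR, and the extra {{getFileStatus}} call is exactly the per-path overhead (e.g. against S3) mentioned above:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical eager check: fail fast on the driver with a 'path'-oriented message.
def assertIsDirectory(pathString: String, hadoopConf: Configuration): Unit = {
  val path = new Path(pathString)
  val fs: FileSystem = path.getFileSystem(hadoopConf)  // resolves local FS, HDFS, S3, ...
  val status = fs.getFileStatus(path)                  // one round trip per path
  require(status.isDirectory,
    s"Option 'path' must be a directory for streaming sources, but got: $pathString")
}
{code}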
[jira] [Resolved] (SPARK-15239) spark document conflict about mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-15239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15239. --- Resolution: Not A Problem If you take a look at the docs in the repo you can see this has been updated already. Mesos supports cluster mode. You can ask questions on the user list rather than here. > spark document conflict about mesos cluster > > > Key: SPARK-15239 > URL: https://issues.apache.org/jira/browse/SPARK-15239 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.6.1 >Reporter: JasonChang > > 1. http://spark.apache.org/docs/latest/submitting-applications.html > if your application is submitted from a machine far from the worker machines > (e.g. locally on your laptop), it is common to use cluster mode to minimize > network latency between the drivers and the executors. Note that cluster mode > is currently not supported for Mesos clusters. Currently only YARN supports > cluster mode for Python applications. > 2. http://spark.apache.org/docs/latest/running-on-mesos.html > Spark on Mesos also supports cluster mode, where the driver is launched in > the cluster and the client can find the results of the driver from the Mesos > Web UI. > I'm confused: does Mesos support cluster mode? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15228. --- Resolution: Not A Problem > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-14623: - Description: It relates to https://issues.apache.org/jira/browse/SPARK-7445 Map the labels to 0/1. For example, Input: "yellow,green,red,green,0" The labels: "0, green, red, yellow" Output: 0, 0, 0, 1 0, 1, 0, 0 0, 0, 1, 0 0, 1, 0, 0 1, 0 ,0, 0 Refer to http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html was: It relates to https://issues.apache.org/jira/browse/SPARK-7445 Map the labels to 0/1. For example, Input: "yellow,green,red,green,0" The labels: "0, green, red, yellow" Output: 0, 0, 0, 1 0, 1, 0, 0 0, 0, 1, 0 0, 1, 0, 0 1, 0 ,0, 0 > add label binarizer > > > Key: SPARK-14623 > URL: https://issues.apache.org/jira/browse/SPARK-14623 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: hujiayin >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > It relates to https://issues.apache.org/jira/browse/SPARK-7445 > Map the labels to 0/1. > For example, > Input: > "yellow,green,red,green,0" > The labels: "0, green, red, yellow" > Output: > 0, 0, 0, 1 > 0, 1, 0, 0 > 0, 0, 1, 0 > 0, 1, 0, 0 > 1, 0 ,0, 0 > Refer to > http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
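As a concrete illustration of the mapping described above (a plain Scala sketch only; this is not the proposed ML {{Transformer}} API and the function name is made up):

{code}
// Build the sorted label set and emit one 0/1 row per input label.
def binarizeLabels(input: Seq[String]): (Seq[String], Seq[Array[Int]]) = {
  val labels = input.distinct.sorted             // e.g. Seq("0", "green", "red", "yellow")
  val index  = labels.zipWithIndex.toMap
  val rows = input.map { label =>
    val row = Array.fill(labels.size)(0)
    row(index(label)) = 1                        // set the column for this label
    row
  }
  (labels, rows)
}

// binarizeLabels(Seq("yellow", "green", "red", "green", "0")) reproduces the example above:
// labels = (0, green, red, yellow); rows = 0001, 0100, 0010, 0100, 1000
{code}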
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277690#comment-15277690 ] Sean Owen commented on SPARK-15245: --- Hm, what's the issue here? it says the arg was not a directory, which is the problem. > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277688#comment-15277688 ] Sandeep Singh edited comment on SPARK-15228 at 5/10/16 6:21 AM: Seems to be fixed in the current master. Before !http://i.imgur.com/aiIP4u3.png! After (behaviour on current master) !http://i.imgur.com/jSpIySj.png! was (Author: techaddict): Seems to be fixed in the current master. Before !http://i.imgur.com/aiIP4u3.png! After !http://i.imgur.com/jSpIySj.png! > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277688#comment-15277688 ] Sandeep Singh commented on SPARK-15228: --- Seems to be fixed in the current master. Before !http://i.imgur.com/aiIP4u3.png! After !http://i.imgur.com/jSpIySj.png! > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14162) java.lang.IllegalStateException: Did not find registered driver with class oracle.jdbc.OracleDriver
[ https://issues.apache.org/jira/browse/SPARK-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277682#comment-15277682 ] Sun Rui commented on SPARK-14162: - We met the same error. The cause is that in one worker node, mysql JDBC driver is not in the CLASSPATH. [~mchalek] It seems this is not a bug. It seems that in your case, for some reason, in one worker node, ojdbc6 driver is not automatically loaded and registered. > java.lang.IllegalStateException: Did not find registered driver with class > oracle.jdbc.OracleDriver > --- > > Key: SPARK-14162 > URL: https://issues.apache.org/jira/browse/SPARK-14162 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Zoltan Fedor > > This is an interesting one. > We are using JupyterHub with Python to connect to a Hadoop cluster to run > Spark jobs and as the new Spark versions come out I compile them and add as > new kernels to JupyterHub to be used. > There are also some libraries we are using, like ojdbc to connect to an > Oracle database. > Now the interesting thing, that ojdbc worked fine in Spark 1.6.0 but suddenly > "it cannot be found" in 1.6.1. > Everything, all settings are the same when starting pyspark 1.6.1 and 1.6.0, > so there is no reason for it not to work in 1.6.1 if it works in 1.6.0. > This is the pysparjk code I am running in both 1.6.1 and 1.6.0: > {quote} > df = > sqlContext.read.format('jdbc').options(url='jdbc:oracle:thin:'+connection_script+'', > dbtable='bi.contact').load() > print(df.count()){quote} > And it throws this error in 1.6.1 only: > {quote} > java.lang.IllegalStateException: Did not find registered driver with class > oracle.jdbc.OracleDriver > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:57) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:347) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){quote} > I know that this usually means that the 
ojdbc driver is not available on the > executor, but it is. Spark is being started the exact same way in 1.6.1 as in > 1.6.0 and it does find it on 1.6.0. > I can steadily reproduce this, so the only conclusion that something must > have changed between 1.6.0 and 1.6.1 causing this, but I have see no > "depreciation" notice of anything what could cause this. > Environment variables set when starting pyspark 1.6.1: > {quote} > "SPARK_HOME": "/usr/lib/spark-1.6.1-hive", > "SCALA_HOME": "/usr/lib/scala", > "HADOOP_CONF_DIR": "/etc/hadoop/venus-hadoop-conf", > "HADOOP_HOME": "/usr/bin/hadoop", > "HIVE_HOME": "/usr/bin/hive", > "LD_LIBRARY_PATH": "/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH", > "YARN_HOME": "", > "SPARK_DIST_CLASSPATH": > "/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*", > "SPARK_LIBRARY_PATH": "/usr/lib/hadoop/lib", > "PAT
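A common mitigation for this class of error (a sketch under the assumption that the driver class is simply not loaded and registered on some executors, not a confirmed root cause for this report) is to name the driver class explicitly through the JDBC {{driver}} option, and to make sure the jar reaches executors, e.g. via {{--jars}} or {{spark.executor.extraClassPath}}:

{code}
// Forces each executor to load and register the named driver before opening connections.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // illustrative URL only
  .option("dbtable", "bi.contact")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

println(df.count())
{code}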
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277681#comment-15277681 ] Sean Owen commented on SPARK-15228: --- Still not clear what that means. Can you make a pull request? > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15246: Assignee: (was: Apache Spark) > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-15246: - Summary: Fix code style and improve volatile for SPARK-4452 (was: Fix code style and improve volatile for Spillable) > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277672#comment-15277672 ] Apache Spark commented on SPARK-15246: -- User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/13020 > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15246) Fix code style and improve volatile for SPARK-4452
[ https://issues.apache.org/jira/browse/SPARK-15246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15246: Assignee: Apache Spark > Fix code style and improve volatile for SPARK-4452 > --- > > Key: SPARK-15246 > URL: https://issues.apache.org/jira/browse/SPARK-15246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Lianhui Wang >Assignee: Apache Spark > > For SPARK-4452: > 1. Fix code style. > 2. Remove volatile from the elementsRead method because only one thread > uses it. > 3. Avoid volatile on _elementsRead because the collection increments > _elementsRead on every element it inserts; a volatile write there is very > expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15246) Fix code style and improve volatile for Spillable
Lianhui Wang created SPARK-15246: Summary: Fix code style and improve volatile for Spillable Key: SPARK-15246 URL: https://issues.apache.org/jira/browse/SPARK-15246 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Lianhui Wang For SPARK-4452: 1. Fix code style. 2. Remove volatile from the elementsRead method because only one thread uses it. 3. Avoid volatile on _elementsRead because the collection increments _elementsRead on every element it inserts; a volatile write there is very expensive, so we can avoid it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
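A hypothetical before/after sketch of items 2 and 3 above (not the actual patch; the trait here is made up, though the member names mirror Spillable). The point is that {{_elementsRead}} is bumped once per inserted element by the single task thread that owns the collection, so a volatile write on that hot path buys nothing:

{code}
trait SpillableSketch {
  // before: every increment pays a volatile write
  // @volatile private[this] var _elementsRead = 0L

  // after: plain field, read and written only by the owning task thread
  private[this] var _elementsRead = 0L

  protected def addElementsRead(): Unit = { _elementsRead += 1 }
  protected def elementsRead: Long = _elementsRead
}
{code}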
[jira] [Assigned] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15245: Assignee: Apache Spark > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277647#comment-15277647 ] Apache Spark commented on SPARK-15245: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/13021 > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
[ https://issues.apache.org/jira/browse/SPARK-15245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15245: Assignee: (was: Apache Spark) > stream API throws an exception with an incorrect message when the path is not > a direcotry > - > > Key: SPARK-15245 > URL: https://issues.apache.org/jira/browse/SPARK-15245 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Priority: Trivial > > {code} > val path = "tmp.csv" // This is not a directory > val cars = spark.read > .format("csv") > .stream(path) > .write > .option("checkpointLocation", "streaming.metadata") > .startStream("tmp") > {code} > This throws an exception as below. > {code} > java.lang.IllegalArgumentException: Option 'basePath' must be a directory > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) > at > org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) > at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) > at > org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > {code} > It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be > great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15245) stream API throws an exception with an incorrect message when the path is not a directory
Hyukjin Kwon created SPARK-15245: Summary: stream API throws an exception with an incorrect message when the path is not a direcotry Key: SPARK-15245 URL: https://issues.apache.org/jira/browse/SPARK-15245 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon Priority: Trivial {code} val path = "tmp.csv" // This is not a directory val cars = spark.read .format("csv") .stream(path) .write .option("checkpointLocation", "streaming.metadata") .startStream("tmp") {code} This throws an exception as below. {code} java.lang.IllegalArgumentException: Option 'basePath' must be a directory at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.basePaths(PartitioningAwareFileCatalog.scala:180) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.inferPartitioning(PartitioningAwareFileCatalog.scala:117) at org.apache.spark.sql.execution.datasources.ListingFileCatalog.partitionSpec(ListingFileCatalog.scala:54) at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.allFiles(PartitioningAwareFileCatalog.scala:65) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350) at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$dataFrameBuilder$1(DataSource.scala:197) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$createSource$1.apply(DataSource.scala:201) at org.apache.spark.sql.execution.streaming.FileStreamSource.getBatch(FileStreamSource.scala:101) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:310) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) {code} It seems {{path}} is set to {{basePath}} in {{DataSource}}. This might be great if it has a better message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15207) Use Travis CI for Java Linter and JDK7/8 compilation test
[ https://issues.apache.org/jira/browse/SPARK-15207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15207: -- Description: Currently, Java Linter is disabled in Jenkins tests. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 However, as of today, Spark has 721 Java files with 97,362 lines of code (excluding blanks/comments), about 1/3 the size of the Scala code.
{code}
Language    files    blank    comment     code
Scala        2353    62819     124060   318747
Java          721    18617      23314    97362
{code}
This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. was: Currently, Java Linter is disabled in Jenkins tests. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. > Use Travis CI for Java Linter and JDK7/8 compilation test > - > > Key: SPARK-15207 > URL: https://issues.apache.org/jira/browse/SPARK-15207 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Dongjoon Hyun > > Currently, Java Linter is disabled in Jenkins tests. > https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 > However, as of today, Spark has 721 Java files with 97,362 lines of code > (excluding blanks/comments), about 1/3 the size of the Scala code.
> {code}
> Language    files    blank    comment     code
> Scala        2353    62819     124060   318747
> Java          721    18617      23314    97362
> {code}
> This issue aims to take advantage of Travis CI to handle the following static > analysis by adding a single file, `.travis.yml`, without any additional burden > on the existing servers. > - Java Linter > - JDK7/JDK8 maven compile > Note that this issue does not propose to remove any of the above work items > from Jenkins. That is possible, but we need to observe the Travis CI > stability for a while. The goal of this issue is to remove committers' > overhead on linter-related PRs (the original PR and the follow-up fix PR). > By the way, Spark used Travis CI in the past. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15207) Use Travis CI for Java Linter and JDK7/8 compilation test
[ https://issues.apache.org/jira/browse/SPARK-15207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15207: -- Description: Currently, Java Linter is disabled in Jenkins tests. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. was: Currently, Java Linter is disabled due to lack of testing resources. https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 This issue aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml`, without any additional burden on the existing servers. - Apache License Check - Python Linter - Scala Linter - Java Linter - JDK7/JDK8 maven compile Note that this issue does not propose to remove any of the above work items from Jenkins. That is possible, but we need to observe the Travis CI stability for a while. The goal of this issue is to remove committers' overhead on linter-related PRs (the original PR and the follow-up fix PR). By the way, Spark used Travis CI in the past. Summary: Use Travis CI for Java Linter and JDK7/8 compilation test (was: Use Travis CI for Java/Scala/Python Linter and JDK7/8 compilation test) > Use Travis CI for Java Linter and JDK7/8 compilation test > - > > Key: SPARK-15207 > URL: https://issues.apache.org/jira/browse/SPARK-15207 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Dongjoon Hyun > > Currently, Java Linter is disabled in Jenkins tests. > https://github.com/apache/spark/blob/master/dev/run-tests.py#L554 > This issue aims to take advantage of Travis CI to handle the following static > analysis by adding a single file, `.travis.yml`, without any additional burden > on the existing servers. > - Java Linter > - JDK7/JDK8 maven compile > Note that this issue does not propose to remove any of the above work items > from Jenkins. That is possible, but we need to observe the Travis CI > stability for a while. The goal of this issue is to remove committers' > overhead on linter-related PRs (the original PR and the follow-up fix PR). > By the way, Spark used Travis CI in the past. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15218) Error: Could not find or load main class org.apache.spark.launcher.Main when run from a directory containing colon ':'
[ https://issues.apache.org/jira/browse/SPARK-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277610#comment-15277610 ] Adam Cecile commented on SPARK-15218: - Hello, Good catch, the fix for this Mesos bug should solve my issue. I'm using the Mesosphere Debian package, kept up to date, so I should receive the fix soon. What about hacking around the SPARK_HOME variable within the shell wrapper to add backslashes? No luck? > Error: Could not find or load main class org.apache.spark.launcher.Main when > run from a directory containing colon ':' > -- > > Key: SPARK-15218 > URL: https://issues.apache.org/jira/browse/SPARK-15218 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 1.6.1 >Reporter: Adam Cecile > Labels: mesos > > {noformat} > mkdir /tmp/qwe:rtz > cd /tmp/qwe:rtz > wget > http://www-eu.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-without-hadoop.tgz > tar xvzf spark-1.6.1-bin-without-hadoop.tgz > cd spark-1.6.1-bin-without-hadoop/ > bin/spark-submit > {noformat} > Returns "Error: Could not find or load main class > org.apache.spark.launcher.Main". > That would not be such an issue if the Mesos executor did not put colons in > the generated paths. It means that without hacking (defining a relative > SPARK_HOME path myself) there is no way to run a Spark job inside a Mesos job > container... > Best regards, Adam. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
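The symptom is consistent with the JVM splitting {{-cp}} entries on {{File.pathSeparator}} (':' on Linux), so an install path containing a colon becomes two bogus classpath entries and {{org.apache.spark.launcher.Main}} cannot be found. A small sketch to illustrate the mechanism (the jar path is illustrative only):

{code}
object ColonClasspathSketch {
  def main(args: Array[String]): Unit = {
    val entry  = "/tmp/qwe:rtz/spark-1.6.1-bin-without-hadoop/lib/spark-assembly.jar"
    val pieces = entry.split(java.io.File.pathSeparatorChar)
    pieces.foreach(println)
    // On Linux this prints two entries, neither of which exists:
    //   /tmp/qwe
    //   rtz/spark-1.6.1-bin-without-hadoop/lib/spark-assembly.jar
  }
}
{code}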
[jira] [Commented] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277600#comment-15277600 ] Koert Kuipers commented on SPARK-15204: --- Makes sense > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > Notice how v1 has nullable set to true. The default (and expected) behavior > for Spark SQL is to give an int column nullable = false. For example, if I > had used a built-in aggregator like "sum" instead, it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15244) Type of column name created with sqlContext.createDataFrame() is not consistent.
Kazuki Yokoishi created SPARK-15244: --- Summary: Type of column name created with sqlContext.createDataFrame() is not consistent. Key: SPARK-15244 URL: https://issues.apache.org/jira/browse/SPARK-15244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Environment: CentOS 7, Spark 1.6.0 Reporter: Kazuki Yokoishi Priority: Minor StructField() converts field name to str in __init__. But, when list of str/unicode is passed to sqlContext.createDataFrame() as a schema, the type of StructField.name is not converted. To reproduce: {noformat} >>> schema = StructType([StructField(u"col", StringType())]) >>> df1 = sqlContext.createDataFrame([("a",)], schema) >>> df1.columns # "col" is str ['col'] >>> df2 = sqlContext.createDataFrame([("a",)], [u"col"]) >>> df2.columns # "col" is unicode [u'col'] {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError
Kazuki Yokoishi created SPARK-15243: --- Summary: Binarizer.explainParam(u"...") raises ValueError Key: SPARK-15243 URL: https://issues.apache.org/jira/browse/SPARK-15243 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Environment: CentOS 7, Spark 1.6.0 Reporter: Kazuki Yokoishi Priority: Minor When unicode is passed to Binarizer.explainParam(), a ValueError occurs. To reproduce: {noformat} >>> binarizer = Binarizer(threshold=1.0, inputCol="values", >>> outputCol="features") >>> binarizer.explainParam("threshold") # str can be passed 'threshold: threshold in binary classification prediction, in range [0, 1] (default: 0.0, current: 1.0)' >>> binarizer.explainParam(u"threshold") # unicode cannot be passed --- ValueErrorTraceback (most recent call last) in () > 1 binarizer.explainParam(u"threshold") /usr/spark/current/python/pyspark/ml/param/__init__.py in explainParam(self, param) 96 default value and user-supplied value in a string. 97 """ ---> 98 param = self._resolveParam(param) 99 values = [] 100 if self.isDefined(param): /usr/spark/current/python/pyspark/ml/param/__init__.py in _resolveParam(self, param) 231 return self.getParam(param) 232 else: --> 233 raise ValueError("Cannot resolve %r as a param." % param) 234 235 @staticmethod ValueError: Cannot resolve u'threshold' as a param. {noformat} The same errors occur in other methods. * Binarizer.hasDefault() * Binarizer.getOrDefault() * Binarizer.isSet() These errors are caused by the *isinstance(obj, str)* checks in pyspark.ml.param.Params._resolveParam(). basestring should be used instead of str in isinstance() for backward compatibility, as below. {noformat} if sys.version >= '3': basestring = str if isinstance(obj, basestring): # TODO {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-15204. --- Resolution: Not A Bug > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > Notice how v1 has nullable set to true. The default (and expected) behavior > for Spark SQL is to give an int column nullable = false. For example, if I > had used a built-in aggregator like "sum" instead, it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-15204: - > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15204) Improve nullability inference for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15204: Summary: Improve nullability inference for Aggregator (was: Nullable is not correct for Aggregator) > Improve nullability inference for Aggregator > > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15204) Nullable is not correct for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15204: Issue Type: Improvement (was: Bug) > Nullable is not correct for Aggregator > -- > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Improvement > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15204) Nullable is not correct for Aggregator
[ https://issues.apache.org/jira/browse/SPARK-15204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277592#comment-15277592 ] Reynold Xin commented on SPARK-15204: - I'm going to change the title because it is always correct to assume "nullable". It is just better if we can tighten the nullability when possible. This is not a "bug". > Nullable is not correct for Aggregator > -- > > Key: SPARK-15204 > URL: https://issues.apache.org/jira/browse/SPARK-15204 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: spark-2.0.0-SNAPSHOT >Reporter: koert kuipers >Priority: Minor > > {noformat} > object SimpleSum extends Aggregator[Row, Int, Int] { > def zero: Int = 0 > def reduce(b: Int, a: Row) = b + a.getInt(1) > def merge(b1: Int, b2: Int) = b1 + b2 > def finish(b: Int) = b > def bufferEncoder: Encoder[Int] = Encoders.scalaInt > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > val df = List(("a", 1), ("a", 2), ("a", 3)).toDF("k", "v") > val df1 = df.groupBy("k").agg(SimpleSum.toColumn as "v1") > df1.printSchema > df1.show > root > |-- k: string (nullable = true) > |-- v1: integer (nullable = true) > +---+---+ > | k| v1| > +---+---+ > | a| 6| > +---+---+ > {noformat} > notice how v1 has nullable set to true. the default (and expected) behavior > for spark sql is to give an int column false for nullable. for example if i > had used a built-in aggregator like "sum" instead it would have reported > nullable = false. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
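For readers following along, the inferred nullability of the aggregated column can be inspected directly; a minimal sketch reusing the df1 from the example quoted above (the commented output reflects the current behavior described in this ticket, not a fix):
{code}
// Inspect the inferred nullability of the aggregation result column "v1"
val v1Field = df1.schema("v1")
println(v1Field.nullable) // true today; tighter inference would allow false here
{code}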
[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory
[ https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277583#comment-15277583 ] Apache Spark commented on SPARK-4452: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/13020 > Shuffle data structures can starve others on the same thread for memory > > > Key: SPARK-4452 > URL: https://issues.apache.org/jira/browse/SPARK-4452 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Tianshuo Deng >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > When an Aggregator is used with ExternalSorter in a task, Spark will create > many small files and could cause a "too many open files" error during merging. > Currently, ShuffleMemoryManager does not work well when there are 2 spillable > objects in a thread, which are ExternalSorter and ExternalAppendOnlyMap (used > by Aggregator) in this case. Here is an example: Due to the usage of map-side > aggregation, ExternalAppendOnlyMap is created first to read the RDD. It may > ask as much memory as it can, which is totalMem/numberOfThreads. Then later > on when ExternalSorter is created in the same thread, the > ShuffleMemoryManager could refuse to allocate more memory to it, since the > memory is already given to the previously requested > object (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling > small files (due to the lack of memory). > I'm currently working on a PR to address these two issues. It will include the > following changes: > 1. The ShuffleMemoryManager should not only track the memory usage for each > thread, but also the object that holds the memory > 2. The ShuffleMemoryManager should be able to trigger the spilling of a > spillable object. In this way, if a new object in a thread is requesting > memory, the old occupant could be evicted/spilled. Previously the spillable > objects trigger spilling by themselves, so one may not trigger spilling even > if another object in the same thread needs more memory. After this change the > ShuffleMemoryManager could trigger the spilling of an object if it needs to. > 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously > ExternalAppendOnlyMap returns a destructive iterator and cannot be spilled > after the iterator is returned. This should be changed so that even after the > iterator is returned, the ShuffleMemoryManager can still spill it. > Currently, I have a working branch in progress: > https://github.com/tsdeng/spark/tree/enhance_memory_manager. Already made > change 3 and have a prototype of changes 1 and 2 to evict spillables from the > memory manager, still in progress. I will send a PR when it's done. > Any feedback or thoughts on this change are highly appreciated! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15187) Disallow Dropping Default Database
[ https://issues.apache.org/jira/browse/SPARK-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-15187: Assignee: Xiao Li > Disallow Dropping Default Database > -- > > Key: SPARK-15187 > URL: https://issues.apache.org/jira/browse/SPARK-15187 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.0.0 > > > We should disallow users to drop default database, like what hive metastore > did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15187) Disallow Dropping Default Database
[ https://issues.apache.org/jira/browse/SPARK-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15187. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12962 [https://github.com/apache/spark/pull/12962] > Disallow Dropping Default Database > -- > > Key: SPARK-15187 > URL: https://issues.apache.org/jira/browse/SPARK-15187 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.0.0 > > > We should disallow users to drop default database, like what hive metastore > did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15241) support scala decimal in external row
[ https://issues.apache.org/jira/browse/SPARK-15241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15241: Assignee: Apache Spark (was: Wenchen Fan) > support scala decimal in external row > - > > Key: SPARK-15241 > URL: https://issues.apache.org/jira/browse/SPARK-15241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
[ https://issues.apache.org/jira/browse/SPARK-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15242: Assignee: Wenchen Fan (was: Apache Spark) > keep decimal precision and scale when convert external decimal to catalyst > decimal > -- > > Key: SPARK-15242 > URL: https://issues.apache.org/jira/browse/SPARK-15242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15241) support scala decimal in external row
[ https://issues.apache.org/jira/browse/SPARK-15241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277565#comment-15277565 ] Apache Spark commented on SPARK-15241: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13019 > support scala decimal in external row > - > > Key: SPARK-15241 > URL: https://issues.apache.org/jira/browse/SPARK-15241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
[ https://issues.apache.org/jira/browse/SPARK-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277566#comment-15277566 ] Apache Spark commented on SPARK-15242: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13019 > keep decimal precision and scale when convert external decimal to catalyst > decimal > -- > > Key: SPARK-15242 > URL: https://issues.apache.org/jira/browse/SPARK-15242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15241) support scala decimal in external row
[ https://issues.apache.org/jira/browse/SPARK-15241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15241: Assignee: Wenchen Fan (was: Apache Spark) > support scala decimal in external row > - > > Key: SPARK-15241 > URL: https://issues.apache.org/jira/browse/SPARK-15241 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
[ https://issues.apache.org/jira/browse/SPARK-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15242: Assignee: Apache Spark (was: Wenchen Fan) > keep decimal precision and scale when convert external decimal to catalyst > decimal > -- > > Key: SPARK-15242 > URL: https://issues.apache.org/jira/browse/SPARK-15242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal
Wenchen Fan created SPARK-15242: --- Summary: keep decimal precision and scale when convert external decimal to catalyst decimal Key: SPARK-15242 URL: https://issues.apache.org/jira/browse/SPARK-15242 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15241) support scala decimal in external row
Wenchen Fan created SPARK-15241: --- Summary: support scala decimal in external row Key: SPARK-15241 URL: https://issues.apache.org/jira/browse/SPARK-15241 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15240) Use buffer variables to improve buffer serialization/deserialization in TungstenAggregate
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-15240: Summary: Use buffer variables to improve buffer serialization/deserialization in TungstenAggregate (was: Use buffer variables for update/merge expressions instead duplicate serialization/deserialization) > Use buffer variables to improve buffer serialization/deserialization in > TungstenAggregate > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15240: Assignee: (was: Apache Spark) > Use buffer variables for update/merge expressions instead duplicate > serialization/deserialization > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277536#comment-15277536 ] Apache Spark commented on SPARK-15240: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/13018 > Use buffer variables for update/merge expressions instead duplicate > serialization/deserialization > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
[ https://issues.apache.org/jira/browse/SPARK-15240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15240: Assignee: Apache Spark > Use buffer variables for update/merge expressions instead duplicate > serialization/deserialization > - > > Key: SPARK-15240 > URL: https://issues.apache.org/jira/browse/SPARK-15240 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > We do serialization/deserialization on aggregation buffer in > TungstenAggregate for each aggregation function. It wastes time on duplicate > serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15240) Use buffer variables for update/merge expressions instead duplicate serialization/deserialization
Liang-Chi Hsieh created SPARK-15240: --- Summary: Use buffer variables for update/merge expressions instead duplicate serialization/deserialization Key: SPARK-15240 URL: https://issues.apache.org/jira/browse/SPARK-15240 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh We do serialization/deserialization on aggregation buffer in TungstenAggregate for each aggregation function. It wastes time on duplicate serde for the same grouping keys. We can optimize it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15229) Make case sensitivity setting internal
[ https://issues.apache.org/jira/browse/SPARK-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15229. - Resolution: Fixed Fix Version/s: 2.0.0 > Make case sensitivity setting internal > -- > > Key: SPARK-15229 > URL: https://issues.apache.org/jira/browse/SPARK-15229 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > Our case sensitivity support is different from what ANSI SQL standards > support. Postgres' behavior is that if an identifier is quoted, then it is > treated as case sensitive, otherwise it is folded to lower case. We will > likely need to revisit this in the future and change our behavior. For now, > the safest change to do for Spark 2.0 is to make the case sensitive option > internal and discourage users from turning it on, effectively making Spark > always case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15234. - Resolution: Fixed Fix Version/s: 2.0.0 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277473#comment-15277473 ] Shivaram Venkataraman commented on SPARK-12661: --- Yes - We will do a full round of AMI updates as well to get Python 2.7, Scala 2.11 and Java 8 on the EC2 images. I plan to do that once we have 2.0 RC1 > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15238) Clarify Python 3 support in docs
[ https://issues.apache.org/jira/browse/SPARK-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15238: Assignee: Apache Spark > Clarify Python 3 support in docs > > > Key: SPARK-15238 > URL: https://issues.apache.org/jira/browse/SPARK-15238 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Trivial > > The [current doc|http://spark.apache.org/docs/1.6.1/] reads: > {quote} > Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 > uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). > {quote} > Projects that support Python 3 generally mention that explicitly. A casual > Python user might assume from this line that Spark supports Python 2.6 and > 2.7 but not 3+. > More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ > and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15238) Clarify Python 3 support in docs
[ https://issues.apache.org/jira/browse/SPARK-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15238: Assignee: (was: Apache Spark) > Clarify Python 3 support in docs > > > Key: SPARK-15238 > URL: https://issues.apache.org/jira/browse/SPARK-15238 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial > > The [current doc|http://spark.apache.org/docs/1.6.1/] reads: > {quote} > Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 > uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). > {quote} > Projects that support Python 3 generally mention that explicitly. A casual > Python user might assume from this line that Spark supports Python 2.6 and > 2.7 but not 3+. > More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ > and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15238) Clarify Python 3 support in docs
[ https://issues.apache.org/jira/browse/SPARK-15238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277468#comment-15277468 ] Apache Spark commented on SPARK-15238: -- User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/13017 > Clarify Python 3 support in docs > > > Key: SPARK-15238 > URL: https://issues.apache.org/jira/browse/SPARK-15238 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Reporter: Nicholas Chammas >Priority: Trivial > > The [current doc|http://spark.apache.org/docs/1.6.1/] reads: > {quote} > Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 > uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). > {quote} > Projects that support Python 3 generally mention that explicitly. A casual > Python user might assume from this line that Spark supports Python 2.6 and > 2.7 but not 3+. > More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ > and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15239) spark document conflict about mesos cluster
JasonChang created SPARK-15239: -- Summary: spark document conflict about mesos cluster Key: SPARK-15239 URL: https://issues.apache.org/jira/browse/SPARK-15239 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.6.1 Reporter: JasonChang 1. http://spark.apache.org/docs/latest/submitting-applications.html if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for Mesos clusters. Currently only YARN supports cluster mode for Python applications. 2. http://spark.apache.org/docs/latest/running-on-mesos.html Spark on Mesos also supports cluster mode, where the driver is launched in the cluster and the client can find the results of the driver from the Mesos Web UI. I am confused: does Mesos support cluster mode? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15238) Clarify Python 3 support in docs
Nicholas Chammas created SPARK-15238: Summary: Clarify Python 3 support in docs Key: SPARK-15238 URL: https://issues.apache.org/jira/browse/SPARK-15238 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Reporter: Nicholas Chammas Priority: Trivial The [current doc|http://spark.apache.org/docs/1.6.1/] reads: {quote} Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.1 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x). {quote} Projects that support Python 3 generally mention that explicitly. A casual Python user might assume from this line that Spark supports Python 2.6 and 2.7 but not 3+. More specifically, I gather from SPARK-4897 that Spark actually supports 3.4+ and not earlier versions of Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15237) SparkR corr function documentation
[ https://issues.apache.org/jira/browse/SPARK-15237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277454#comment-15277454 ] Sun Rui commented on SPARK-15237: - SparkR supports two types of corr(), something like corr(SparkDataFrame, "col1", "col2") and corr(df$col1, df$col2). But the documentation of corr seems to contain only the example for the latter. [~felixcheung] how should we fix the documentation? > SparkR corr function documentation > -- > > Key: SPARK-15237 > URL: https://issues.apache.org/jira/browse/SPARK-15237 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 1.6.0, 1.6.1 >Reporter: Shaul >Priority: Minor > Labels: corr, sparkr > > Please review the documentation of the corr function in SparkR, the example > given: corr(df$c, df$d) won't run. The correct usage seems to be > corr(dataFrame,"someColumn","OtherColumn"), is this correct? > Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277450#comment-15277450 ] Nicholas Chammas commented on SPARK-12661: -- [~davies] / [~joshrosen] - Has this been settled on? The dev list discussion from January seemed to converge almost unanimously on dropping Python 2.6 support in Spark 2.0, but I don't think an official decision was ever announced. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277448#comment-15277448 ] Nicholas Chammas commented on SPARK-12661: -- [~shivaram] - Can you confirm that spark-ec2 will drop support for Python 2.6 starting with Spark 2.0? For the record, I don't think it will affect things much either way anymore since spark-ec2 is now a separate project. > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277427#comment-15277427 ] Kai Chen commented on SPARK-13634: -- [~Rahul Palamuttam] and [~chrismattmann] Try {code} @transient val newSC = sc {code} in the REPL to prevent SparkContext from being dragged into the serialization graph. Cheers! > Assigning spark context to variable results in serialization error > -- > > Key: SPARK-13634 > URL: https://issues.apache.org/jira/browse/SPARK-13634 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Rahul Palamuttam >Priority: Minor > > The following lines of code cause a task serialization error when executed in > the spark-shell. > Note that the error does not occur when submitting the code as a batch job - > via spark-submit. > val temp = 10 > val newSC = sc > val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) > For some reason, when temp is being pulled into the referencing environment > of the closure, so is the SparkContext. > We originally hit this issue in the SciSpark project, when referencing a > string variable inside of a lambda expression in RDD.map(...) > Any insight into how this could be resolved would be appreciated. > While the above code is trivial, SciSpark uses a wrapper around the > SparkContext to read from various file formats. We want to keep this class > structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
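For illustration, a minimal sketch of the workaround suggested in the comment above, applied to the snippet from the description (names mirror the report; whether this fully avoids the capture depends on how the REPL wraps the lines):
{code}
@transient val newSC = sc   // keep the SparkContext out of the closure's serialization graph
val temp = 10
val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
newRDD.count()
{code}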
[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-13485: - Labels: releasenotes (was: ) > (Dataset-oriented) API evolution in Spark 2.0 > - > > Key: SPARK-13485 > URL: https://issues.apache.org/jira/browse/SPARK-13485 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > Labels: releasenotes > Fix For: 2.0.0 > > Attachments: API Evolution in Spark 2.0.pdf > > > As part of Spark 2.0, we want to create a stable API foundation for Dataset > to become the main user-facing API in Spark. This ticket tracks various tasks > related to that. > The main high level changes are: > 1. Merge Dataset/DataFrame > 2. Create a more natural entry point for Dataset (SQLContext/HiveContext are > not ideal because of the name "SQL"/"Hive", and "SparkContext" is not ideal > because of its heavy dependency on RDDs) > 3. First class support for sessions > 4. First class support for some system catalog > See the design doc for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties
[ https://issues.apache.org/jira/browse/SPARK-15025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15025: - Assignee: Xin Wu > creating datasource table with option (PATH) results in duplicate path key in > serdeProperties > - > > Key: SPARK-15025 > URL: https://issues.apache.org/jira/browse/SPARK-15025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu >Assignee: Xin Wu > Fix For: 2.0.0 > > > Repro: > {code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as > a, 2 as b{code} > This will create a hive external table whose dataLocation is > "/someDefaultPath", which is not the same as the provided one. Yet, > serdeInfo.parameters contain following key value pairs: > PATH, "/tmp/t1" > path, "/someDefaultPath" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15025) creating datasource table with option (PATH) results in duplicate path key in serdeProperties
[ https://issues.apache.org/jira/browse/SPARK-15025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15025. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12804 [https://github.com/apache/spark/pull/12804] > creating datasource table with option (PATH) results in duplicate path key in > serdeProperties > - > > Key: SPARK-15025 > URL: https://issues.apache.org/jira/browse/SPARK-15025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xin Wu > Fix For: 2.0.0 > > > Repro: > {code}create table t1 using parquet options (PATH "/tmp/t1") as select 1 as > a, 2 as b{code} > This will create a hive external table whose dataLocation is > "/someDefaultPath", which is not the same as the provided one. Yet, > serdeInfo.parameters contain following key value pairs: > PATH, "/tmp/t1" > path, "/someDefaultPath" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15237) SparkR corr function documentation
Shaul created SPARK-15237: - Summary: SparkR corr function documentation Key: SPARK-15237 URL: https://issues.apache.org/jira/browse/SPARK-15237 Project: Spark Issue Type: Documentation Components: SparkR Affects Versions: 1.6.1, 1.6.0 Reporter: Shaul Priority: Minor Please review the documentation of the corr function in SparkR, the example given: corr(df$c, df$d) won't run. The correct usage seems to be corr(dataFrame,"someColumn","OtherColumn"), is this correct? Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Component/s: Spark Shell > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15019) Propagate all Spark Confs to HiveConf created in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-15019: - Labels: release_notes releasenotes (was: ) > Propagate all Spark Confs to HiveConf created in HiveClientImpl > --- > > Key: SPARK-15019 > URL: https://issues.apache.org/jira/browse/SPARK-15019 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > Right now, the HiveConf created in HiveClientImpl only takes conf set at > runtime or set in hive-site.xml. We should also propagate Spark confs to it. > So, users do not have to use hive-site.xml to set warehouse location and > metastore url. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15019) Propagate all Spark Confs to HiveConf created in HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-15019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277312#comment-15277312 ] Yin Huai commented on SPARK-15019: -- This change also drops the support of hive-site.xml. Users should set all hive related confs to Spark conf. > Propagate all Spark Confs to HiveConf created in HiveClientImpl > --- > > Key: SPARK-15019 > URL: https://issues.apache.org/jira/browse/SPARK-15019 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > Right now, the HiveConf created in HiveClientImpl only takes conf set at > runtime or set in hive-site.xml. We should also propagate Spark confs to it. > So, users do not have to use hive-site.xml to set warehouse location and > metastore url. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
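As a rough sketch of what this implies for users (the keys shown are only illustrative: spark.sql.warehouse.dir on the Spark side and the standard Hive setting hive.metastore.uris; exact behavior may differ):
{code}
import org.apache.spark.sql.SparkSession

// With Spark confs propagated into HiveConf, warehouse location and metastore URI
// can be supplied via the Spark conf instead of hive-site.xml (illustrative values).
val spark = SparkSession.builder()
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()
{code}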
[jira] [Resolved] (SPARK-15209) Web UI's timeline visualizations fails to render if descriptions contain single quotes
[ https://issues.apache.org/jira/browse/SPARK-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-15209. Resolution: Fixed > Web UI's timeline visualizations fails to render if descriptions contain > single quotes > -- > > Key: SPARK-15209 > URL: https://issues.apache.org/jira/browse/SPARK-15209 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.2, 2.0.0 > > > If a Spark job's job description contains a single quote (') then the driver > UI's job event timeline will fail to render due to Javascript errors. To > reproduce these symptoms, run > {code} > sc.setJobDescription("double quote: \" ") > sc.parallelize(1 to 10).count() > sc.setJobDescription("single quote: ' ") > sc.parallelize(1 to 10).count() > {code} > and browse to the driver UI. This will currently result in an "Uncaught > SyntaxError" because the single quote is not escaped and ends up closing a > Javascript string literal too early. > I think that a simple fix may be to change the relevant JS to use double > quotes and then to use the existing XML escaping logic to escape the string's > contents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
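A rough sketch of the direction suggested in the description, assuming the "existing XML escaping logic" refers to scala.xml.Utility (the actual fix may differ; jobDescription is a placeholder variable):
{code}
import scala.xml.Utility

// Escape the description before embedding it in the generated Javascript, and emit
// double-quoted JS string literals so a stray single quote cannot close the literal early.
val safeDescription = Utility.escape(jobDescription)
val js = s"""var content = "$safeDescription";"""
{code}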
[jira] [Updated] (SPARK-15209) Web UI's timeline visualizations fails to render if descriptions contain single quotes
[ https://issues.apache.org/jira/browse/SPARK-15209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-15209: --- Fix Version/s: 2.0.0 1.6.2 > Web UI's timeline visualizations fails to render if descriptions contain > single quotes > -- > > Key: SPARK-15209 > URL: https://issues.apache.org/jira/browse/SPARK-15209 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.2, 2.0.0 > > > If a Spark job's job description contains a single quote (') then the driver > UI's job event timeline will fail to render due to Javascript errors. To > reproduce these symptoms, run > {code} > sc.setJobDescription("double quote: \" ") > sc.parallelize(1 to 10).count() > sc.setJobDescription("single quote: ' ") > sc.parallelize(1 to 10).count() > {code} > and browse to the driver UI. This will currently result in an "Uncaught > SyntaxError" because the single quote is not escaped and ends up closing a > Javascript string literal too early. > I think that a simple fix may be to change the relevant JS to use double > quotes and then to use the existing XML escaping logic to escape the string's > contents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15236) No way to disable Hive support in REPL
[ https://issues.apache.org/jira/browse/SPARK-15236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15236: -- Assignee: (was: Andrew Or) > No way to disable Hive support in REPL > -- > > Key: SPARK-15236 > URL: https://issues.apache.org/jira/browse/SPARK-15236 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > If you built Spark with Hive classes, there's no switch to flip to start a > new `spark-shell` using the InMemoryCatalog. The only thing you can do now is > to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15236) No way to disable Hive support in REPL
Andrew Or created SPARK-15236: - Summary: No way to disable Hive support in REPL Key: SPARK-15236 URL: https://issues.apache.org/jira/browse/SPARK-15236 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or If you built Spark with Hive classes, there's no switch to flip to start a new `spark-shell` using the InMemoryCatalog. The only thing you can do now is to rebuild Spark again. That is quite inconvenient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
[ https://issues.apache.org/jira/browse/SPARK-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15235: Assignee: (was: Apache Spark) > Corresponding row cannot be highlighted even though cursor is on the job on > Web UI's timeline > - > > Key: SPARK-15235 > URL: https://issues.apache.org/jira/browse/SPARK-15235 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Kousuke Saruta >Priority: Trivial > > To extract job descriptions and stage names, there are the following regular > expressions in timeline-view.js > {code} > var jobIdText = > $($(baseElem).find(".application-timeline-content")[0]).text(); > var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; > ... > var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); > var stageIdAndAttempt = stageIdText.match("\\(Stage > (\\d+\\.\\d+)\\)")[1].split("."); > {code} > But if job descriptions include patterns like "(Job x)" or stage names > include patterns like "(Stage x.y)", the regular expressions cannot match > as expected, so the corresponding row cannot be highlighted even > though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15186) Add user guide for Generalized Linear Regression.
[ https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277293#comment-15277293 ] Seth Hendrickson commented on SPARK-15186: -- I have a PR for this once [SPARK-14979|https://issues.apache.org/jira/browse/SPARK-14979] is merged. > Add user guide for Generalized Linear Regression. > - > > Key: SPARK-15186 > URL: https://issues.apache.org/jira/browse/SPARK-15186 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Reporter: Seth Hendrickson >Priority: Minor > > We should add a user guide for the new GLR interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
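A minimal sketch of the sort of Scala snippet such a guide would likely include (assuming a training DataFrame with the usual label/features columns; parameter values are arbitrary):
{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
  .setMaxIter(10)
  .setRegParam(0.3)

val model = glr.fit(training)
println(s"Coefficients: ${model.coefficients}, Intercept: ${model.intercept}")
{code}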
[jira] [Assigned] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
[ https://issues.apache.org/jira/browse/SPARK-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15235: Assignee: Apache Spark > Corresponding row cannot be highlighted even though cursor is on the job on > Web UI's timeline > - > > Key: SPARK-15235 > URL: https://issues.apache.org/jira/browse/SPARK-15235 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Trivial > > To extract job descriptions and stage names, there are the following regular > expressions in timeline-view.js > {code} > var jobIdText = > $($(baseElem).find(".application-timeline-content")[0]).text(); > var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; > ... > var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); > var stageIdAndAttempt = stageIdText.match("\\(Stage > (\\d+\\.\\d+)\\)")[1].split("."); > {code} > But if job descriptions include patterns like "(Job x)" or stage names > include patterns like "(Stage x.y)", the regular expressions cannot match > as expected, so the corresponding row cannot be highlighted even > though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
[ https://issues.apache.org/jira/browse/SPARK-15235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277295#comment-15277295 ] Apache Spark commented on SPARK-15235: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/13016 > Corresponding row cannot be highlighted even though cursor is on the job on > Web UI's timeline > - > > Key: SPARK-15235 > URL: https://issues.apache.org/jira/browse/SPARK-15235 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.0 >Reporter: Kousuke Saruta >Priority: Trivial > > To extract job descriptions and stage names, there are the following regular > expressions in timeline-view.js > {code} > var jobIdText = > $($(baseElem).find(".application-timeline-content")[0]).text(); > var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; > ... > var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); > var stageIdAndAttempt = stageIdText.match("\\(Stage > (\\d+\\.\\d+)\\)")[1].split("."); > {code} > But if job descriptions include patterns like "(Job x)" or stage names > include patterns like "(Stage x.y)", the regular expressions cannot match > as expected, so the corresponding row cannot be highlighted even > though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15235) Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
Kousuke Saruta created SPARK-15235: -- Summary: Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline Key: SPARK-15235 URL: https://issues.apache.org/jira/browse/SPARK-15235 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.6.1, 2.0.0 Reporter: Kousuke Saruta Priority: Trivial To extract job descriptions and stage names, there are the following regular expressions in timeline-view.js {code} var jobIdText = $($(baseElem).find(".application-timeline-content")[0]).text(); var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1]; ... var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text(); var stageIdAndAttempt = stageIdText.match("\\(Stage (\\d+\\.\\d+)\\)")[1].split("."); {code} But if job descriptions include patterns like "(Job x)" or stage names include patterns like "(Stage x.y)", the regular expressions cannot match as expected, so the corresponding row cannot be highlighted even though the cursor is moved onto the job on the Web UI's timeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277263#comment-15277263 ] Apache Spark commented on SPARK-15234: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/13014 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14995) Add "since" tag in Roxygen documentation for SparkR API methods
[ https://issues.apache.org/jira/browse/SPARK-14995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277260#comment-15277260 ] Felix Cheung commented on SPARK-14995: -- should we have Experimental for new API like in Python? note:: Experimental > Add "since" tag in Roxygen documentation for SparkR API methods > --- > > Key: SPARK-14995 > URL: https://issues.apache.org/jira/browse/SPARK-14995 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui > > This is request adding something in SparkR API like "versionadded" in PySpark > API and "@since" in Scala/Java API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277243#comment-15277243 ] Sandeep Singh commented on SPARK-15228: --- The Code is formatted properly at the end. > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-15234: - Assignee: Andrew Or > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277237#comment-15277237 ] Apache Spark commented on SPARK-15234: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/13015 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14983) Getting CompileException when feed an UDF with an array and another paramter
[ https://issues.apache.org/jira/browse/SPARK-14983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277221#comment-15277221 ] Shixiong Zhu commented on SPARK-14983: -- In addition, the Scala type for `ArrayType[String]` is `Seq[String]`. You should use it instead of `Array[String]` as the UDF parameter type. > Getting CompileException when feed an UDF with an array and another paramter > > > Key: SPARK-14983 > URL: https://issues.apache.org/jira/browse/SPARK-14983 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.5.0 > Environment: Spark 1.5.0 on EMR >Reporter: Sky Yin > > I'm trying to apply an UDF to an array column (generated from the {{array}} > funciton in SparkSQL) and another string column and got this error: > (some long source code before this...) > {code} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:392) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:412) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:409) > at > org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 31 more > Caused by: org.codehaus.commons.compiler.CompileException: Line 1038, Column > 28: Redefinition of local variable "values" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2554) > at org.codehaus.janino.UnitCompiler.access$4300(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:2434) > at > org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:2508) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2437) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2395) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2250) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:822) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:794) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:507) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:658) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:662) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:350) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1035) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:769) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:532) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:393) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:347) > at > org.codehaus.janino.Java$PackageMemberClassDeclaration.accept(Java.java:1139) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at 
org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:322) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:383) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:315) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:233) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387) > ... 35 more > {code} > I tried to create another UDF with an array as the only parameter and that > worked well as the array got passed into the Python function as a list > without any problem. > Also I tried to define an UDF with an array and a string as parameters in > Scala and got the same {{org.codehaus.commons.compiler.CompileException: Line 1038, Column 28: Redefinition of local variable "values"}} error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
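A minimal sketch of the workaround suggested in the comment above (the DataFrame and column names are made up): declare the array parameter as Seq[String], the Scala type Spark passes for an array column, rather than Array[String].

{code}
import org.apache.spark.sql.functions.{array, col, udf}

// Assumes a DataFrame `df` with string columns "a", "b" and "sep".
val joinParts = udf((parts: Seq[String], sep: String) => parts.mkString(sep))

val result = df.withColumn("joined", joinParts(array(col("a"), col("b")), col("sep")))
result.show()
{code}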
[jira] [Assigned] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15234: Assignee: (was: Apache Spark) > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15234: Assignee: Apache Spark > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277205#comment-15277205 ] Apache Spark commented on SPARK-15234: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/13014 > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14983) Getting CompileException when feed an UDF with an array and another paramter
[ https://issues.apache.org/jira/browse/SPARK-14983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14983. -- Resolution: Duplicate This has been fixed in SPARK-10461 > Getting CompileException when feed an UDF with an array and another paramter > > > Key: SPARK-14983 > URL: https://issues.apache.org/jira/browse/SPARK-14983 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.5.0 > Environment: Spark 1.5.0 on EMR >Reporter: Sky Yin > > I'm trying to apply an UDF to an array column (generated from the {{array}} > funciton in SparkSQL) and another string column and got this error: > (some long source code before this...) > {code} > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:392) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:412) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:409) > at > org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 31 more > Caused by: org.codehaus.commons.compiler.CompileException: Line 1038, Column > 28: Redefinition of local variable "values" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2554) > at org.codehaus.janino.UnitCompiler.access$4300(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$6.visitLocalVariableDeclarationStatement(UnitCompiler.java:2434) > at > org.codehaus.janino.Java$LocalVariableDeclarationStatement.accept(Java.java:2508) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2437) > at > org.codehaus.janino.UnitCompiler.buildLocalVariableMap(UnitCompiler.java:2395) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2250) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:822) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:794) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:507) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:658) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:662) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:350) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1035) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:769) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:532) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:393) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:347) > at > org.codehaus.janino.Java$PackageMemberClassDeclaration.accept(Java.java:1139) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:354) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:322) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:383) > at > 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:315) > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:233) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:192) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:84) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:77) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:387) > ... 35 more > {code} > I tried to create another UDF with an array as the only parameter and that > worked well as the array got passed into the Python function as a list > without any problem. > Also I tried to define an UDF with an array and a string as parameters in > Scala and got the same {{org.codehaus.commons.compiler.CompileException: Line > 1038, Column 28: Redefinition of local variable "values"}} error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
Andrew Or created SPARK-15234: - Summary: spark.catalog.listDatabases.show() is not formatted correctly Key: SPARK-15234 URL: https://issues.apache.org/jira/browse/SPARK-15234 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or {code} scala> spark.catalog.listDatabases.show() ++---+---+ |name|description|locationUri| ++---+---+ |Database[name='de...| |Database[name='my...| |Database[name='so...| ++---+---+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15234) spark.catalog.listDatabases.show() is not formatted correctly
[ https://issues.apache.org/jira/browse/SPARK-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15234: -- Description: {code} scala> spark.catalog.listDatabases.show() ++---+---+ |name|description|locationUri| ++---+---+ |Database[name='de...| |Database[name='my...| |Database[name='so...| ++---+---+ {code} It's because org.apache.spark.sql.catalog.Database is not a case class! was: {code} scala> spark.catalog.listDatabases.show() ++---+---+ |name|description|locationUri| ++---+---+ |Database[name='de...| |Database[name='my...| |Database[name='so...| ++---+---+ {code} > spark.catalog.listDatabases.show() is not formatted correctly > - > > Key: SPARK-15234 > URL: https://issues.apache.org/jira/browse/SPARK-15234 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or > > {code} > scala> spark.catalog.listDatabases.show() > ++---+---+ > |name|description|locationUri| > ++---+---+ > |Database[name='de...| > |Database[name='my...| > |Database[name='so...| > ++---+---+ > {code} > It's because org.apache.spark.sql.catalog.Database is not a case class! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15233) Spark task metrics should include hdfs read write latency
[ https://issues.apache.org/jira/browse/SPARK-15233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-15233: Affects Version/s: 1.6.1 Priority: Minor (was: Major) Description: Currently, the Spark task metrics do not include HDFS read/write latency. It would be very useful to have these to find bottlenecks in a query. Issue Type: Improvement (was: Bug) Summary: Spark task metrics should include hdfs read write latency (was: Spark UI should show metrics for hdfs read write latency) > Spark task metrics should include hdfs read write latency > - > > Key: SPARK-15233 > URL: https://issues.apache.org/jira/browse/SPARK-15233 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.1 >Reporter: Sital Kedia >Priority: Minor > > Currently, the Spark task metrics do not include HDFS read/write latency. It > would be very useful to have these to find bottlenecks in a query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15233) Spark UI should show metrics for hdfs read write latency
Sital Kedia created SPARK-15233: --- Summary: Spark UI should show metrics for hdfs read write latency Key: SPARK-15233 URL: https://issues.apache.org/jira/browse/SPARK-15233 Project: Spark Issue Type: Bug Reporter: Sital Kedia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15229) Make case sensitivity setting internal
[ https://issues.apache.org/jira/browse/SPARK-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15229: Description: Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive, otherwise it is folded to lower case. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and discourage users from turning it on, effectively making Spark always case insensitive. was: Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive, otherwise it is folded to lower case. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and recommend users from not turning it on, effectively making Spark always case insensitive. > Make case sensitivity setting internal > -- > > Key: SPARK-15229 > URL: https://issues.apache.org/jira/browse/SPARK-15229 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Our case sensitivity support is different from what ANSI SQL standards > support. Postgres' behavior is that if an identifier is quoted, then it is > treated as case sensitive, otherwise it is folded to lower case. We will > likely need to revisit this in the future and change our behavior. For now, > the safest change to do for Spark 2.0 is to make the case sensitive option > internal and discourage users from turning it on, effectively making Spark > always case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
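A minimal sketch of what "always case insensitive" means for users (column names are made up, and it is assumed here that the flag being made internal is spark.sql.caseSensitive); the default shown is the behavior users should rely on.

{code}
// Default (and recommended) setting: analysis is case insensitive.
spark.conf.set("spark.sql.caseSensitive", "false")

val df = spark.range(1).toDF("ID")
df.select("id").show()  // "id" resolves against column "ID" despite the case mismatch
{code}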
[jira] [Resolved] (SPARK-14898) MultivariateGaussian could use Cholesky in calculateCovarianceConstants
[ https://issues.apache.org/jira/browse/SPARK-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14898. --- Resolution: Not A Problem OK someone reopen this if we've misunderstood, but I think this isn't an issue. > MultivariateGaussian could use Cholesky in calculateCovarianceConstants > --- > > Key: SPARK-14898 > URL: https://issues.apache.org/jira/browse/SPARK-14898 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml.stat.distribution.MultivariateGaussian, > calculateCovarianceConstants uses SVD. It might be more efficient to use > Cholesky. We should check other numerical libraries and see if we should > switch to Cholesky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
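A minimal Breeze sketch (toy covariance matrix) of the alternative that was being suggested, assuming the covariance is positive definite: a Cholesky factor yields the log-determinant and cheap solves, which are the kinds of quantities calculateCovarianceConstants currently derives from the SVD. Note that the SVD path also covers singular matrices, which a plain Cholesky factorization would not.

{code}
import breeze.linalg.{DenseMatrix, cholesky}

// Toy 2x2 positive-definite covariance matrix.
val cov = DenseMatrix((2.0, 0.3), (0.3, 1.0))

val L = cholesky(cov)  // cov = L * L.t with L lower-triangular
val logDet = 2.0 * (0 until L.rows).map(i => math.log(L(i, i))).sum

println(logDet)
{code}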
[jira] [Commented] (SPARK-15228) pyspark.RDD.toLocalIterator Documentation
[ https://issues.apache.org/jira/browse/SPARK-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277073#comment-15277073 ] Sean Owen commented on SPARK-15228: --- (What's the bug?) > pyspark.RDD.toLocalIterator Documentation > - > > Key: SPARK-15228 > URL: https://issues.apache.org/jira/browse/SPARK-15228 > Project: Spark > Issue Type: Documentation >Reporter: Ignacio Tartavull >Priority: Trivial > > There is a little bug in the parsing of the documentation of > http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.toLocalIterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
[ https://issues.apache.org/jira/browse/SPARK-15231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277003#comment-15277003 ] Apache Spark commented on SPARK-15231: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/13013 > Document the semantic of saveAsTable and insertInto and don't drop columns > silently > --- > > Key: SPARK-15231 > URL: https://issues.apache.org/jira/browse/SPARK-15231 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > We should document the difference between "insertInto" and "saveAsTable": > "insertInto" uses position-based column resolution, while "saveAsTable > with append" uses name-based resolution. > In addition, for "saveAsTable with append", when trying to add more columns, we > just drop the new columns silently. The correct behavior should be to throw > an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
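A minimal sketch of the two resolution modes being documented (the table and column names are made up; it assumes an existing table t with columns a and b, in that order):

{code}
import spark.implicits._

// DataFrame whose column order differs from the table's (b first, a second).
val df = Seq((1, 2)).toDF("b", "a")

// Position-based: value 1 lands in table column a, value 2 in column b.
df.write.insertInto("t")

// Name-based: column a goes to a and column b goes to b, regardless of position.
df.write.mode("append").saveAsTable("t")
{code}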
[jira] [Assigned] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
[ https://issues.apache.org/jira/browse/SPARK-15231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15231: Assignee: Shixiong Zhu (was: Apache Spark) > Document the semantic of saveAsTable and insertInto and don't drop columns > silently > --- > > Key: SPARK-15231 > URL: https://issues.apache.org/jira/browse/SPARK-15231 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > We should document the difference between "insertInto" and "saveAsTable": > "insertInto" uses position-based column resolution, while "saveAsTable > with append" uses name-based resolution. > In addition, for "saveAsTable with append", when trying to add more columns, we > just drop the new columns silently. The correct behavior should be to throw > an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15232) Add subquery SQL building tests to LogicalPlanToSQLSuite
Herman van Hovell created SPARK-15232: - Summary: Add subquery SQL building tests to LogicalPlanToSQLSuite Key: SPARK-15232 URL: https://issues.apache.org/jira/browse/SPARK-15232 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell Priority: Minor We currently test subquery SQL building using the {{HiveCompatibilitySuite}}. This is not desirable, since SQL building is actually part of sql/core and because we are slowly reducing our dependency on Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
[ https://issues.apache.org/jira/browse/SPARK-15231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15231: Assignee: Apache Spark (was: Shixiong Zhu) > Document the semantic of saveAsTable and insertInto and don't drop columns > silently > --- > > Key: SPARK-15231 > URL: https://issues.apache.org/jira/browse/SPARK-15231 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Apache Spark > > We should document the difference between "insertInto" and "saveAsTable": > "insertInto" uses position-based column resolution, while "saveAsTable > with append" uses name-based resolution. > In addition, for "saveAsTable with append", when trying to add more columns, we > just drop the new columns silently. The correct behavior should be to throw > an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14318) TPCDS query 14 causes Spark SQL to hang
[ https://issues.apache.org/jira/browse/SPARK-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trystan Leftwich updated SPARK-14318: - Description: TPCDS Q14 parses successfully, and plans created successfully. Spark tries to run (I used only 1GB text file), but "hangs". Tasks are extremely slow to process AND all CPUs are used 100% by the executor JVMs. It is very easy to reproduce: 1. Use the spark-sql CLI to run the query 14 (TPCDS) against a database of 1GB text file (assuming you know how to generate the csv data). My command is like this: {noformat} /TestAutomation/downloads/spark-master/bin/spark-sql --driver-memory 10g --verbose --master yarn-client --packages com.databricks:spark-csv_2.10:1.3.0 --executor-memory 8g --num-executors 4 --executor-cores 4 --conf spark.sql.join.preferSortMergeJoin=true --database hadoopds1g -f $f > q14.o{noformat} The Spark console output: {noformat} 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 17.0 (TID 65, bigaperf138.svl.ibm.com, partition 26,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:37 INFO cluster.YarnClientSchedulerBackend: Launching task 65 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:45:37 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 17.0 (TID 62) in 829687 ms on bigaperf138.svl.ibm.com (15/200) 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 17.0 (TID 66, bigaperf138.svl.ibm.com, partition 27,RACK_LOCAL, 4515 bytes) 16/03/31 15:45:52 INFO cluster.YarnClientSchedulerBackend: Launching task 66 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:45:52 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 17.0 (TID 65) in 15505 ms on bigaperf138.svl.ibm.com (16/200) 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 17.0 (TID 67, bigaperf138.svl.ibm.com, partition 28,RACK_LOCAL, 4515 bytes) 16/03/31 15:46:17 INFO cluster.YarnClientSchedulerBackend: Launching task 67 on executor id: 4 hostname: bigaperf138.svl.ibm.com. 16/03/31 15:46:17 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 17.0 (TID 66) in 24929 ms on bigaperf138.svl.ibm.com (17/200) 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 17.0 (TID 68, bigaperf137.svl.ibm.com, partition 29,NODE_LOCAL, 4515 bytes) 16/03/31 15:51:53 INFO cluster.YarnClientSchedulerBackend: Launching task 68 on executor id: 2 hostname: bigaperf137.svl.ibm.com. 16/03/31 15:51:53 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 17.0 (TID 47) in 3775585 ms on bigaperf137.svl.ibm.com (18/200) {noformat} Notice that time durations between tasks are unusually long: 2~5 minutes. 
When looking at the Linux 'perf' tool, two top CPU consumers are: 86.48%java [unknown] 12.41%libjvm.so Using the Java hotspot profiling tools, I am able to show what hotspot methods are (top 5): {noformat} org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() 46.845276 9,654,179 ms (46.8%)9,654,179 ms9,654,179 ms 9,654,179 ms org.apache.spark.unsafe.Platform.copyMemory() 18.631157 3,848,442 ms (18.6%)3,848,442 ms3,848,442 ms3,848,442 ms org.apache.spark.util.collection.CompactBuffer.$plus$eq() 6.8570185 1,418,411 ms (6.9%) 1,418,411 ms1,517,960 ms1,517,960 ms org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeValue() 4.6126328 955,495 ms (4.6%) 955,495 ms 2,153,910 ms 2,153,910 ms org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write() 4.581077949,930 ms (4.6%) 949,930 ms 19,967,510 ms 19,967,510 ms {noformat} So as you can see, the test has been running for 1.5 hours...with 46% CPU spent in the org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten() method. The stacks for top two are: {noformat} Marshalling I java/io/DataOutputStream.writeInt() line 197 org.apache.spark.sql I org/apache/spark/sql/execution/UnsafeRowSerializerInstance$$anon$2.writeValue() line 60 org.apache.spark.storage I org/apache/spark/storage/DiskBlockObjectWriter.write() line 185 org.apache.spark.shuffle I org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.write() line 150 org.apache.spark.scheduler I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 78 I org/apache/spark/scheduler/ShuffleMapTask.runTask() line 46 I org/apache/spark/scheduler/Task.run() line 82 org.apache.spark.executor I org/apache/spark/executor/Executor$TaskRunner.run() line 231 Dispatching Overhead, Standard Library Worker Dispatching I java/util/concurrent/ThreadPoolExecutor.runWorker() line 1142 I java/util/concurrent/ThreadPoolExecutor$Worker.run() line 617 I java/lang/Thread.run() line 745 {noformat} and {noformat} org.apache.spark.unsafe I org/apache/spark
[jira] [Created] (SPARK-15231) Document the semantic of saveAsTable and insertInto and don't drop columns silently
Shixiong Zhu created SPARK-15231: Summary: Document the semantic of saveAsTable and insertInto and don't drop columns silently Key: SPARK-15231 URL: https://issues.apache.org/jira/browse/SPARK-15231 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu We should document the difference between "insertInto" and "saveAsTable": "insertInto" uses position-based column resolution, while "saveAsTable with append" uses name-based resolution. In addition, for "saveAsTable with append", when trying to add more columns, we just drop the new columns silently. The correct behavior should be to throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org