[GitHub] spark pull request: [SPARK-15313][SQL] EmbedSerializerInFilter rul...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13096
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15360][Spark-Submit]Should print spark-...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13163#issuecomment-220523618 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58937/ Test PASSed.
[GitHub] spark pull request: [SPARK-15313][SQL] EmbedSerializerInFilter rul...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13096#issuecomment-220523585 Alright I'm going to merge this in master/2.0. Thanks.
[GitHub] spark pull request: [SPARK-15360][Spark-Submit]Should print spark-...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13163#issuecomment-220523617 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/13045#discussion_r63994025
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
@@ -37,6 +38,14 @@ private[sql] object Column {
   def apply(expr: Expression): Column = new Column(expr)
   def unapply(col: Column): Option[Expression] = Some(col.expr)
+
+  private[sql] def generateAlias(e: Expression, index: Int): String = {
+    e match {
+      case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
+        s"${a.aggregateFunction.prettyName}_c${index}"
--- End diff --
@cloud-fan Looks like the following. Let's go with this? I will drop the index parameter.
```SQL
+-------------------+-------------------+
|TypedSumDouble(int)|TypedSumDouble(int)|
+-------------------+-------------------+
|               11.0|               11.0|
+-------------------+-------------------+
```
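For context, the collision that an index suffix avoids can be shown with a tiny, Spark-free sketch. The object and helper below are illustrative only; the `generateAlias` in the diff above operates on Catalyst expressions, not plain strings:

```scala
// Illustrative sketch (not Spark's implementation): two identical typed
// aggregates would otherwise render to the same column name, so a
// positional suffix keeps the generated aliases distinct.
object AliasSketch {
  def generateAlias(prettyName: String, index: Int): String =
    s"${prettyName}_c$index"

  def main(args: Array[String]): Unit = {
    // Same aggregate applied twice, as in the show() output above.
    val funcs = Seq("typedsumdouble", "typedsumdouble")
    val aliases = funcs.zipWithIndex.map { case (n, i) => generateAlias(n, i) }
    // The suffix disambiguates the otherwise-identical columns.
    assert(aliases.distinct.size == funcs.size)
    println(aliases.mkString(", ")) // prints "typedsumdouble_c0, typedsumdouble_c1"
  }
}
```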
[GitHub] spark pull request: [SPARK-15360][Spark-Submit]Should print spark-...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13163#issuecomment-220523513 **[Test build #58937 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58937/consoleFull)** for PR 13163 at commit [`2941e62`](https://github.com/apache/spark/commit/2941e6273d064376f0e540fa0655c345d9c52461).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user xwu0226 commented on the pull request: https://github.com/apache/spark/pull/13212#issuecomment-220523283 @rxin Updated the code. Please help double check. Thank you very much!
[GitHub] spark pull request: [SPARK-15335] [SQL] Implement TRUNCATE TABLE C...
Github user hvanhovell commented on the pull request: https://github.com/apache/spark/pull/13170#issuecomment-220523237 LGTM
[GitHub] spark pull request: [SPARK-15379][SQL] check special invalid date
Github user wangyang1992 commented on the pull request: https://github.com/apache/spark/pull/13169#issuecomment-220523060 Addressed your comments. @cloud-fan
[GitHub] spark pull request: Revert "[SPARK-10216][SQL] Avoid creating empt...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/13181#issuecomment-220522607
@marmbrus I tested and could reproduce the exceptions for reading in https://issues.apache.org/jira/browse/SPARK-15393, but it seems this might not be the reason. I tested the code below on https://github.com/apache/spark/commit/c0c3ec35476c756e569a1f34c4b258eb0490585c (right before this PR) and on the master branch.
```scala
test("SPARK-15393: create empty file") {
  withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempPath { path =>
      val schema = StructType(
        StructField("k", StringType, true) ::
        StructField("v", IntegerType, false) :: Nil)
      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      emptyDf.write
        .format("parquet")
        .save(path.getCanonicalPath)
      val copyEmptyDf = spark.read
        .format("parquet")
        .load(path.getCanonicalPath)
      copyEmptyDf.show()
    }
  }
}
```
Both produce the exception below:
```scala
Unable to infer schema for ParquetFormat at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-98dfbe86-afca-413e-9be7-46ff18bac443. It must be specified manually;
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-98dfbe86-afca-413e-9be7-46ff18bac443. It must be specified manually;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:324)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:324)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:323)
```
I will try to figure out why, but please feel free to revert this if you think my PR is problematic. I will fix both issues together later anyway.
[GitHub] spark pull request: [SPARK-15165][SPARK-15205][SQL] Introduce plac...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12979#issuecomment-220522052 **[Test build #58945 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58945/consoleFull)** for PR 12979 at commit [`6ebfa10`](https://github.com/apache/spark/commit/6ebfa10d63a2234dae4e567f06279f5e7feb3df9).
[GitHub] spark pull request: [MINOR] Fix Typos
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13078#issuecomment-220521732 @srowen we should backport the doc fixes into branch-2.0.
[GitHub] spark pull request: [SPARK-15206][SQL] add testcases for distinct ...
Github user xwu0226 commented on the pull request: https://github.com/apache/spark/pull/12984#issuecomment-220521657 @cloud-fan Please see if we should add these test cases to the 2.0 branch. It is related to the distinct aggregate in the HAVING clause. Thanks!
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/13200#discussion_r63993278
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala ---
@@ -735,29 +731,130 @@ object SparkSession {
   }

   /**
-   * Gets an existing [[SparkSession]] or, if there is no existing one, creates a new one
-   * based on the options set in this builder.
+   * Gets an existing [[SparkSession]] or, if there is no existing one, creates a new
+   * one based on the options set in this builder.
+   *
+   * This method first checks whether there is a valid thread-local SparkSession,
+   * and if yes, return that one. It then checks whether there is a valid global
+   * default SparkSession, and if yes, return that one. If no valid global default
+   * SparkSession exists, the method creates a new SparkSession and assigns the
+   * newly created SparkSession as the global default.
+   *
+   * In case an existing SparkSession is returned, the config options specified in
+   * this builder will be applied to the existing SparkSession.
    *
    * @since 2.0.0
    */
   def getOrCreate(): SparkSession = synchronized {
-    // Step 1. Create a SparkConf
-    // Step 2. Get a SparkContext
-    // Step 3. Get a SparkSession
-    val sparkConf = new SparkConf()
-    options.foreach { case (k, v) => sparkConf.set(k, v) }
-    val sparkContext = SparkContext.getOrCreate(sparkConf)
-
-    SQLContext.getOrCreate(sparkContext).sparkSession
+    // Get the session from current thread's active session.
+    var session = activeThreadSession.get()
+    if ((session ne null) && !session.sparkContext.isStopped) {
+      options.foreach { case (k, v) => session.conf.set(k, v) }
+      return session
+    }
+
+    // Global synchronization so we will only set the default session once.
+    SparkSession.synchronized {
+      // If the current thread does not have an active session, get it from the global session.
+      session = defaultSession.get()
+      if ((session ne null) && !session.sparkContext.isStopped) {
+        options.foreach { case (k, v) => session.conf.set(k, v) }
+        return session
+      }
+
+      // No active nor global default session. Create a new one.
+      val sparkContext = userSuppliedContext.getOrElse {
+        // set app name if not given
+        if (!options.contains("spark.app.name")) {
+          options += "spark.app.name" -> java.util.UUID.randomUUID().toString
+        }
+
+        val sparkConf = new SparkConf()
+        options.foreach { case (k, v) => sparkConf.set(k, v) }
+        SparkContext.getOrCreate(sparkConf)
+      }
+      session = new SparkSession(sparkContext)
+      options.foreach { case (k, v) => session.conf.set(k, v) }
+      defaultSession.set(session)
--- End diff --
@rxin Ok. Got it. Thank you.
[GitHub] spark pull request: [SPARK-8603][SPARKR] Incorrect file separator ...
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/13165#issuecomment-220521666
1. A rough scan of the test failures shows most of them are probably related to path handling. You can replay the failed test case in R on Windows. For debug facilities in R, refer to http://www.inside-r.org/r-doc/base/traceback, https://stat.ethz.ch/R-manual/R-devel/library/base/html/debug.html, http://www.inside-r.org/r-doc/base/browser
2. You can add a new file named test_Windows.R under R/pkg/inst/tests/testthat
3. Sure.
[GitHub] spark pull request: [DOC][MINOR] ml.feature Scala and Python API s...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13159#issuecomment-220521612 @MLnick did you actually merge this in 2.0?
[GitHub] spark pull request: [SPARK-15165][SPARK-15205][SQL] Introduce plac...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12979#issuecomment-220521459 **[Test build #58944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58944/consoleFull)** for PR 12979 at commit [`3c49567`](https://github.com/apache/spark/commit/3c495670716ecf63d21110f7c7ee93500051d26a).
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13200#discussion_r63993145
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala ---
[Quoted diff omitted: the same `getOrCreate` hunk shown in the earlier comment on this pull request.]
--- End diff --
We would create a new one in that case ... I'm not too worried about the legacy corner cases here though.
[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/13188#discussion_r63993020
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala ---
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark.tpcds
+
+import java.io.File
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark to measure TPCDS query performance.
+ * To run this:
+ * spark-submit --class --jars
+ */
+object TPCDSQueryBenchmark {
+  val conf =
+    new SparkConf()
+      .setMaster("local[1]")
+      .setAppName("test-sql-context")
+      .set("spark.sql.parquet.compression.codec", "snappy")
+      .set("spark.sql.shuffle.partitions", "4")
+      .set("spark.driver.memory", "3g")
+      .set("spark.executor.memory", "3g")
+      .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
+
+  val spark = SparkSession.builder.config(conf).getOrCreate()
+
+  val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address",
+    "customer_demographics", "date_dim", "household_demographics", "inventory", "item",
+    "promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales",
+    "web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band",
+    "time_dim", "web_page")
+
+  def setupTables(dataLocation: String): Map[String, Long] = {
+    tables.map { tableName =>
+      spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName)
+      tableName -> spark.table(tableName).count()
+    }.toMap
+  }
+
+  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
+    require(dataLocation.nonEmpty,
+      "please modify the value of dataLocation to point to your local TPCDS data")
+    val tableSizes = setupTables(dataLocation)
+    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
+    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
+    queries.foreach { name =>
+      val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" +
        s"execution/benchmark/tpcds/queries/$name.sql"))
+
+      // This is an indirect hack to estimate the size of each query's input by traversing the
+      // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
+      // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate
+      // per-row processing time for those cases.
+      val queryRelations = scala.collection.mutable.HashSet[String]()
+      spark.sql(queriesString).queryExecution.logical.map {
+        case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
+          queryRelations.add(t.table)
+        case lp: LogicalPlan =>
+          lp.expressions.foreach { _ foreach {
+            case subquery: SubqueryExpression =>
+              subquery.plan.foreach {
+                case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
+                  queryRelations.add(t.table)
+                case _ =>
+              }
+            case _ =>
+          }
+          }
+        case _ =>
+      }
+      val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum
+      val benchmark = new Benchmark("TPCDS Snappy", numRows, 5)
+
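The input-size estimation idea quoted above can be sketched independently of Spark. This is an illustrative reduction of the technique, not the benchmark's actual code; the object and parameter names are invented:

```scala
// Standalone sketch of the estimation hack from the benchmark above:
// collect the distinct table names a query references and sum their
// known row counts; unknown tables contribute zero rows.
object InputSizeEstimate {
  def estimatedRows(referencedTables: Seq[String], tableSizes: Map[String, Long]): Long =
    referencedTables.distinct.map(tableSizes.getOrElse(_, 0L)).sum

  def main(args: Array[String]): Unit = {
    val sizes = Map("store_sales" -> 1000L, "date_dim" -> 365L)
    // A query scanning store_sales twice and date_dim once counts each
    // table only once, mirroring the HashSet in the benchmark.
    val rows = estimatedRows(Seq("store_sales", "date_dim", "store_sales", "unknown"), sizes)
    println(rows) // prints "1365"
  }
}
```

As the quoted comment notes, this style of estimate ignores WITH subqueries, so per-row timings derived from it can be inaccurate for such queries.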
[GitHub] spark pull request: [SPARK-15313][SQL] EmbedSerializerInFilter rul...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/13096#issuecomment-220521107 LGTM, thanks for finding this bug!
[GitHub] spark pull request: [SPARK-15057][GRAPHX] Remove stale TODO commen...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/12839#issuecomment-220521039 Since this is very low risk, I'm going to cherry-pick this in branch-2.0 too to minimize the diff.
[GitHub] spark pull request: [SPARK-15313][SQL] EmbedSerializerInFilter rul...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13096#discussion_r63992991
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1560,7 +1561,14 @@ object EmbedSerializerInFilter extends Rule[LogicalPlan] {
     val newCondition = condition transform {
       case a: Attribute if a == d.output.head => d.deserializer
     }
-    Filter(newCondition, d.child)
+    val filter = Filter(newCondition, d.child)
+
+    // Adds an extra Project here, to preserve the output expr id of `SerializeFromObject`.
+    // We will remove it later in RemoveAliasOnlyProject rule.
+    val objAttrs = filter.output.zip(s.output).map { case (fout, sout) =>
--- End diff --
I'd say it's not object attributes, maybe we should just name it `attrs`
[GitHub] spark pull request: [SPARK-15308][SQL] RowEncoder should preserve ...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/13090#issuecomment-220520673 LGTM except one style comment, thanks for working on it!
[GitHub] spark pull request: [SPARK-15308][SQL] RowEncoder should preserve ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13090#discussion_r63992806
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala ---
@@ -149,12 +149,12 @@ object RowEncoder {
       dataType = t)

     case StructType(fields) =>
-      val convertedFields = fields.zipWithIndex.map { case (f, i) =>
+      val convertedFields = fields.zipWithIndex.flatMap { case (f, i) =>
--- End diff --
We can follow the style in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L527:
```
val nonNullOutput = CreateNamedStruct(fields.zipWithIndex.flatMap { case (field, index) =>
  ...
}
```
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/13200#discussion_r63992730
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala ---
[Quoted diff omitted: the same `getOrCreate` hunk shown in the earlier comment on this pull request.]
--- End diff --
@rxin Hi Reynold, I had a minor question just for my understanding. When users do a new SQLContext(), we create an implicit SparkSession. Should this session be made the defaultSession? If we call 1) new SQLContext, then 2) builder.getOrCreate(), what is the expected behaviour?
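The lookup order the diff implements (thread-local active session first, then the global default, then create-and-promote) can be shown with a minimal, Spark-free sketch. The `Session` and `SessionRegistry` names here are invented for illustration and deliberately omit the config-propagation and stopped-context checks:

```scala
import java.util.concurrent.atomic.AtomicReference

final class Session(val id: Int)

object SessionRegistry {
  // Thread-local "active" session, checked first.
  private val activeThreadSession = new ThreadLocal[Session]
  // Process-wide default session, checked second.
  private val defaultSession = new AtomicReference[Session](null)
  private var counter = 0

  def setActive(s: Session): Unit = activeThreadSession.set(s)

  def getOrCreate(): Session = synchronized {
    val active = activeThreadSession.get()
    if (active != null) return active
    val default = defaultSession.get()
    if (default != null) return default
    // No active nor default session: create one and promote it to default.
    counter += 1
    val created = new Session(counter)
    defaultSession.set(created)
    created
  }
}

object SessionRegistryDemo {
  def main(args: Array[String]): Unit = {
    val a = SessionRegistry.getOrCreate() // creates and becomes the default
    val b = SessionRegistry.getOrCreate() // returns that same default
    println(s"same session: ${a eq b}") // prints "same session: true"
  }
}
```

The question raised in the comment is precisely which step of this chain a session created implicitly by `new SQLContext()` should participate in, i.e. whether it should be promoted to the default.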
[GitHub] spark pull request: [SPARK-14261][SQL] Memory leak in Spark Thrift...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12932
[GitHub] spark pull request: [SPARK-14261][SQL] Memory leak in Spark Thrift...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/12932#issuecomment-220519882 I'm going to merge this in master/2.0/1.6. Thanks.
[GitHub] spark pull request: [SPARK-15433][PySpark] PySpark core test shoul...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13214#issuecomment-220519810 **[Test build #58943 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58943/consoleFull)** for PR 13214 at commit [`760a4cd`](https://github.com/apache/spark/commit/760a4cda44585c6039fd8954fc43d57174a5cf27).
[GitHub] spark pull request: [SPARK-15282][SQL] PushDownPredicate should no...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13087#discussion_r63992511

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1025,7 +1025,8 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
     // state and all the input rows processed before. In another word, the order of input rows
     // matters for non-deterministic expressions, while pushing down predicates changes the order.
     case filter @ Filter(condition, project @ Project(fields, grandChild))
-      if fields.forall(_.deterministic) =>
+      if fields.forall(_.deterministic) &&
+        fields.forall(_.find(_.isInstanceOf[ScalaUDF]).isEmpty) =>
--- End diff --

I'm not sure if I understand this correctly: do you mean `ScalaUDF` can be nondeterministic and we should always treat it as a nondeterministic expression? If so, I think a better idea is to just override `deterministic` in `ScalaUDF` and always return false.
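The guard being debated above reduces to a simple predicate: a filter may be pushed below a project only when every projected field is deterministic. A minimal sketch, using a hypothetical mini-AST in Python rather than Catalyst's actual expression classes, with cloud-fan's proposed alternative (ScalaUDF always reports nondeterministic) baked in:

```python
# Hypothetical mini-AST illustrating the pushdown guard (not Catalyst code).
class Attr:
    deterministic = True     # plain column reference

class Rand:
    deterministic = False    # inherently nondeterministic

class ScalaUDF:
    # The alternative proposed in the comment: always report False,
    # instead of special-casing ScalaUDF in the rule's pattern guard.
    deterministic = False

def can_push_filter_below_project(fields):
    # Pushing a filter below a project reorders the input rows seen by
    # the projected expressions, which is only safe if they are all
    # deterministic.
    return all(f.deterministic for f in fields)
```

With this design, the optimizer rule keeps its single `fields.forall(_.deterministic)` condition and the UDF-specific knowledge lives in the expression itself.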
[GitHub] spark pull request: [SPARK-15313][SQL] EmbedSerializerInFilter rul...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13096#issuecomment-220519361 cc @cloud-fan
[GitHub] spark pull request: [SPARK-15424][SQL] Revert SPARK-14807 Create a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13207#issuecomment-220519275 **[Test build #58942 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58942/consoleFull)** for PR 13207 at commit [`c77616e`](https://github.com/apache/spark/commit/c77616e1ecb020e0657813fa6f14d6aa7f4688d4).
[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13188#discussion_r63992275 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala --- @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark.tpcds + +import java.io.File + +import org.apache.spark.SparkConf +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.Benchmark + +/** + * Benchmark to measure TPCDS query performance. 
+ * To run this: + * spark-submit --class --jars + */ +object TPCDSQueryBenchmark { + val conf = +new SparkConf() + .setMaster("local[1]") + .setAppName("test-sql-context") + .set("spark.sql.parquet.compression.codec", "snappy") + .set("spark.sql.shuffle.partitions", "4") + .set("spark.driver.memory", "3g") + .set("spark.executor.memory", "3g") + .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString) + + val spark = SparkSession.builder.config(conf).getOrCreate() + + val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address", +"customer_demographics", "date_dim", "household_demographics", "inventory", "item", +"promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales", +"web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band", +"time_dim", "web_page") + + def setupTables(dataLocation: String): Map[String, Long] = { +tables.map { tableName => + spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName) + tableName -> spark.table(tableName).count() +}.toMap + } + + def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = { +require(dataLocation.nonEmpty, + "please modify the value of dataLocation to point to your local TPCDS data") +val tableSizes = setupTables(dataLocation) +spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true") +spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true") +queries.foreach { name => + val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" + --- End diff -- one thing - these files should go into test/resources, and then we can get their path using the getresource function on the current thread's classloader. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
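rxin's suggestion above is to bundle the query files under test/resources and resolve them through the classloader (in Scala, `Thread.currentThread().getContextClassLoader.getResource(...)`) rather than a hard-coded source-tree path. A hedged sketch of the same pattern, transposed to Python's resource machinery for illustration; the package and resource names here are assumptions:

```python
# Resolve a bundled query file via package resources instead of a
# hard-coded filesystem path (Python analogue of classloader getResource).
import importlib.resources

def query_text(package, name):
    # Returns the SQL text of "<name>.sql" shipped inside `package`,
    # or None if no such resource exists.
    ref = importlib.resources.files(package) / f"{name}.sql"
    return ref.read_text() if ref.is_file() else None
```

The benefit in either language is the same: the benchmark keeps working when run from a packaged jar/wheel, not only from a source checkout.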
[GitHub] spark pull request: [SPARK-15433][PySpark] PySpark core test shoul...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/13214 [SPARK-15433][PySpark] PySpark core test should not use SerDe from PythonMLLibAPI ## What changes were proposed in this pull request? Currently the PySpark core test uses the `SerDe` from `PythonMLLibAPI`, which pulls in many MLlib things. It should use `SerDeUtil` instead. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 pycore-use-serdeutil Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13214.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13214 commit 760a4cda44585c6039fd8954fc43d57174a5cf27 Author: Liang-Chi Hsieh Date: 2016-05-20T05:12:47Z PySpark core test should not use SerDe from PythonMLLibAPI.
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63992232

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1387,6 +1387,27 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   }

   /**
+   * Return a list of file paths that are added to resources.
+   * If file paths are provided, return the ones that are added to resources.
+   */
+  def listFiles(files: Seq[String] = Seq.empty[String]): Seq[String] = {
--- End diff --

@rxin OK. Thanks!
[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13188#discussion_r63992224 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala --- @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark.tpcds + +import java.io.File + +import org.apache.spark.SparkConf +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.Benchmark + +/** + * Benchmark to measure TPCDS query performance. 
+ * To run this: + * spark-submit --class --jars + */ +object TPCDSQueryBenchmark { + val conf = +new SparkConf() + .setMaster("local[1]") + .setAppName("test-sql-context") + .set("spark.sql.parquet.compression.codec", "snappy") + .set("spark.sql.shuffle.partitions", "4") + .set("spark.driver.memory", "3g") + .set("spark.executor.memory", "3g") + .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString) + + val spark = SparkSession.builder.config(conf).getOrCreate() + + val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address", +"customer_demographics", "date_dim", "household_demographics", "inventory", "item", +"promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales", +"web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band", +"time_dim", "web_page") + + def setupTables(dataLocation: String): Map[String, Long] = { +tables.map { tableName => + spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName) + tableName -> spark.table(tableName).count() +}.toMap + } + + def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = { +require(dataLocation.nonEmpty, + "please modify the value of dataLocation to point to your local TPCDS data") +val tableSizes = setupTables(dataLocation) +spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true") +spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true") +queries.foreach { name => + val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" + +s"execution/benchmark/tpcds/queries/$name.sql")) + + // This is an indirect hack to estimate the size of each query's input by traversing the + // logical plan and adding up the sizes of all tables that appear in the plan. Note that this + // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate + // per-row processing time for those cases. 
+ val queryRelations = scala.collection.mutable.HashSet[String]() + spark.sql(queriesString).queryExecution.logical.map { +case ur @ UnresolvedRelation(t: TableIdentifier, _) => + queryRelations.add(t.table) +case lp: LogicalPlan => + lp.expressions.foreach { _ foreach { +case subquery: SubqueryExpression => + subquery.plan.foreach { +case ur @ UnresolvedRelation(t: TableIdentifier, _) => + queryRelations.add(t.table) +case _ => + } +case _ => + } +} +case _ => + } + val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum + val benchmark = new Benchmark("TPCDS Snappy", numRows, 5) + benchmark.addCase(name) {
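The "indirect hack" described in the comment above boils down to: walk the logical plan (including subquery plans), collect every table name it references, and sum the pre-computed per-table row counts. A rough sketch with hypothetical toy node types standing in for Spark's `LogicalPlan`:

```python
# Toy plan nodes (stand-ins for UnresolvedRelation and other LogicalPlan
# nodes) used to illustrate the input-size estimation hack.
class Relation:
    def __init__(self, table):
        self.table = table
        self.children = []

class Node:
    def __init__(self, *children):
        self.table = None
        self.children = list(children)

def estimate_input_rows(plan, table_sizes):
    # Collect every distinct table the plan references, then sum their
    # pre-computed row counts; unknown tables contribute 0.
    tables = set()
    stack = [plan]
    while stack:
        node = stack.pop()
        if node.table is not None:
            tables.add(node.table)
        stack.extend(node.children)
    return sum(table_sizes.get(t, 0) for t in tables)
```

As the benchmark's comment notes, an estimate like this misses WITH (CTE) subqueries and counts each table once even if it is scanned several times, so per-row timings are approximate.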
[GitHub] spark pull request: [SPARK-15363][ML][Example]:Example code should...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13213#issuecomment-220519177 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58941/ Test PASSed.
[GitHub] spark pull request: [SPARK-14990][SQL] nvl, coalesce, array with p...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12768
[GitHub] spark pull request: [SPARK-15363][ML][Example]:Example code should...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13213#issuecomment-220519176 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-15363][ML][Example]:Example code should...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13213#issuecomment-220519138 **[Test build #58941 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58941/consoleFull)** for PR 13213 at commit [`818dc7f`](https://github.com/apache/spark/commit/818dc7fe8f1be835243de8d096d43b229e356cbc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14990][SQL] Fix checkForSameTypeInputEx...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13208
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63992144

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1387,6 +1387,27 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   }

   /**
+   * Return a list of file paths that are added to resources.
+   * If file paths are provided, return the ones that are added to resources.
+   */
+  def listFiles(files: Seq[String] = Seq.empty[String]): Seq[String] = {
--- End diff --

No, they can filter themselves easily.
[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/13188#discussion_r63992090 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala --- @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.benchmark.tpcds + +import java.io.File + +import org.apache.spark.SparkConf +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.util._ +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.util.Benchmark + +/** + * Benchmark to measure TPCDS query performance. 
+ * To run this: + * spark-submit --class --jars + */ +object TPCDSQueryBenchmark { + val conf = +new SparkConf() + .setMaster("local[1]") + .setAppName("test-sql-context") + .set("spark.sql.parquet.compression.codec", "snappy") + .set("spark.sql.shuffle.partitions", "4") + .set("spark.driver.memory", "3g") + .set("spark.executor.memory", "3g") + .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString) + + val spark = SparkSession.builder.config(conf).getOrCreate() + + val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address", +"customer_demographics", "date_dim", "household_demographics", "inventory", "item", +"promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales", +"web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band", +"time_dim", "web_page") + + def setupTables(dataLocation: String): Map[String, Long] = { +tables.map { tableName => + spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName) + tableName -> spark.table(tableName).count() +}.toMap + } + + def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = { +require(dataLocation.nonEmpty, + "please modify the value of dataLocation to point to your local TPCDS data") +val tableSizes = setupTables(dataLocation) +spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true") +spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true") +queries.foreach { name => + val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" + +s"execution/benchmark/tpcds/queries/$name.sql")) + + // This is an indirect hack to estimate the size of each query's input by traversing the + // logical plan and adding up the sizes of all tables that appear in the plan. Note that this + // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate + // per-row processing time for those cases. 
+ val queryRelations = scala.collection.mutable.HashSet[String]() + spark.sql(queriesString).queryExecution.logical.map { +case ur @ UnresolvedRelation(t: TableIdentifier, _) => + queryRelations.add(t.table) +case lp: LogicalPlan => + lp.expressions.foreach { _ foreach { +case subquery: SubqueryExpression => + subquery.plan.foreach { +case ur @ UnresolvedRelation(t: TableIdentifier, _) => + queryRelations.add(t.table) +case _ => + } +case _ => + } +} +case _ => + } + val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum + val benchmark = new Benchmark("TPCDS Snappy", numRows, 5) +
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63992055

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1387,6 +1387,27 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   }

   /**
+   * Return a list of file paths that are added to resources.
+   * If file paths are provided, return the ones that are added to resources.
+   */
+  def listFiles(files: Seq[String] = Seq.empty[String]): Seq[String] = {
--- End diff --

@rxin Just one concern about this one. It is possible that users just invoked listFiles or listJars directly with sparkContext. Do we want to provide filtering for this case? Right now, I have a [test case](https://github.com/xwu0226/spark/blob/21b092ab84b22abec93fde1fc1ca177db68d9f0d/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L159-L176) that covers this case.
[GitHub] spark pull request: [SPARK-14990][SQL] Fix checkForSameTypeInputEx...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13208#issuecomment-220518940 Merging in master/2.0.
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/13045#discussion_r63992015

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
@@ -37,6 +38,14 @@ private[sql] object Column {
   def apply(expr: Expression): Column = new Column(expr)

   def unapply(col: Column): Option[Expression] = Some(col.expr)
+
+  private[sql] def generateAlias(e: Expression, index: Int): String = {
+    e match {
+      case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] =>
+        s"${a.aggregateFunction.prettyName}_c${index}"
--- End diff --

ok.. let me get the output for you and paste it here so it's easier to decide.
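The naming scheme in the diff above concatenates the aggregate function's pretty name with a positional suffix, producing aliases such as `typedsumdouble_c1`. A tiny illustrative sketch (the lowercasing step is an assumption here; in the real diff `prettyName` is used as-is):

```python
# Hypothetical version of the alias generator discussed above:
# "<pretty name of the aggregate function>_c<positional index>".
def generate_alias(pretty_name, index):
    return f"{pretty_name.lower()}_c{index}"
```

This is what makes two different typed aggregates over the same Dataset distinguishable by position, e.g. `_c1` and `_c2`, which is the behaviour debated later in this thread.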
[GitHub] spark pull request: [SPARK-14990][SQL] Fix checkForSameTypeInputEx...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/13208#issuecomment-220518610 LGTM
[GitHub] spark pull request: [SPARK-15363][ML][Example]:Example code should...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13213#issuecomment-220518293 **[Test build #58941 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58941/consoleFull)** for PR 13213 at commit [`818dc7f`](https://github.com/apache/spark/commit/818dc7fe8f1be835243de8d096d43b229e356cbc).
[GitHub] spark pull request: [SPARK-15363][ML][Example]:Example code should...
GitHub user wangmiao1981 opened a pull request: https://github.com/apache/spark/pull/13213 [SPARK-15363][ML][Example]: Example code shouldn't use VectorImplicits._, asML/fromML ## What changes were proposed in this pull request? In this DataFrame example, we use VectorImplicits._, which is a private API. Since the Vectors object has a public API, we use Vectors.fromML instead of the implicits. ## How was this patch tested? Manually ran the example. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangmiao1981/spark ml Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13213 commit 818dc7fe8f1be835243de8d096d43b229e356cbc Author: wm...@hotmail.com Date: 2016-05-20T05:00:35Z remove VectorImplicits in example
[GitHub] spark pull request: [SPARK-15379][SQL] check special invalid date
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/13169#issuecomment-220518098 LGTM, except for some minor comments. Thanks for working on it!
[GitHub] spark pull request: [SPARK-15379][SQL] check special invalid date
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13169#discussion_r63991642 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala --- @@ -353,6 +353,20 @@ class DateTimeUtilsSuite extends SparkFunSuite { c.getTimeInMillis * 1000 + 123456) } + test("SPARK-15379 :special invalid date string") { --- End diff -- nit: `SPARK-15379: special ...`
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/13045#discussion_r63991554 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala --- @@ -240,4 +240,15 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext { val df2 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") checkAnswer(df2.agg(RowAgg.toColumn as "b").select("b"), Row(6) :: Nil) } + + test("spark-15114 shorter system generated alias names") { +val ds = Seq(1, 3, 2, 5).toDS() +assert(ds.select(typed.sum((i: Int) => i)).columns.head === "typedsumdouble_c1") +val ds2 = ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)) +assert(ds2.columns.head === "typedsumdouble_c1") --- End diff -- @cloud-fan Just wanted to show the user some difference between two aggregate expressions: e.g. sum(col1) and sum(col2) will show up as typedsumdouble_c1 and typedsumdouble_c2. Do you think it's fine to just report the name without any suffix? If you think it's OK, then maybe we can just create resolved Aliases in Column.named as opposed to deferring it to the Analyzer? Please let me know.
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13210#issuecomment-220517869 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58936/ Test FAILed.
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13210#issuecomment-220517868 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13210#issuecomment-220517778 **[Test build #58936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58936/consoleFull)** for PR 13210 at commit [`65d45a9`](https://github.com/apache/spark/commit/65d45a947e905ee14fd8a7556032dd5035182648). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-15379][SQL] check special invalid date
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13169#discussion_r63991470 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -426,6 +426,26 @@ object DateTimeUtils { } /** + * Return true if the date is invalid. + */ + private def checkInvalidDate(year: Int, month: Int, day: Int): Boolean = { --- End diff -- nit: as it returns boolean, I think `isInvalidDate` is a better name
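For illustration (not part of the PR under review), a helper with the suggested `isInvalidDate` name could look like the following minimal sketch. The field-range and leap-year logic here is a hypothetical simplification; the actual patch in DateTimeUtils may differ.

```scala
// Hypothetical sketch of the validity check discussed above: a date is
// invalid if any field is out of range, taking leap years into account.
object DateCheck {
  private val daysInMonth = Array(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

  private def isLeapYear(year: Int): Boolean =
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0

  def isInvalidDate(year: Int, month: Int, day: Int): Boolean = {
    if (month < 1 || month > 12 || day < 1) return true
    // February gains a day in leap years; all other months have fixed lengths.
    val maxDay = if (month == 2 && isLeapYear(year)) 29 else daysInMonth(month - 1)
    day > maxDay
  }
}
```

With this shape, `isInvalidDate(2015, 2, 29)` is true while `isInvalidDate(2016, 2, 29)` is false, matching the "special invalid date" cases the test suite targets.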
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63991418 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala --- @@ -46,3 +46,33 @@ case class AddFile(path: String) extends RunnableCommand { Seq.empty[Row] } } + +/** + * Return a list of file paths that are added to resources. + * If file paths are provided, return the ones that are added to resources. + */ +case class ListFiles(files: Seq[String] = Seq.empty[String]) extends RunnableCommand { --- End diff -- @rxin Thank you very much! I will make the change.
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63991451 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1387,6 +1387,27 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli } /** +* Return a list of file paths that are added to resources. +* If file paths are provided, return the ones that are added to resources. +*/ + def listFiles(files: Seq[String] = Seq.empty[String]): Seq[String] = { --- End diff -- Agree. Thanks!
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63991431 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala --- @@ -46,3 +46,33 @@ case class AddFile(path: String) extends RunnableCommand { Seq.empty[Row] } } + +/** + * Return a list of file paths that are added to resources. + * If file paths are provided, return the ones that are added to resources. + */ +case class ListFiles(files: Seq[String] = Seq.empty[String]) extends RunnableCommand { + override val output: Seq[Attribute] = { +val schema = StructType( + StructField("result", StringType, nullable = false) :: Nil) +schema.toAttributes + } + override def run(sparkSession: SparkSession): Seq[Row] = { +sparkSession.sparkContext.listFiles(files).map(Row(_)) + } +} + +/** + * Return a list of jar files that are added to resources. + * If jar files are provided, return the ones that are added to resources. + */ +case class ListJars(jars: Seq[String] = Seq.empty[String]) extends RunnableCommand { --- End diff -- Will change!
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13156#issuecomment-220517365 **[Test build #58940 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58940/consoleFull)** for PR 13156 at commit [`20d5055`](https://github.com/apache/spark/commit/20d50556c6a3a4ca2d69f961822a2bb058edbbec).
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63991280 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1387,6 +1387,27 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli } /** +* Return a list of file paths that are added to resources. +* If file paths are provided, return the ones that are added to resources. +*/ + def listFiles(files: Seq[String] = Seq.empty[String]): Seq[String] = { --- End diff -- i think this one should not take any parameter, and if you need filtering, just do it in ListFilesCommand
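For illustration (not from the PR itself), a minimal sketch of that suggestion, using hypothetical stand-in names rather than the real SparkContext API: the context exposes the full list with no parameter, and the command narrows the result itself.

```scala
// Hypothetical sketch: the context returns everything it knows about,
// and the command does any requested filtering.
class FakeContext(added: Seq[String]) {
  def listFiles(): Seq[String] = added  // no filtering parameter here
}

case class ListFilesCommand(files: Seq[String] = Seq.empty) {
  def run(ctx: FakeContext): Seq[String] = {
    val all = ctx.listFiles()
    if (files.isEmpty) all
    // Keep only the added paths that match one of the requested names.
    else all.filter(path => files.exists(f => path.endsWith(f)))
  }
}
```

This keeps the context API minimal while the SQL layer still supports `LIST FILES a.txt b.txt`-style lookups.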
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63991284 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1724,6 +1745,22 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli postEnvironmentUpdate() } + /** +* Return a list of jar files that are added to resources. +* If jar files are provided, return the ones that are added to resources. +*/ + def listJars(jars: Seq[String] = Seq.empty[String]): Seq[String] = { --- End diff -- ditto
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13045#discussion_r63991309 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala --- @@ -37,6 +38,14 @@ private[sql] object Column { def apply(expr: Expression): Column = new Column(expr) def unapply(col: Column): Option[Expression] = Some(col.expr) + + private[sql] def generateAlias(e: Expression, index: Int): String = { +e match { + case a: AggregateExpression if a.aggregateFunction.isInstanceOf[TypedAggregateExpression] => +s"${a.aggregateFunction.prettyName}_c${index}" --- End diff -- how about `aggregateFunction.toString`? It carries more information and not that verbose.
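For illustration (not part of the patch), the naming scheme under discussion amounts to the following standalone sketch, with the aggregate function's pretty name passed in as a plain string rather than extracted from a real `AggregateExpression`:

```scala
// Hypothetical standalone form of the alias scheme discussed above:
// the function's pretty name plus a positional suffix `_c<index>`.
def generateAlias(prettyName: String, index: Int): String =
  s"${prettyName}_c$index"
```

So the first typed sum in a select list would be named `typedsumdouble_c1`, the second `typedsumdouble_c2`, and so on, which is what the test in this PR asserts.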
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13045#discussion_r63991215 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala --- @@ -240,4 +240,15 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext { val df2 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") checkAnswer(df2.agg(RowAgg.toColumn as "b").select("b"), Row(6) :: Nil) } + + test("spark-15114 shorter system generated alias names") { +val ds = Seq(1, 3, 2, 5).toDS() +assert(ds.select(typed.sum((i: Int) => i)).columns.head === "typedsumdouble_c1") +val ds2 = ds.select(typed.sum((i: Int) => i), typed.avg((i: Int) => i)) +assert(ds2.columns.head === "typedsumdouble_c1") --- End diff -- I'm not sure how useful this `_c1` postfix is, maybe we can remove it and simplify the `aliasFunc`?
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63991184 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala --- @@ -46,3 +46,33 @@ case class AddFile(path: String) extends RunnableCommand { Seq.empty[Row] } } + +/** + * Return a list of file paths that are added to resources. + * If file paths are provided, return the ones that are added to resources. + */ +case class ListFiles(files: Seq[String] = Seq.empty[String]) extends RunnableCommand { + override val output: Seq[Attribute] = { +val schema = StructType( + StructField("result", StringType, nullable = false) :: Nil) +schema.toAttributes + } + override def run(sparkSession: SparkSession): Seq[Row] = { +sparkSession.sparkContext.listFiles(files).map(Row(_)) + } +} + +/** + * Return a list of jar files that are added to resources. + * If jar files are provided, return the ones that are added to resources. + */ +case class ListJars(jars: Seq[String] = Seq.empty[String]) extends RunnableCommand { --- End diff -- ListJarsCommand
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13212#discussion_r63991179 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala --- @@ -46,3 +46,33 @@ case class AddFile(path: String) extends RunnableCommand { Seq.empty[Row] } } + +/** + * Return a list of file paths that are added to resources. + * If file paths are provided, return the ones that are added to resources. + */ +case class ListFiles(files: Seq[String] = Seq.empty[String]) extends RunnableCommand { --- End diff -- ListFilesCommand
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13200
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13045#discussion_r63991134 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala --- @@ -325,10 +325,13 @@ case class UnresolvedExtractValue(child: Expression, extraction: Expression) * Holds the expression that has yet to be aliased. * * @param child The computation that is needs to be resolved during analysis. - * @param aliasName The name if specified to be associated with the result of computing [[child]] + * @param aliasFunc The function if specified to be called to generate an alias to associate --- End diff -- we need to say more about the 2 parameters this `aliasFunc` takes.
[GitHub] spark pull request: [SPARK-15236][SQL][SPARK SHELL] Add spark-defa...
Github user xwu0226 commented on the pull request: https://github.com/apache/spark/pull/13088#issuecomment-220516958 @rxin @yhuai @andrewor14 Please help check if the updated change is in the right direction. Thank you very much!
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13200#issuecomment-220516955 Thanks - merging in master/2.0.
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/13156#issuecomment-220516949 retest this please
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/13045#discussion_r63991063 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -174,14 +174,16 @@ class Analyzer( private def assignAliases(exprs: Seq[NamedExpression]) = { exprs.zipWithIndex.map { case (expr, i) => - expr.transformUp { case u @ UnresolvedAlias(child, optionalAliasName) => + expr.transformUp { case u @ UnresolvedAlias(child, optGenAliasFunc) => child match { case ne: NamedExpression => ne case e if !e.resolved => u case g: Generator => MultiAlias(g, Nil) case c @ Cast(ne: NamedExpression, _) => Alias(c, ne.name)() case e: ExtractValue => Alias(e, toPrettySQL(e))() - case e => Alias(e, optionalAliasName.getOrElse(toPrettySQL(e)))() + case e if optGenAliasFunc.isDefined => +Alias(child, s"${optGenAliasFunc.get.apply(e, i + 1)}")() --- End diff -- nit: we can just use `optGenAliasFunc.get.apply(e, i + 1)`, no need to wrap it with `s"${}"` ...
[GitHub] spark pull request: [SPARK-15313][SQL] EmbedSerializerInFilter rul...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13096#issuecomment-220516852 Can you add the jira ticket somewhere as inline comment in the test case and in the analyzer code?
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/13200#issuecomment-220516804 LGTM
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/13200#issuecomment-220516531 @marmbrus I know you were looking at this. Did you end up going through it?
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user xwu0226 commented on the pull request: https://github.com/apache/spark/pull/13212#issuecomment-220516455 cc @yhuai @hvanhovell @gatorsmile Thanks!
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13212#issuecomment-220516397 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-11827] [SQL] Adding java.math.BigIntege...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10125#issuecomment-220516202 thanks, merging to master and 2.0!
[GitHub] spark pull request: [SPARK-11827] [SQL] Adding java.math.BigIntege...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10125
[GitHub] spark pull request: [SPARK-15075][SPARK-15345][SQL] Clean up Spark...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13200#issuecomment-220515969 **[Test build #58939 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58939/consoleFull)** for PR 13200 at commit [`e4a4bc1`](https://github.com/apache/spark/commit/e4a4bc1f590770ff95f3fb0277b3e0e8050cec72).
[GitHub] spark pull request: [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s)...
GitHub user xwu0226 opened a pull request: https://github.com/apache/spark/pull/13212 [SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively
## What changes were proposed in this pull request?
Currently the command "ADD FILE|JAR" is supported natively in Spark SQL. However, when this command is run, the file/jar is added to resources that cannot be looked up by the "LIST FILE(s)|JAR(s)" command, because the LIST command is passed to the Hive command processor in Spark-SQL, or is simply not supported in Spark-shell. There is no way for users to find out what files/jars have been added to the Spark context. Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli). This PR is to support the following commands: `LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`
### For example:
# LIST FILE(s)
```
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
+----------------------------------------------+

scala> spark.sql("list files").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
+----------------------------------------------+
```
# LIST JAR(s)
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res9: org.apache.spark.sql.DataFrame = [result: int]

scala> spark.sql("list jar TestUDTF.jar").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+

scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
```
## How was this patch tested?
New test cases are added for the Spark-SQL, Spark-shell and SparkContext API code paths.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/xwu0226/spark list_command Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13212.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13212
commit 3866e3dcbfbd9fe0e18ecde3b23bb14757e06a0c Author: xin Wu Date: 2016-05-08T07:06:36Z spark-15206 add testcases for distinct aggregate in having clause following up PR12974
commit 951d3edc412ef3d6f77d70a4dd7dd7add966d7b1 Author: xin Wu Date: 2016-05-08T07:09:44Z Revert "spark-15206 add testcases for distinct aggregate in having clause following up PR12974" This reverts commit 98a1f804d7343ba77731f9aa400c00f1a26c03fe.
commit 5b30cc3c0eb20c134e21942ef96a26e452f9171c Author: xin Wu Date: 2016-05-17T22:09:57Z adding spark native support for LIST FILES/JARS
commit 6396ec1591134ca3fd754a6a2684bc8b81218951 Author: xin Wu Date: 2016-05-17T22:52:31Z update testcase
commit 79e97be7917d23f44f60cc857a471b14cb96831c Author: xin Wu Date: 2016-05-19T07:07:02Z support listing specific file(s)
commit a4dc6164ff51b428dae282aa90042758c4ae33d7 Author: Xin Wu Date: 2016-05-19T07:33:50Z update testcases
commit 688c294060cb00cd6c387591bf700e58bdd3dba8 Author: Xin Wu Date: 2016-05-19T22:57:16Z align with PR 13122
commit a0a76a3c5ff93dbf42f07bebd54b7a3514e87132 Author: Xin Wu Date: 2016-05-19T23:07:32Z code style
commit 923988ac5d21e0c0afc6bf76d21a27e8f46f1246 Author: Xin Wu Date: 2016-05-19T23:11:36Z code style
commit 21b092ab84b22abec93fde1fc1ca177db68d9f0d Author: Xin Wu Date: 2016-05-20T04:16:26Z update comments
[GitHub] spark pull request: [SPARK-15114][SQL] Column name generated by ty...
Github user dilipbiswal commented on the pull request: https://github.com/apache/spark/pull/13045#issuecomment-220515698 cc @cloud-fan Hi Wenchen, I have made the changes per your comments. Could you please look through it when you get a chance? Thanks.
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/13156#issuecomment-220515502 LGTM, pending jenkins
[GitHub] spark pull request: [SPARK-15425][SQL] Disallow cartesian joins by...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/13209#discussion_r63990349

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -348,6 +348,11 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val CARTESIAN_PRODUCT_ENABLED = SQLConfigBuilder("spark.sql.join.cartesian.enabled")
+    .doc("When false, we will throw an error if a query contains a cartesian product")
+    .booleanConf
+    .createWithDefault(false)
+
   val ORDER_BY_ORDINAL = SQLConfigBuilder("spark.sql.orderByOrdinal")
     .doc("When true, the ordinal numbers are treated as the position in the select list. " +
       "When false, the ordinal numbers in order/sort By clause are ignored.")
--- End diff --

no it's not this pr but @sameeragarwal can you fix it while you are at it?
[GitHub] spark pull request: [SPARK-8603][SPARKR] Incorrect file separator ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/13165#issuecomment-220515342

This raises some questions for me.

1. It seems several tests failed. Could you please share your thoughts?
2. I think I can add some tests now, but could you please suggest where I should write the related tests, and maybe rough ideas of the tests I should add?
[GitHub] spark pull request: [SPARK-8603][SPARKR] Incorrect file separator ...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/13165#issuecomment-220515182

@sun-rui @felixcheung Right. It seems I finally made it. I made gists and uploaded a PDF file for the Spark UI. Let me tell you the test results first.

Here is the stdout output for the tests on Windows 7 32bit: [output.msg](https://gist.github.com/HyukjinKwon/6a10719d2ca67e04ece2b23a8f92dc62). Here is the stderr output for the tests on Windows 7 32bit: [output.err](https://gist.github.com/HyukjinKwon/54984d57ee18236d46e965d07b31f77a). Here is the PDF for the [Spark UI](https://drive.google.com/open?id=0B7RfLjRU7QTnVVA2bkVMVFkzNEE).

1. I ran the tests after building Spark on Windows according to [`./R/WINDOWS.md`](https://github.com/apache/spark/blob/master/R/WINDOWS.md).
2. It seems `$HADOOP_HOME` should be set.
3. It seems `winutils.exe` is required (it is included in the official Hadoop binary), even though the tests read files in the local file system.
4. I then ran the tests with the command below:

```bash
cd bin
spark-submit2.cmd --conf spark.hadoop.fs.defualt.name="file:///" ..\R\pkg\tests\run-all.R > output.msg 2> output.err
```
[GitHub] spark pull request: [SPARK-15321] Fix bug where Array[Timestamp] c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13108
[GitHub] spark pull request: [SPARK-15321] Fix bug where Array[Timestamp] c...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/13108#issuecomment-220514860 LGTM, merging to master and 2.0, thanks!
[GitHub] spark pull request: [SPARK-15430][SQL] Fix potential ConcurrentMod...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13211#issuecomment-220514553 **[Test build #58938 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58938/consoleFull)** for PR 13211 at commit [`4d97bf0`](https://github.com/apache/spark/commit/4d97bf093f4f4d41cf530a4c7464532635c2b3fe).
[GitHub] spark pull request: [SPARK-15430][SQL] Fix potential ConcurrentMod...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/13211 [SPARK-15430][SQL] Fix potential ConcurrentModificationException for ListAccumulator

## What changes were proposed in this pull request?

In `ListAccumulator` we create an unmodifiable view of the underlying list. However, this doesn't prevent the underlying list from being modified further. So while we access the unmodifiable view, the underlying list can be modified at the same time, which can cause a `java.util.ConcurrentModificationException`. We have observed such exceptions in recent tests.

To fix it, we can copy the underlying list and then create the unmodifiable view over the copy instead.

## How was this patch tested?

The exception might be difficult to test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 fix-concurrentmodify

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13211.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13211

commit 4d97bf093f4f4d41cf530a4c7464532635c2b3fe
Author: Liang-Chi Hsieh
Date: 2016-05-20T04:15:49Z

    Fix potential ConcurrentModificationException for ListAccumulator.
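The view-versus-copy distinction behind this fix can be sketched in plain Java. This is a minimal, hypothetical illustration of the `Collections.unmodifiableList` pitfall, not the actual `ListAccumulator` code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class UnmodifiableViewDemo {
    public static void main(String[] args) {
        List<String> underlying = new ArrayList<>();

        // An unmodifiable *view* still reflects later writes to the
        // underlying list; iterating it while another thread appends
        // can throw ConcurrentModificationException.
        List<String> view = Collections.unmodifiableList(underlying);

        // The fix: snapshot the underlying list first, then wrap the copy.
        // The snapshot is isolated from subsequent writes.
        List<String> snapshot = Collections.unmodifiableList(new ArrayList<>(underlying));

        underlying.add("value");

        System.out.println(view.size());     // 1 -- the view sees the later write
        System.out.println(snapshot.size()); // 0 -- the snapshot does not
    }
}
```

The copy has a cost proportional to the list size on every read, which is the trade-off accepted here in exchange for a race-free view.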
[GitHub] spark pull request: [SPARK-15367] [SQL] Add refreshTable back
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/13156#issuecomment-220513843 retest this please
[GitHub] spark pull request: [SPARK-15100][DOC] Modified user guide and exa...
Github user GayathriMurali commented on the pull request: https://github.com/apache/spark/pull/13176#issuecomment-220513197 @hhbyyh Can you please help review this? I will resolve the branch conflict along with review comments
[GitHub] spark pull request: [SPARK-15379][SQL] check special invalid date
Github user wangyang1992 commented on the pull request: https://github.com/apache/spark/pull/13169#issuecomment-220512727 @cloud-fan Could you please take a look at this when you have some time? It is a simple fix.
[GitHub] spark pull request: [SPARK-15360][Spark-Submit]Should print spark-...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13163#issuecomment-220512285 **[Test build #58937 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58937/consoleFull)** for PR 13163 at commit [`2941e62`](https://github.com/apache/spark/commit/2941e6273d064376f0e540fa0655c345d9c52461).
[GitHub] spark pull request: [SPARK-15360][Spark-Submit]Should print spark-...
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/13163#discussion_r63988573

--- Diff: launcher/src/test/java/org/apache/spark/launcher/SparkSubmitCommandBuilderSuite.java ---
@@ -59,6 +59,18 @@ public void testClusterCmdBuilder() throws Exception {
   }

   @Test
+  public void testCliHelpAndNoArg() throws Exception {
+    List<String> sparkSubmitArgs = Arrays.asList(parser.HELP);
+    Map<String, String> env = new HashMap<>();
+    List<String> cmd = buildCommand(sparkSubmitArgs, env);
+    assertTrue("--help should be contained in the final cmd.", cmd.contains(parser.HELP));
+
+    List<String> sparkEmptyArgs = Arrays.asList("");
+    cmd = buildCommand(sparkSubmitArgs, env);
--- End diff --

Sorry for this obvious mistake! It is really a stupid mistake. Thanks for your time!
[GitHub] spark pull request: [SPARK-15321] Fix bug where Array[Timestamp] c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13108#issuecomment-220511906

**[Test build #2999 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2999/consoleFull)** for PR 13108 at commit [`387e6c9`](https://github.com/apache/spark/commit/387e6c912191bed1d4d4e09ede92f6ea1cc85a51).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/13210#issuecomment-220511611 cc @marmbrus
[GitHub] spark pull request: [SPARK-15335] [SQL] Implement TRUNCATE TABLE C...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13170#issuecomment-220511333 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58935/ Test PASSed.
[GitHub] spark pull request: [SPARK-15335] [SQL] Implement TRUNCATE TABLE C...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13170#issuecomment-220511331 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-15428][SQL] Disable multiple streaming ...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/13210#discussion_r63988111

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala ---
@@ -55,10 +55,19 @@ object UnsupportedOperationChecker {
       case _: InsertIntoTable =>
         throwError("InsertIntoTable is not supported with streaming DataFrames/Datasets")

-      case Aggregate(_, _, child) if child.isStreaming && outputMode == Append =>
-        throwError(
-          "Aggregations are not supported on streaming DataFrames/Datasets in " +
-            "Append output mode. Consider changing output mode to Update.")
+      case Aggregate(_, _, child) if child.isStreaming =>
+        if (outputMode == Append) {
+          throwError(
+            "Aggregations are not supported on streaming DataFrames/Datasets in " +
+              "Append output mode. Consider changing output mode to Update.")
--- End diff --

I didn't get you. IntelliJ seems to be catching all the uses of the Append object properly.
[GitHub] spark pull request: [SPARK-15335] [SQL] Implement TRUNCATE TABLE C...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13170#issuecomment-220511206

**[Test build #58935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58935/consoleFull)** for PR 13170 at commit [`10377ba`](https://github.com/apache/spark/commit/10377ba78f26d9aa42502d0b5cfeea561ff96162).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.