spark git commit: [SPARK-15122] [SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out
Repository: spark Updated Branches: refs/heads/branch-2.0 4ccc5643f -> 49e666138 [SPARK-15122] [SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out ## What changes were proposed in this pull request? The official TPC-DS 41 query currently fails because it contains a scalar subquery with a disjunctive correlated predicate (the correlated predicates were nested in ORs). This makes the `Analyzer` pull out the entire predicate which is wrong and causes the following (correct) analysis exception: `The correlated scalar subquery can only contain equality predicates` This PR fixes this by first simplifing (or normalizing) the correlated predicates before pulling them out of the subquery. ## How was this patch tested? Manual testing on TPC-DS 41, and added a test to SubquerySuite. Author: Herman van HovellCloses #12954 from hvanhovell/SPARK-15122. (cherry picked from commit df89f1d43d4eaa1dd8a439a8e48bca16b67d5b48) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/49e66613 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/49e66613 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/49e66613 Branch: refs/heads/branch-2.0 Commit: 49e666138b2c74b8145faf7adc6fd090656e5ea0 Parents: 4ccc564 Author: Herman van Hovell Authored: Fri May 6 21:06:03 2016 -0700 Committer: Davies Liu Committed: Fri May 6 21:06:14 2016 -0700 -- .../apache/spark/sql/catalyst/analysis/Analyzer.scala | 4 +++- .../test/scala/org/apache/spark/sql/SubquerySuite.scala | 12 2 files changed, 15 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/49e66613/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 527d5b6..9e9a856 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -26,6 +26,7 @@ import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog} import org.apache.spark.sql.catalyst.encoders.OuterScopes import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.expressions.aggregate._ +import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification import org.apache.spark.sql.catalyst.planning.IntegerIndex import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, _} @@ -958,7 +959,8 @@ class Analyzer( localPredicateReferences -- p.outputSet } - val transformed = sub transformUp { + // Simplify the predicates before pulling them out. + val transformed = BooleanSimplification(sub) transformUp { case f @ Filter(cond, child) => // Find all predicates with an outer reference. 
val (correlated, local) = splitConjunctivePredicates(cond).partition(containsOuter) http://git-wip-us.apache.org/repos/asf/spark/blob/49e66613/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala index 80bb4e0..17ac0c8 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala @@ -281,4 +281,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext { assert(msg1.getMessage.contains( "The correlated scalar subquery can only contain equality predicates")) } + + test("disjunctive correlated scalar subquery") { +checkAnswer( + sql(""" +|select a +|from l +|where (select count(*) +|from r +|where (a = c and d = 2.0) or (a = c and d = 1.0)) > 0 +""".stripMargin), + Row(3) :: Nil) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
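For readers who want to try the fixed query outside the test suite, here is a minimal standalone sketch (Spark 2.0 API). The table names mirror the SubquerySuite fixture, but the rows below are made up, so the exact result may differ from the Row(3) the test expects:

```scala
import org.apache.spark.sql.SparkSession

object DisjunctiveCorrelatedSubqueryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-15122 demo")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the SubquerySuite fixture tables; the rows are made up.
    Seq((1, 2.0), (2, 1.0), (3, 3.0)).toDF("a", "b").createOrReplaceTempView("l")
    Seq((3, 2.0), (3, 1.0)).toDF("c", "d").createOrReplaceTempView("r")

    // Before the fix, the analyzer pulled the whole OR out of the subquery and failed with
    // "The correlated scalar subquery can only contain equality predicates".
    // With BooleanSimplification applied first, the filter becomes
    // (a = c) AND (d = 2.0 OR d = 1.0), so only the equality conjunct is treated as correlated.
    spark.sql(
      """select a
        |from l
        |where (select count(*)
        |       from r
        |       where (a = c and d = 2.0) or (a = c and d = 1.0)) > 0
      """.stripMargin).show()

    spark.stop()
  }
}
```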
spark git commit: [SPARK-15051][SQL] Create a TypedColumn alias
Repository: spark Updated Branches: refs/heads/branch-2.0 f6d7292d1 -> 4ccc5643f [SPARK-15051][SQL] Create a TypedColumn alias ## What changes were proposed in this pull request? Currently when we create an alias against a TypedColumn from user-defined Aggregator(for example: agg(aggSum.toColumn as "a")), spark is using the alias' function from Column( as), the alias function will return a column contains a TypedAggregateExpression, which is unresolved because the inputDeserializer is not defined. Later the aggregator function (agg) will inject the inputDeserializer back to the TypedAggregateExpression, but only if the aggregate columns are TypedColumn, in the above case, the TypedAggregateExpression will remain unresolved because it is under column and caused the problem reported by this jira [15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK). This PR propose to create an alias function for TypedColumn, it will return a TypedColumn. It is using the similar code path as Column's alia function. For the spark build in aggregate function, like max, it is working with alias, for example val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil) Thanks for comments. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Add test cases in DatasetAggregatorSuite.scala run the sql related queries against this patch. Author: Kevin YuCloses #12893 from kevinyu98/spark-15051. (cherry picked from commit 607a27a0d149be049091bcf274a73b8476b36c90) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ccc5643 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ccc5643 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ccc5643 Branch: refs/heads/branch-2.0 Commit: 4ccc5643f90133f5c80514915fd5616c77837f10 Parents: f6d7292 Author: Kevin Yu Authored: Sat May 7 11:13:48 2016 +0800 Committer: Wenchen Fan Committed: Sat May 7 11:14:02 2016 +0800 -- .../main/scala/org/apache/spark/sql/Column.scala | 19 +-- .../spark/sql/DatasetAggregatorSuite.scala | 8 2 files changed, 21 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4ccc5643/sql/core/src/main/scala/org/apache/spark/sql/Column.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala index c58adda..9b8334d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala @@ -68,6 +68,18 @@ class TypedColumn[-T, U]( } new TypedColumn[T, U](newExpr, encoder) } + + /** + * Gives the TypedColumn a name (alias). + * If the current TypedColumn has metadata associated with it, this metadata will be propagated + * to the new column. + * + * @group expr_ops + * @since 2.0.0 + */ + override def name(alias: String): TypedColumn[T, U] = +new TypedColumn[T, U](super.name(alias).expr, encoder) + } /** @@ -910,12 +922,7 @@ class Column(protected[sql] val expr: Expression) extends Logging { * @group expr_ops * @since 1.3.0 */ - def as(alias: Symbol): Column = withExpr { -expr match { - case ne: NamedExpression => Alias(expr, alias.name)(explicitMetadata = Some(ne.metadata)) - case other => Alias(other, alias.name)() -} - } + def as(alias: Symbol): Column = name(alias.name) /** * Gives the column an alias with metadata. 
http://git-wip-us.apache.org/repos/asf/spark/blob/4ccc5643/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala index 6eae3ed..b2a0f3d 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala @@ -232,4 +232,12 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext { "a" -> Seq(1, 2) ) } + + test("spark-15051 alias of aggregator in DataFrame/Dataset[Row]") { +val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") +checkAnswer(df1.agg(RowAgg.toColumn as "b"), Row(6) :: Nil) + +val df2 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") +checkAnswer(df2.agg(RowAgg.toColumn as "b").select("b"),
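A self-contained sketch of the user-facing pattern the new test exercises. The suite's RowAgg aggregator is not part of this diff, so SumFirstCol below is a hypothetical stand-in that sums the first column of each Row:

```scala
import org.apache.spark.sql.{Encoder, Encoders, Row, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Stand-in for the suite's RowAgg (not shown in this diff): sums the first column of each Row.
object SumFirstCol extends Aggregator[Row, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, r: Row): Long = b + r.getInt(0)
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

object TypedColumnAliasDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-15051 demo").getOrCreate()
    import spark.implicits._

    val df = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
    // With the new TypedColumn.name override, aliasing keeps the TypedColumn (and its encoder),
    // so agg() can still inject the input deserializer and the aliased column resolves.
    df.agg(SumFirstCol.toColumn as "b").select("b").show()  // expected: 6

    spark.stop()
  }
}
```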
spark git commit: [SPARK-15051][SQL] Create a TypedColumn alias
Repository: spark Updated Branches: refs/heads/master a21a3bbe6 -> 607a27a0d [SPARK-15051][SQL] Create a TypedColumn alias ## What changes were proposed in this pull request? Currently when we create an alias against a TypedColumn from user-defined Aggregator(for example: agg(aggSum.toColumn as "a")), spark is using the alias' function from Column( as), the alias function will return a column contains a TypedAggregateExpression, which is unresolved because the inputDeserializer is not defined. Later the aggregator function (agg) will inject the inputDeserializer back to the TypedAggregateExpression, but only if the aggregate columns are TypedColumn, in the above case, the TypedAggregateExpression will remain unresolved because it is under column and caused the problem reported by this jira [15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK). This PR propose to create an alias function for TypedColumn, it will return a TypedColumn. It is using the similar code path as Column's alia function. For the spark build in aggregate function, like max, it is working with alias, for example val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil) Thanks for comments. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Add test cases in DatasetAggregatorSuite.scala run the sql related queries against this patch. Author: Kevin YuCloses #12893 from kevinyu98/spark-15051. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/607a27a0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/607a27a0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/607a27a0 Branch: refs/heads/master Commit: 607a27a0d149be049091bcf274a73b8476b36c90 Parents: a21a3bb Author: Kevin Yu Authored: Sat May 7 11:13:48 2016 +0800 Committer: Wenchen Fan Committed: Sat May 7 11:13:48 2016 +0800 -- .../main/scala/org/apache/spark/sql/Column.scala | 19 +-- .../spark/sql/DatasetAggregatorSuite.scala | 8 2 files changed, 21 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/607a27a0/sql/core/src/main/scala/org/apache/spark/sql/Column.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala index c58adda..9b8334d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala @@ -68,6 +68,18 @@ class TypedColumn[-T, U]( } new TypedColumn[T, U](newExpr, encoder) } + + /** + * Gives the TypedColumn a name (alias). + * If the current TypedColumn has metadata associated with it, this metadata will be propagated + * to the new column. + * + * @group expr_ops + * @since 2.0.0 + */ + override def name(alias: String): TypedColumn[T, U] = +new TypedColumn[T, U](super.name(alias).expr, encoder) + } /** @@ -910,12 +922,7 @@ class Column(protected[sql] val expr: Expression) extends Logging { * @group expr_ops * @since 1.3.0 */ - def as(alias: Symbol): Column = withExpr { -expr match { - case ne: NamedExpression => Alias(expr, alias.name)(explicitMetadata = Some(ne.metadata)) - case other => Alias(other, alias.name)() -} - } + def as(alias: Symbol): Column = name(alias.name) /** * Gives the column an alias with metadata. 
http://git-wip-us.apache.org/repos/asf/spark/blob/607a27a0/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala index 6eae3ed..b2a0f3d 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala @@ -232,4 +232,12 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext { "a" -> Seq(1, 2) ) } + + test("spark-15051 alias of aggregator in DataFrame/Dataset[Row]") { +val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") +checkAnswer(df1.agg(RowAgg.toColumn as "b"), Row(6) :: Nil) + +val df2 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") +checkAnswer(df2.agg(RowAgg.toColumn as "b").select("b"), Row(6) :: Nil) + } } - To unsubscribe, e-mail:
spark git commit: [SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments
Repository: spark Updated Branches: refs/heads/branch-2.0 d98dd72e7 -> f6d7292d1 [SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments ## What changes were proposed in this pull request? Remove the comment, since it no longer applies; see the discussion here (https://github.com/apache/spark/pull/12865#discussion-diff-61946906) Author: Sandeep Singh Closes #12953 from techaddict/SPARK-15087-FOLLOW-UP. (cherry picked from commit a21a3bbe6931e162c53a61daff1ef428fb802b8a) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6d7292d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6d7292d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6d7292d Branch: refs/heads/branch-2.0 Commit: f6d7292d1c46ba04cba72ae798bdffbdfd97aa53 Parents: d98dd72 Author: Sandeep Singh Authored: Sat May 7 11:10:14 2016 +0800 Committer: Wenchen Fan Committed: Sat May 7 11:10:44 2016 +0800 -- .../scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala| 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f6d7292d/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala index 8ce8fb1..371fb86 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala @@ -389,11 +389,6 @@ private[spark] class TaskSchedulerImpl( // (taskId, stageId, stageAttemptId, accumUpdates) val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] = synchronized { accumUpdates.flatMap { case (id, updates) => -// We should call `acc.value` here as we are at driver side now. However, the RPC framework -// optimizes local message delivery so that messages do not need to de serialized and -// deserialized. This brings trouble to the accumulator framework, which depends on -// serialization to set the `atDriverSide` flag. Here we call `acc.localValue` instead to -// be more robust about this issue. val accInfos = updates.map(acc => acc.toInfo(Some(acc.value), None)) taskIdToTaskSetManager.get(id).map { taskSetMgr => (id, taskSetMgr.stageId, taskSetMgr.taskSet.stageAttemptId, accInfos)

spark git commit: [SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments
Repository: spark Updated Branches: refs/heads/master cc95f1ed5 -> a21a3bbe6 [SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments ## What changes were proposed in this pull request? Remove the comment, since it no longer applies; see the discussion here (https://github.com/apache/spark/pull/12865#discussion-diff-61946906) Author: Sandeep Singh Closes #12953 from techaddict/SPARK-15087-FOLLOW-UP. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a21a3bbe Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a21a3bbe Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a21a3bbe Branch: refs/heads/master Commit: a21a3bbe6931e162c53a61daff1ef428fb802b8a Parents: cc95f1e Author: Sandeep Singh Authored: Sat May 7 11:10:14 2016 +0800 Committer: Wenchen Fan Committed: Sat May 7 11:10:14 2016 +0800 -- .../scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala| 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a21a3bbe/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala index 8ce8fb1..371fb86 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala @@ -389,11 +389,6 @@ private[spark] class TaskSchedulerImpl( // (taskId, stageId, stageAttemptId, accumUpdates) val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] = synchronized { accumUpdates.flatMap { case (id, updates) => -// We should call `acc.value` here as we are at driver side now. However, the RPC framework -// optimizes local message delivery so that messages do not need to de serialized and -// deserialized. This brings trouble to the accumulator framework, which depends on -// serialization to set the `atDriverSide` flag. Here we call `acc.localValue` instead to -// be more robust about this issue. val accInfos = updates.map(acc => acc.toInfo(Some(acc.value), None)) taskIdToTaskSetManager.get(id).map { taskSetMgr => (id, taskSetMgr.stageId, taskSetMgr.taskSet.stageAttemptId, accInfos)
spark git commit: [SPARK-1239] Improve fetching of map output statuses
Repository: spark Updated Branches: refs/heads/branch-2.0 dc1562e97 -> d98dd72e7 [SPARK-1239] Improve fetching of map output statuses The main issue we are trying to solve is the memory bloat of the Driver when tasks request the map output statuses. This means with a large number of tasks you either need a huge amount of memory on Driver or you have to repartition to smaller number. This makes it really difficult to run over say 5 tasks. The main issues that cause the memory bloat are: 1) no flow control on sending the map output status responses. We serialize the map status output and then hand off to netty to send. netty is sending asynchronously and it can't send them fast enough to keep up with incoming requests so we end up with lots of copies of the serialized map output statuses sitting there and this causes huge bloat when you have 10's of thousands of tasks and map output status is in the 10's of MB. 2) When initial reduce tasks are started up, they all request the map output statuses from the Driver. These requests are handled by multiple threads in parallel so even though we check to see if we have a cached version, initially when we don't have a cached version yet, many of initial requests can all end up serializing the exact same map output statuses. This patch does a couple of things: - When the map output status size is over a threshold (default 512K) then it uses broadcast to send the map statuses. This means we no longer serialize a large map output status and thus we don't have issues with memory bloat. the messages sizes are now in the 300-400 byte range and the map status output are broadcast. If its under the threadshold it sends it as before, the message contains the DIRECT indicator now. - synchronize the incoming requests to allow one thread to cache the serialized output and broadcast the map output status that can then be used by everyone else. This ensures we don't create multiple broadcast variables when we don't need to. To ensure this happens I added a second thread pool which the Dispatcher hands the requests to so that those threads can block without blocking the main dispatcher threads (which would cause things like heartbeats and such not to come through) Note that some of design and code was contributed by mridulm ## How was this patch tested? Unit tests and a lot of manually testing. Ran with akka and netty rpc. Ran with both dynamic allocation on and off. one of the large jobs I used to test this was a join of 15TB of data. it had 200,000 map tasks, and 20,000 reduce tasks. Executors ranged from 200 to 2000. This job ran successfully with 5GB of memory on the driver with these changes. Without these changes I was using 20GB and only had 500 reduce tasks. The job has 50mb of serialized map output statuses and took roughly the same amount of time for the executors to get the map output statuses as before. Ran a variety of other jobs, from large wordcounts to small ones not using broadcasts. Author: Thomas GravesCloses #12113 from tgravescs/SPARK-1239. 
(cherry picked from commit cc95f1ed5fdf2566bcefe8d10116eee544cf9184) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d98dd72e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d98dd72e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d98dd72e Branch: refs/heads/branch-2.0 Commit: d98dd72e7baeb59eacec4fefd66397513a607b2f Parents: dc1562e Author: Thomas Graves Authored: Fri May 6 19:31:26 2016 -0700 Committer: Davies Liu Committed: Fri May 6 19:31:35 2016 -0700 -- .../org/apache/spark/MapOutputTracker.scala | 250 +++ .../main/scala/org/apache/spark/SparkEnv.scala | 6 +- .../apache/spark/MapOutputTrackerSuite.scala| 99 +--- .../spark/scheduler/DAGSchedulerSuite.scala | 7 +- .../storage/BlockManagerReplicationSuite.scala | 4 +- .../spark/storage/BlockManagerSuite.scala | 4 +- .../streaming/ReceivedBlockHandlerSuite.scala | 4 +- 7 files changed, 290 insertions(+), 84 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d98dd72e/core/src/main/scala/org/apache/spark/MapOutputTracker.scala -- diff --git a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala index 3a5caa3..6bd9502 100644 --- a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala +++ b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala @@ -18,13 +18,15 @@ package org.apache.spark import java.io._ -import
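A much-simplified sketch of the size-based decision described above. This is not the actual MapOutputTracker code; the config key and the 512k default are taken from the commit description and should be treated as assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object MapStatusReplySketch {
  // Serialize the map output statuses once; if the payload is over the threshold, broadcast it
  // and send only a small envelope (the BROADCAST case), otherwise send the bytes directly
  // in the RPC reply (the DIRECT case), as before the patch.
  def prepareReply(
      serializedStatuses: Array[Byte],
      sc: SparkContext,
      conf: SparkConf): Either[Array[Byte], Broadcast[Array[Byte]]] = {
    // Assumed config key/default per the commit description.
    val minSizeForBroadcast =
      conf.getSizeAsBytes("spark.shuffle.mapOutput.minSizeForBroadcast", "512k")
    if (serializedStatuses.length >= minSizeForBroadcast) {
      // Large payload: broadcast once and reuse the same broadcast for every executor request.
      Right(sc.broadcast(serializedStatuses))
    } else {
      // Small payload: keep the old behavior and ship the serialized bytes directly.
      Left(serializedStatuses)
    }
  }
}
```

The synchronization the patch adds around this step is what ensures only one thread serializes and broadcasts a given epoch's statuses, so concurrent requests reuse the cached result instead of each building their own copy.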
spark git commit: [SPARK-1239] Improve fetching of map output statuses
Repository: spark Updated Branches: refs/heads/master f7b7ef416 -> cc95f1ed5 [SPARK-1239] Improve fetching of map output statuses The main issue we are trying to solve is the memory bloat of the Driver when tasks request the map output statuses. This means with a large number of tasks you either need a huge amount of memory on Driver or you have to repartition to smaller number. This makes it really difficult to run over say 5 tasks. The main issues that cause the memory bloat are: 1) no flow control on sending the map output status responses. We serialize the map status output and then hand off to netty to send. netty is sending asynchronously and it can't send them fast enough to keep up with incoming requests so we end up with lots of copies of the serialized map output statuses sitting there and this causes huge bloat when you have 10's of thousands of tasks and map output status is in the 10's of MB. 2) When initial reduce tasks are started up, they all request the map output statuses from the Driver. These requests are handled by multiple threads in parallel so even though we check to see if we have a cached version, initially when we don't have a cached version yet, many of initial requests can all end up serializing the exact same map output statuses. This patch does a couple of things: - When the map output status size is over a threshold (default 512K) then it uses broadcast to send the map statuses. This means we no longer serialize a large map output status and thus we don't have issues with memory bloat. the messages sizes are now in the 300-400 byte range and the map status output are broadcast. If its under the threadshold it sends it as before, the message contains the DIRECT indicator now. - synchronize the incoming requests to allow one thread to cache the serialized output and broadcast the map output status that can then be used by everyone else. This ensures we don't create multiple broadcast variables when we don't need to. To ensure this happens I added a second thread pool which the Dispatcher hands the requests to so that those threads can block without blocking the main dispatcher threads (which would cause things like heartbeats and such not to come through) Note that some of design and code was contributed by mridulm ## How was this patch tested? Unit tests and a lot of manually testing. Ran with akka and netty rpc. Ran with both dynamic allocation on and off. one of the large jobs I used to test this was a join of 15TB of data. it had 200,000 map tasks, and 20,000 reduce tasks. Executors ranged from 200 to 2000. This job ran successfully with 5GB of memory on the driver with these changes. Without these changes I was using 20GB and only had 500 reduce tasks. The job has 50mb of serialized map output statuses and took roughly the same amount of time for the executors to get the map output statuses as before. Ran a variety of other jobs, from large wordcounts to small ones not using broadcasts. Author: Thomas GravesCloses #12113 from tgravescs/SPARK-1239. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc95f1ed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc95f1ed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc95f1ed Branch: refs/heads/master Commit: cc95f1ed5fdf2566bcefe8d10116eee544cf9184 Parents: f7b7ef4 Author: Thomas Graves Authored: Fri May 6 19:31:26 2016 -0700 Committer: Davies Liu Committed: Fri May 6 19:31:26 2016 -0700 -- .../org/apache/spark/MapOutputTracker.scala | 250 +++ .../main/scala/org/apache/spark/SparkEnv.scala | 6 +- .../apache/spark/MapOutputTrackerSuite.scala| 99 +--- .../spark/scheduler/DAGSchedulerSuite.scala | 7 +- .../storage/BlockManagerReplicationSuite.scala | 4 +- .../spark/storage/BlockManagerSuite.scala | 4 +- .../streaming/ReceivedBlockHandlerSuite.scala | 4 +- 7 files changed, 290 insertions(+), 84 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cc95f1ed/core/src/main/scala/org/apache/spark/MapOutputTracker.scala -- diff --git a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala index 3a5caa3..6bd9502 100644 --- a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala +++ b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala @@ -18,13 +18,15 @@ package org.apache.spark import java.io._ -import java.util.concurrent.ConcurrentHashMap +import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue, ThreadPoolExecutor}
spark git commit: [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths
Repository: spark Updated Branches: refs/heads/branch-2.0 22f9f5f97 -> dc1562e97 [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths ## What changes were proposed in this pull request? Lets says there are json files in the following directories structure ``` xyz/file0.json xyz/subdir1/file1.json xyz/subdir2/file2.json xyz/subdir1/subsubdir1/file3.json ``` `sqlContext.read.json("xyz")` should read only file0.json according to behavior in Spark 1.6.1. However in current master, all the 4 files are read. The fix is to make FileCatalog return only the children files of the given path if there is not partitioning detected (instead of all the recursive list of files). Closes #12774 ## How was this patch tested? unit tests Author: Tathagata DasCloses #12856 from tdas/SPARK-14997. (cherry picked from commit f7b7ef41662d7d02fc4f834f3c6c4ee8802e949c) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc1562e9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc1562e9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc1562e9 Branch: refs/heads/branch-2.0 Commit: dc1562e97d570238f8532b3f8051e8df90722732 Parents: 22f9f5f Author: Tathagata Das Authored: Fri May 6 15:04:16 2016 -0700 Committer: Yin Huai Committed: Fri May 6 15:04:27 2016 -0700 -- .../PartitioningAwareFileCatalog.scala | 24 +- .../datasources/FileCatalogSuite.scala | 68 ++ .../ParquetPartitionDiscoverySuite.scala| 47 .../sql/streaming/FileStreamSourceSuite.scala | 15 +- .../sql/sources/HadoopFsRelationTest.scala | 232 +-- 5 files changed, 356 insertions(+), 30 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dc1562e9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala index 2c44b39..5f04a6c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala @@ -61,7 +61,29 @@ abstract class PartitioningAwareFileCatalog( } } - override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq + override def allFiles(): Seq[FileStatus] = { +if (partitionSpec().partitionColumns.isEmpty) { + // For each of the input paths, get the list of files inside them + paths.flatMap { path => +// Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel). +val fs = path.getFileSystem(hadoopConf) +val qualifiedPath = fs.makeQualified(path) + +// There are three cases possible with each path +// 1. The path is a directory and has children files in it. Then it must be present in +//leafDirToChildrenFiles as those children files will have been found as leaf files. +//Find its children files from leafDirToChildrenFiles and include them. +// 2. The path is a file, then it will be present in leafFiles. Include this path. +// 3. The path is a directory, but has no children files. Do not include this path. 
+ +leafDirToChildrenFiles.get(qualifiedPath) + .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } + .getOrElse(Array.empty) + } +} else { + leafFiles.values.toSeq +} + } protected def inferPartitioning(): PartitionSpec = { // We use leaf dirs containing data files to discover the schema. http://git-wip-us.apache.org/repos/asf/spark/blob/dc1562e9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala new file mode 100644 index 000..dab5c76 --- /dev/null +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright
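An illustrative sketch of the restored listing behavior (hypothetical paths, Spark 2.0 API):

```scala
import org.apache.spark.sql.SparkSession

object NonPartitionedListingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-14997 demo").getOrCreate()

    // Directory layout from the PR description (paths are hypothetical):
    //   xyz/file0.json
    //   xyz/subdir1/file1.json
    //   xyz/subdir2/file2.json
    //   xyz/subdir1/subsubdir1/file3.json
    // With no partition columns detected, only the direct children of "xyz" are read,
    // i.e. file0.json -- matching the Spark 1.6.1 behavior this patch restores.
    val df = spark.read.json("xyz")
    df.show()

    spark.stop()
  }
}
```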
spark git commit: [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths
Repository: spark Updated Branches: refs/heads/master e20cd9f4c -> f7b7ef416 [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths ## What changes were proposed in this pull request? Lets says there are json files in the following directories structure ``` xyz/file0.json xyz/subdir1/file1.json xyz/subdir2/file2.json xyz/subdir1/subsubdir1/file3.json ``` `sqlContext.read.json("xyz")` should read only file0.json according to behavior in Spark 1.6.1. However in current master, all the 4 files are read. The fix is to make FileCatalog return only the children files of the given path if there is not partitioning detected (instead of all the recursive list of files). Closes #12774 ## How was this patch tested? unit tests Author: Tathagata DasCloses #12856 from tdas/SPARK-14997. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7b7ef41 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7b7ef41 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7b7ef41 Branch: refs/heads/master Commit: f7b7ef41662d7d02fc4f834f3c6c4ee8802e949c Parents: e20cd9f Author: Tathagata Das Authored: Fri May 6 15:04:16 2016 -0700 Committer: Yin Huai Committed: Fri May 6 15:04:16 2016 -0700 -- .../PartitioningAwareFileCatalog.scala | 24 +- .../datasources/FileCatalogSuite.scala | 68 ++ .../ParquetPartitionDiscoverySuite.scala| 47 .../sql/streaming/FileStreamSourceSuite.scala | 15 +- .../sql/sources/HadoopFsRelationTest.scala | 232 +-- 5 files changed, 356 insertions(+), 30 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f7b7ef41/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala index 2c44b39..5f04a6c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala @@ -61,7 +61,29 @@ abstract class PartitioningAwareFileCatalog( } } - override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq + override def allFiles(): Seq[FileStatus] = { +if (partitionSpec().partitionColumns.isEmpty) { + // For each of the input paths, get the list of files inside them + paths.flatMap { path => +// Make the path qualified (consistent with listLeafFiles and listLeafFilesInParallel). +val fs = path.getFileSystem(hadoopConf) +val qualifiedPath = fs.makeQualified(path) + +// There are three cases possible with each path +// 1. The path is a directory and has children files in it. Then it must be present in +//leafDirToChildrenFiles as those children files will have been found as leaf files. +//Find its children files from leafDirToChildrenFiles and include them. +// 2. The path is a file, then it will be present in leafFiles. Include this path. +// 3. The path is a directory, but has no children files. Do not include this path. + +leafDirToChildrenFiles.get(qualifiedPath) + .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } + .getOrElse(Array.empty) + } +} else { + leafFiles.values.toSeq +} + } protected def inferPartitioning(): PartitionSpec = { // We use leaf dirs containing data files to discover the schema. 
http://git-wip-us.apache.org/repos/asf/spark/blob/f7b7ef41/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala new file mode 100644 index 000..dab5c76 --- /dev/null +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this
spark git commit: [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover
Repository: spark Updated Branches: refs/heads/branch-2.0 1e6b158b1 -> 22f9f5f97 [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover ## What changes were proposed in this pull request? This PR continues the work from #11871 with the following changes: * load English stopwords as default * covert stopwords to list in Python * update some tests and doc ## How was this patch tested? Unit tests. Closes #11871 cc: burakkose srowen Author: Burak KöseAuthor: Xiangrui Meng Author: Burak KOSE Closes #12843 from mengxr/SPARK-14050. (cherry picked from commit e20cd9f4ce977739ce80a2c39f8ebae5e53f72f6) Signed-off-by: Xiangrui Meng Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/22f9f5f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/22f9f5f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/22f9f5f9 Branch: refs/heads/branch-2.0 Commit: 22f9f5f97221a128f7a91d347fa2ace7de7045aa Parents: 1e6b158 Author: Burak Köse Authored: Fri May 6 13:58:12 2016 -0700 Committer: Xiangrui Meng Committed: Fri May 6 13:58:24 2016 -0700 -- licenses/LICENSE-postgresql.txt | 24 ++ .../apache/spark/ml/feature/stopwords/README| 12 + .../spark/ml/feature/stopwords/danish.txt | 94 ++ .../apache/spark/ml/feature/stopwords/dutch.txt | 101 ++ .../spark/ml/feature/stopwords/english.txt | 153 + .../spark/ml/feature/stopwords/finnish.txt | 235 ++ .../spark/ml/feature/stopwords/french.txt | 155 + .../spark/ml/feature/stopwords/german.txt | 231 ++ .../spark/ml/feature/stopwords/hungarian.txt| 199 .../spark/ml/feature/stopwords/italian.txt | 279 + .../spark/ml/feature/stopwords/norwegian.txt| 176 +++ .../spark/ml/feature/stopwords/portuguese.txt | 203 .../spark/ml/feature/stopwords/russian.txt | 151 + .../spark/ml/feature/stopwords/spanish.txt | 313 +++ .../spark/ml/feature/stopwords/swedish.txt | 114 +++ .../spark/ml/feature/stopwords/turkish.txt | 53 .../spark/ml/feature/StopWordsRemover.scala | 106 +++ .../ml/feature/StopWordsRemoverSuite.scala | 57 +++- python/pyspark/ml/feature.py| 38 ++- python/pyspark/ml/tests.py | 7 + 20 files changed, 2614 insertions(+), 87 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/22f9f5f9/licenses/LICENSE-postgresql.txt -- diff --git a/licenses/LICENSE-postgresql.txt b/licenses/LICENSE-postgresql.txt new file mode 100644 index 000..515bf9a --- /dev/null +++ b/licenses/LICENSE-postgresql.txt @@ -0,0 +1,24 @@ +PostgreSQL Database Management System +(formerly known as Postgres, then as Postgres95) + +Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group + +Portions Copyright (c) 1994, The Regents of the University of California + +Permission to use, copy, modify, and distribute this software and its +documentation for any purpose, without fee, and without a written agreement +is hereby granted, provided that the above copyright notice and this +paragraph and the following two paragraphs appear in all copies. + +IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR +DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING +LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS +DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY +AND FITNESS FOR A PARTICULAR PURPOSE. 
THE SOFTWARE PROVIDED HEREUNDER IS +ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO +PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. + http://git-wip-us.apache.org/repos/asf/spark/blob/22f9f5f9/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README -- diff --git a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README new file mode 100755 index 000..ec08a50 --- /dev/null +++ b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README @@ -0,0 +1,12 @@ +Stopwords Corpus + +This corpus contains lists of stop words for several languages. These +are high-frequency grammatical words
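A minimal Scala usage sketch of the updated transformer (English stop words are loaded by default; the loadDefaultStopWords helper referenced in the comment is assumed to be the language-switching entry point added by this change):

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

object StopWordsRemoverDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-14050 demo").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (0, Seq("I", "saw", "the", "red", "balloon")),
      (1, Seq("Mary", "had", "a", "little", "lamb"))
    ).toDF("id", "raw")

    // English stop words are loaded by default.
    val remover = new StopWordsRemover()
      .setInputCol("raw")
      .setOutputCol("filtered")

    // To switch language (assumed helper added by this change):
    // remover.setStopWords(StopWordsRemover.loadDefaultStopWords("turkish"))

    remover.transform(df).show(truncate = false)

    spark.stop()
  }
}
```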
spark git commit: [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover
Repository: spark Updated Branches: refs/heads/master 5c8fad7b9 -> e20cd9f4c [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover ## What changes were proposed in this pull request? This PR continues the work from #11871 with the following changes: * load English stopwords as default * covert stopwords to list in Python * update some tests and doc ## How was this patch tested? Unit tests. Closes #11871 cc: burakkose srowen Author: Burak KöseAuthor: Xiangrui Meng Author: Burak KOSE Closes #12843 from mengxr/SPARK-14050. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e20cd9f4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e20cd9f4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e20cd9f4 Branch: refs/heads/master Commit: e20cd9f4ce977739ce80a2c39f8ebae5e53f72f6 Parents: 5c8fad7 Author: Burak Köse Authored: Fri May 6 13:58:12 2016 -0700 Committer: Xiangrui Meng Committed: Fri May 6 13:58:12 2016 -0700 -- licenses/LICENSE-postgresql.txt | 24 ++ .../apache/spark/ml/feature/stopwords/README| 12 + .../spark/ml/feature/stopwords/danish.txt | 94 ++ .../apache/spark/ml/feature/stopwords/dutch.txt | 101 ++ .../spark/ml/feature/stopwords/english.txt | 153 + .../spark/ml/feature/stopwords/finnish.txt | 235 ++ .../spark/ml/feature/stopwords/french.txt | 155 + .../spark/ml/feature/stopwords/german.txt | 231 ++ .../spark/ml/feature/stopwords/hungarian.txt| 199 .../spark/ml/feature/stopwords/italian.txt | 279 + .../spark/ml/feature/stopwords/norwegian.txt| 176 +++ .../spark/ml/feature/stopwords/portuguese.txt | 203 .../spark/ml/feature/stopwords/russian.txt | 151 + .../spark/ml/feature/stopwords/spanish.txt | 313 +++ .../spark/ml/feature/stopwords/swedish.txt | 114 +++ .../spark/ml/feature/stopwords/turkish.txt | 53 .../spark/ml/feature/StopWordsRemover.scala | 106 +++ .../ml/feature/StopWordsRemoverSuite.scala | 57 +++- python/pyspark/ml/feature.py| 38 ++- python/pyspark/ml/tests.py | 7 + 20 files changed, 2614 insertions(+), 87 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e20cd9f4/licenses/LICENSE-postgresql.txt -- diff --git a/licenses/LICENSE-postgresql.txt b/licenses/LICENSE-postgresql.txt new file mode 100644 index 000..515bf9a --- /dev/null +++ b/licenses/LICENSE-postgresql.txt @@ -0,0 +1,24 @@ +PostgreSQL Database Management System +(formerly known as Postgres, then as Postgres95) + +Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group + +Portions Copyright (c) 1994, The Regents of the University of California + +Permission to use, copy, modify, and distribute this software and its +documentation for any purpose, without fee, and without a written agreement +is hereby granted, provided that the above copyright notice and this +paragraph and the following two paragraphs appear in all copies. + +IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR +DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING +LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS +DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY +AND FITNESS FOR A PARTICULAR PURPOSE. 
THE SOFTWARE PROVIDED HEREUNDER IS +ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO +PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. + http://git-wip-us.apache.org/repos/asf/spark/blob/e20cd9f4/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README -- diff --git a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README new file mode 100755 index 000..ec08a50 --- /dev/null +++ b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README @@ -0,0 +1,12 @@ +Stopwords Corpus + +This corpus contains lists of stop words for several languages. These +are high-frequency grammatical words which are usually ignored in text +retrieval applications. + +They were obtained from:
spark git commit: [SPARK-13566][CORE] Avoid deadlock between BlockManager and Executor Thread
Repository: spark Updated Branches: refs/heads/branch-1.6 a3aa22a59 -> ab006523b [SPARK-13566][CORE] Avoid deadlock between BlockManager and Executor Thread Temp patch for branch 1.6ï¼ avoid deadlock between BlockManager and Executor Thread. Author: cenyuhaiCloses #11546 from cenyuhai/SPARK-13566. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab006523 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab006523 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab006523 Branch: refs/heads/branch-1.6 Commit: ab006523b840b1d2dbf3f5ff0a238558e7665a1e Parents: a3aa22a Author: cenyuhai Authored: Fri May 6 13:50:49 2016 -0700 Committer: Andrew Or Committed: Fri May 6 13:50:49 2016 -0700 -- .../org/apache/spark/executor/Executor.scala| 12 ++ .../org/apache/spark/storage/BlockManager.scala | 192 --- .../spark/storage/BlockManagerSuite.scala | 38 3 files changed, 170 insertions(+), 72 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ab006523/core/src/main/scala/org/apache/spark/executor/Executor.scala -- diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala index ab5bde5..b248e12 100644 --- a/core/src/main/scala/org/apache/spark/executor/Executor.scala +++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala @@ -218,6 +218,7 @@ private[spark] class Executor( threwException = false res } finally { + val releasedLocks = env.blockManager.releaseAllLocksForTask(taskId) val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory() if (freedMemory > 0) { val errMsg = s"Managed memory leak detected; size = $freedMemory bytes, TID = $taskId" @@ -227,6 +228,17 @@ private[spark] class Executor( logError(errMsg) } } + + if (releasedLocks.nonEmpty) { +val errMsg = + s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" + + releasedLocks.mkString("[", ", ", "]") +if (conf.getBoolean("spark.storage.exceptionOnPinLeak", false) && !threwException) { + throw new SparkException(errMsg) +} else { + logError(errMsg) +} + } } val taskFinish = System.currentTimeMillis() http://git-wip-us.apache.org/repos/asf/spark/blob/ab006523/core/src/main/scala/org/apache/spark/storage/BlockManager.scala -- diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala index 538272d..288f756 100644 --- a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala +++ b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala @@ -19,12 +19,14 @@ package org.apache.spark.storage import java.io._ import java.nio.{ByteBuffer, MappedByteBuffer} +import java.util.concurrent.ConcurrentHashMap import scala.collection.mutable.{ArrayBuffer, HashMap} import scala.concurrent.duration._ import scala.concurrent.{Await, ExecutionContext, Future} import scala.util.Random import scala.util.control.NonFatal +import scala.collection.JavaConverters._ import sun.nio.ch.DirectBuffer @@ -65,7 +67,7 @@ private[spark] class BlockManager( val master: BlockManagerMaster, defaultSerializer: Serializer, val conf: SparkConf, -memoryManager: MemoryManager, +val memoryManager: MemoryManager, mapOutputTracker: MapOutputTracker, shuffleManager: ShuffleManager, blockTransferService: BlockTransferService, @@ -164,6 +166,11 @@ private[spark] class BlockManager( * loaded yet. 
*/ private lazy val compressionCodec: CompressionCodec = CompressionCodec.createCodec(conf) + // Blocks are removing by another thread + val pendingToRemove = new ConcurrentHashMap[BlockId, Long]() + + private val NON_TASK_WRITER = -1024L + /** * Initializes the BlockManager with the given appId. This is not performed in the constructor as * the appId may not be known at BlockManager instantiation time (in particular for the driver, @@ -1025,54 +1032,58 @@ private[spark] class BlockManager( val info = blockInfo.get(blockId).orNull // If the block has not already been dropped -if (info != null) { - info.synchronized { -// required ? As of now, this will be invoked only for blocks which are ready -// But in case this changes in future, adding for
spark git commit: [SPARK-15108][SQL] Describe Permanent UDTF
Repository: spark Updated Branches: refs/heads/master 76ad04d9a -> 5c8fad7b9 [SPARK-15108][SQL] Describe Permanent UDTF What changes were proposed in this pull request? When Describe a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry. This PR is to correct it. If the function is not in the functionRegistry, we will check the catalog for collecting the information of the UDTF function. How was this patch tested? Added test cases to verify the results Author: gatorsmileCloses #12885 from gatorsmile/showFunction. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c8fad7b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c8fad7b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c8fad7b Branch: refs/heads/master Commit: 5c8fad7b9bfd6677111a8e27e2574f82b04ec479 Parents: 76ad04d Author: gatorsmile Authored: Fri May 6 11:43:07 2016 -0700 Committer: Yin Huai Committed: Fri May 6 11:43:07 2016 -0700 -- .../catalyst/analysis/NoSuchItemException.scala | 7 ++- .../sql/catalyst/catalog/SessionCatalog.scala | 8 ++- .../spark/sql/catalyst/parser/AstBuilder.scala | 6 +-- .../sql/catalyst/plans/logical/commands.scala | 4 +- .../analysis/UnsupportedOperationsSuite.scala | 3 +- .../sql/catalyst/parser/PlanParserSuite.scala | 12 +++-- .../spark/sql/execution/command/functions.scala | 20 .../org/apache/spark/sql/SQLQuerySuite.scala| 2 +- .../thriftserver/HiveThriftServer2Suites.scala | 4 +- .../spark/sql/hive/client/HiveClient.scala | 2 +- .../sql/hive/execution/SQLQuerySuite.scala | 54 ++-- 11 files changed, 91 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5c8fad7b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala index 2412ec4..ff13bce 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala @@ -37,5 +37,10 @@ class NoSuchPartitionException( extends AnalysisException( s"Partition not found in table '$table' database '$db':\n" + spec.mkString("\n")) -class NoSuchFunctionException(db: String, func: String) +class NoSuchPermanentFunctionException(db: String, func: String) extends AnalysisException(s"Function '$func' not found in database '$db'") + +class NoSuchFunctionException(db: String, func: String) + extends AnalysisException( +s"Undefined function: '$func'. 
This function is neither a registered temporary function nor " + +s"a permanent function registered in the database '$db'.") http://git-wip-us.apache.org/repos/asf/spark/blob/5c8fad7b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index 7127707..9918bce 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala @@ -28,7 +28,7 @@ import org.apache.spark.internal.Logging import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.{CatalystConf, SimpleCatalystConf} import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} -import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, NoSuchFunctionException, SimpleFunctionRegistry} +import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, NoSuchFunctionException, NoSuchPermanentFunctionException, SimpleFunctionRegistry} import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo} import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias} @@ -644,9 +644,7 @@ class SessionCatalog( } protected def failFunctionLookup(name: String): Nothing = { -throw new AnalysisException(s"Undefined function: '$name'. This function is " + - s"neither a registered temporary function nor " + - s"a
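A hypothetical walk-through of the scenario the fix covers; the function name, implementing class and jar path are made up, and a Hive-enabled session is assumed:

```scala
import org.apache.spark.sql.SparkSession

object DescribePermanentUdtfDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-15108 demo")
      .enableHiveSupport()
      .getOrCreate()

    // Register a permanent UDTF in the catalog (names and path are illustrative only).
    spark.sql(
      "CREATE FUNCTION my_explode AS 'com.example.MyGenericUDTF' USING JAR '/tmp/my-udtf.jar'")

    // Before this patch, DESCRIBE FUNCTION reported "Undefined function" for such a function,
    // because the lookup only consulted the functionRegistry; it now falls back to the catalog.
    spark.sql("DESCRIBE FUNCTION my_explode").show(truncate = false)

    spark.stop()
  }
}
```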
spark git commit: [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsuportted types in ORC
Repository: spark Updated Branches: refs/heads/branch-2.0 3f6a13c8a -> d7c755561 [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsuportted types in ORC ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14962 ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`. This is apparently OK because both `IsNull` and `IsNotNull` do not take a type as an argument (Hive 1.2.x) during building filters (`SearchArgument`) in Spark-side but they do not filter correctly because stored statistics always produces `null` for not supported types (eg `ArrayType`) in ORC-side. So, it is always `true` for `IsNull` which ends up with always `false` for `IsNotNull`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2) This looks prevented in Hive 1.3.x >= by forcing to give a type ([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) when building a filter ([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)) but Hive 1.2.x seems not doing this. This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` resembles `ParquetFilters`. ## How was this patch tested? Unittests in `OrcQuerySuite` and `OrcFilterSuite` and `sbt scalastyle`. Author: hyukjinkwonAuthor: Hyukjin Kwon Closes #12777 from HyukjinKwon/SPARK-14962. 
(cherry picked from commit fa928ff9a3c1de5d5aff9d14e6bc1bd03fcca087) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7c75556 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7c75556 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7c75556 Branch: refs/heads/branch-2.0 Commit: d7c755561270ee8ec1c44df2e10a8bcb4985c3de Parents: 3f6a13c Author: hyukjinkwon Authored: Sat May 7 01:46:45 2016 +0800 Committer: Cheng Lian Committed: Sat May 7 01:53:08 2016 +0800 -- .../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../apache/spark/sql/hive/orc/OrcFilters.scala | 63 .../apache/spark/sql/hive/orc/OrcRelation.scala | 19 ++--- .../spark/sql/hive/orc/OrcFilterSuite.scala | 75 .../spark/sql/hive/orc/OrcQuerySuite.scala | 14 .../spark/sql/hive/orc/OrcSourceSuite.scala | 9 ++- 6 files changed, 126 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d7c75556/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala index ffb206a..6d2b95e 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala @@ -213,7 +213,7 @@ private[sql] trait SQLTestUtils */ protected def stripSparkFilter(df: DataFrame): DataFrame = { val schema = df.schema -val withoutFilters = df.queryExecution.sparkPlan transform { +val withoutFilters = df.queryExecution.sparkPlan.transform { case FilterExec(_, child) => child } http://git-wip-us.apache.org/repos/asf/spark/blob/d7c75556/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala index c025c12..c463bc8 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala @@ -17,13 +17,12 @@ package org.apache.spark.sql.hive.orc -import org.apache.hadoop.hive.common.`type`.{HiveChar, HiveDecimal, HiveVarchar} import org.apache.hadoop.hive.ql.io.sarg.{SearchArgument, SearchArgumentFactory} import org.apache.hadoop.hive.ql.io.sarg.SearchArgument.Builder -import org.apache.hadoop.hive.serde2.io.DateWritable import org.apache.spark.internal.Logging import
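A simplified sketch of the type guard this patch introduces. This is not the exact OrcFilters code, and the set of supported types shown is an approximation; the point is that IsNull/IsNotNull leaves are only built when the column type has usable ORC statistics:

```scala
import org.apache.spark.sql.types._

object OrcFilterTypeGuardSketch {
  // Returns true only for types whose ORC column statistics can actually be evaluated.
  def isSearchableType(dataType: DataType): Boolean = dataType match {
    case ByteType | ShortType | IntegerType | LongType => true
    case FloatType | DoubleType => true
    case StringType | BooleanType => true
    case _: DecimalType => true
    // Complex types such as ArrayType, MapType and StructType have no usable ORC statistics
    // (the stored stats come back as null), so pushing IsNull/IsNotNull down filters wrongly.
    case _ => false
  }
}
```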
spark git commit: [SPARK-14512] [DOC] Add python example for QuantileDiscretizer
Repository: spark Updated Branches: refs/heads/master fa928ff9a -> 76ad04d9a [SPARK-14512] [DOC] Add python example for QuantileDiscretizer ## What changes were proposed in this pull request? Add the missing python example for QuantileDiscretizer ## How was this patch tested? manual tests Author: Zheng RuiFengCloses #12281 from zhengruifeng/discret_pe. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76ad04d9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76ad04d9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76ad04d9 Branch: refs/heads/master Commit: 76ad04d9a0a7d4dfb762318d9c7be0d7720f4e1a Parents: fa928ff Author: Zheng RuiFeng Authored: Fri May 6 10:47:13 2016 -0700 Committer: Davies Liu Committed: Fri May 6 10:47:13 2016 -0700 -- docs/ml-features.md | 9 + .../python/ml/quantile_discretizer_example.py | 39 2 files changed, 48 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/76ad04d9/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 0b8f2d7..237e93a 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1118,6 +1118,15 @@ for more details on the API. {% include_example java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java %} + + + +Refer to the [QuantileDiscretizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.QuantileDiscretizer) +for more details on the API. + +{% include_example python/ml/quantile_discretizer_example.py %} + + # Feature Selectors http://git-wip-us.apache.org/repos/asf/spark/blob/76ad04d9/examples/src/main/python/ml/quantile_discretizer_example.py -- diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py b/examples/src/main/python/ml/quantile_discretizer_example.py new file mode 100644 index 000..6ae7bb1 --- /dev/null +++ b/examples/src/main/python/ml/quantile_discretizer_example.py @@ -0,0 +1,39 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from __future__ import print_function + +# $example on$ +from pyspark.ml.feature import QuantileDiscretizer +# $example off$ +from pyspark.sql import SparkSession + + +if __name__ == "__main__": +spark = SparkSession.builder.appName("PythonQuantileDiscretizerExample").getOrCreate() + +# $example on$ +data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)] +dataFrame = spark.createDataFrame(data, ["id", "hour"]) + +discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result") + +result = discretizer.fit(dataFrame).transform(dataFrame) +result.show() +# $example off$ + +spark.stop() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
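For readers working from Scala rather than Python, a rough equivalent of the new example is sketched below; it is not a file added by this commit, and it assumes a spark-shell style session where `spark` is an existing `SparkSession`.

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val dataFrame = spark.createDataFrame(data).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setNumBuckets(3)       // target number of buckets
  .setInputCol("hour")
  .setOutputCol("result")

// fit() derives split points from approximate quantiles of `hour`;
// transform() maps each value to its bucket index in the `result` column.
discretizer.fit(dataFrame).transform(dataFrame).show()
```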
spark git commit: [SPARK-14512] [DOC] Add python example for QuantileDiscretizer
Repository: spark Updated Branches: refs/heads/branch-2.0 1ee621b1d -> 3f6a13c8a [SPARK-14512] [DOC] Add python example for QuantileDiscretizer ## What changes were proposed in this pull request? Add the missing python example for QuantileDiscretizer ## How was this patch tested? manual tests Author: Zheng RuiFengCloses #12281 from zhengruifeng/discret_pe. (cherry picked from commit 76ad04d9a0a7d4dfb762318d9c7be0d7720f4e1a) Signed-off-by: Davies Liu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3f6a13c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3f6a13c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3f6a13c8 Branch: refs/heads/branch-2.0 Commit: 3f6a13c8a49e15c5f88415837f49f8e81092177b Parents: 1ee621b Author: Zheng RuiFeng Authored: Fri May 6 10:47:13 2016 -0700 Committer: Davies Liu Committed: Fri May 6 10:47:36 2016 -0700 -- docs/ml-features.md | 9 + .../python/ml/quantile_discretizer_example.py | 39 2 files changed, 48 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3f6a13c8/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 0b8f2d7..237e93a 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1118,6 +1118,15 @@ for more details on the API. {% include_example java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java %} + + + +Refer to the [QuantileDiscretizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.QuantileDiscretizer) +for more details on the API. + +{% include_example python/ml/quantile_discretizer_example.py %} + + # Feature Selectors http://git-wip-us.apache.org/repos/asf/spark/blob/3f6a13c8/examples/src/main/python/ml/quantile_discretizer_example.py -- diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py b/examples/src/main/python/ml/quantile_discretizer_example.py new file mode 100644 index 000..6ae7bb1 --- /dev/null +++ b/examples/src/main/python/ml/quantile_discretizer_example.py @@ -0,0 +1,39 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# + +from __future__ import print_function + +# $example on$ +from pyspark.ml.feature import QuantileDiscretizer +# $example off$ +from pyspark.sql import SparkSession + + +if __name__ == "__main__": +spark = SparkSession.builder.appName("PythonQuantileDiscretizerExample").getOrCreate() + +# $example on$ +data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)] +dataFrame = spark.createDataFrame(data, ["id", "hour"]) + +discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result") + +result = discretizer.fit(dataFrame).transform(dataFrame) +result.show() +# $example off$ + +spark.stop() - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC
Repository: spark Updated Branches: refs/heads/master a03c5e68a -> fa928ff9a [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14962 ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`. On the Spark side this appears harmless, because in Hive 1.2.x neither `IsNull` nor `IsNotNull` takes a type argument when the filter (`SearchArgument`) is built. On the ORC side, however, the stored statistics always produce `null` for unsupported types (e.g. `ArrayType`), so the pushed-down predicates evaluate incorrectly: `IsNull` is always `true`, which makes `IsNotNull` always `false`. (See [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2.) Hive 1.3.x and later appear to prevent this by requiring a type ([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) when building a filter ([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)), but Hive 1.2.x does not. This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types; `OrcFilters` now resembles `ParquetFilters`. ## How was this patch tested? Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`. Author: hyukjinkwon Author: Hyukjin Kwon Closes #12777 from HyukjinKwon/SPARK-14962.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa928ff9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa928ff9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa928ff9 Branch: refs/heads/master Commit: fa928ff9a3c1de5d5aff9d14e6bc1bd03fcca087 Parents: a03c5e6 Author: hyukjinkwon Authored: Sat May 7 01:46:45 2016 +0800 Committer: Cheng Lian Committed: Sat May 7 01:46:45 2016 +0800 -- .../apache/spark/sql/test/SQLTestUtils.scala| 2 +- .../apache/spark/sql/hive/orc/OrcFilters.scala | 63 .../apache/spark/sql/hive/orc/OrcRelation.scala | 19 ++--- .../spark/sql/hive/orc/OrcFilterSuite.scala | 75 .../spark/sql/hive/orc/OrcQuerySuite.scala | 14 .../spark/sql/hive/orc/OrcSourceSuite.scala | 9 ++- 6 files changed, 126 insertions(+), 56 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fa928ff9/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala index ffb206a..6d2b95e 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala @@ -213,7 +213,7 @@ private[sql] trait SQLTestUtils */ protected def stripSparkFilter(df: DataFrame): DataFrame = { val schema = df.schema -val withoutFilters = df.queryExecution.sparkPlan transform { +val withoutFilters = df.queryExecution.sparkPlan.transform { case FilterExec(_, child) => child } http://git-wip-us.apache.org/repos/asf/spark/blob/fa928ff9/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala index c025c12..c463bc8 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala @@ -17,13 +17,12 @@ package org.apache.spark.sql.hive.orc -import org.apache.hadoop.hive.common.`type`.{HiveChar, HiveDecimal, HiveVarchar} import org.apache.hadoop.hive.ql.io.sarg.{SearchArgument, SearchArgumentFactory} import org.apache.hadoop.hive.ql.io.sarg.SearchArgument.Builder -import org.apache.hadoop.hive.serde2.io.DateWritable import org.apache.spark.internal.Logging import org.apache.spark.sql.sources._ +import org.apache.spark.sql.types._ /** * Helper object for building ORC `SearchArgument`s,
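A standalone way to observe the behaviour this patch targets is sketched below. It is not one of the committed tests; it assumes spark-hive is on the classpath (ORC support lives in the Hive module in this branch) and that `spark.sql.orc.filterPushdown` enables pushdown. The path and object names are made up for illustration.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object OrcNullPushdownCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OrcNullPushdownCheck")
      .master("local[2]")
      .config("spark.sql.orc.filterPushdown", "true") // pushdown is off by default
      .enableHiveSupport()
      .getOrCreate()

    val path = Files.createTempDirectory("orc-pushdown-check").resolve("t").toString

    // An array column: ORC keeps no usable min/max statistics for it.
    spark.range(10).selectExpr("array(id) as a").write.orc(path)

    // Before this fix, the IsNotNull predicate on `a` could be pushed to ORC,
    // where the missing statistics made it evaluate to false and row groups
    // were wrongly skipped; with the fix no filter is created for `a`, so all
    // ten rows come back.
    val count = spark.read.orc(path).where("a is not null").count()
    println(s"rows surviving IS NOT NULL on an array column: $count")

    spark.stop()
  }
}
```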
spark git commit: [SPARK-14738][BUILD] Separate docker integration tests from main build
Repository: spark Updated Branches: refs/heads/branch-2.0 42f2ee6c5 -> 1ee621b1d [SPARK-14738][BUILD] Separate docker integration tests from main build ## What changes were proposed in this pull request? Create a Maven profile for executing the docker integration tests with Maven; remove the docker integration tests from the main sbt build; update the documentation on how to run the docker integration tests from sbt. ## How was this patch tested? Manual test of the docker integration tests as in: mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test ## Other comments Note that the DB2 Docker tests are still disabled, as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them to the right level before enabling those tests. They do run OK locally with the updates from PR #12348. Author: Luciano Resende Closes #12508 from lresende/docker. (cherry picked from commit a03c5e68abd8c066c97ebd33070d59dce1a7) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ee621b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ee621b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ee621b1 Branch: refs/heads/branch-2.0 Commit: 1ee621b1d949ce8e1bb41ef3fe19dfaad4a90ab1 Parents: 42f2ee6 Author: Luciano Resende Authored: Fri May 6 12:25:45 2016 +0100 Committer: Sean Owen Committed: Fri May 6 14:24:06 2016 +0100 -- docs/building-spark.md | 12 .../apache/spark/sql/jdbc/MySQLIntegrationSuite.scala | 3 --- .../apache/spark/sql/jdbc/OracleIntegrationSuite.scala | 5 + .../spark/sql/jdbc/PostgresIntegrationSuite.scala | 3 --- pom.xml | 8 +++- project/SparkBuild.scala| 3 ++- 6 files changed, 22 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1ee621b1/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index fec442a..13c95e4 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -190,6 +190,18 @@ or Java 8 tests are automatically enabled when a Java 8 JDK is detected. If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests. +# Running Docker based Integration Test Suites + +Running only docker based integration tests and nothing else. + +mvn install -DskipTests +mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 + +or + +sbt docker-integration-tests/test + + # Packaging without Hadoop Dependencies for YARN The assembly directory produced by `mvn package` will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with `yarn.application.classpath`. The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
http://git-wip-us.apache.org/repos/asf/spark/blob/1ee621b1/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala -- diff --git a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala index aa47228..a70ed98 100644 --- a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala +++ b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala @@ -21,12 +21,9 @@ import java.math.BigDecimal import java.sql.{Connection, Date, Timestamp} import java.util.Properties -import org.scalatest.Ignore - import org.apache.spark.tags.DockerTest @DockerTest -@Ignore class MySQLIntegrationSuite extends DockerJDBCIntegrationSuite { override val db = new DatabaseOnDocker { override val imageName = "mysql:5.7.9" http://git-wip-us.apache.org/repos/asf/spark/blob/1ee621b1/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala -- diff --git a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
spark git commit: [SPARK-14738][BUILD] Separate docker integration tests from main build
Repository: spark Updated Branches: refs/heads/master 157a49aa4 -> a03c5e68a [SPARK-14738][BUILD] Separate docker integration tests from main build ## What changes were proposed in this pull request? Create a Maven profile for executing the docker integration tests with Maven; remove the docker integration tests from the main sbt build; update the documentation on how to run the docker integration tests from sbt. ## How was this patch tested? Manual test of the docker integration tests as in: mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test ## Other comments Note that the DB2 Docker tests are still disabled, as there is a kernel version issue on the AMPLab Jenkins slaves and we would need to get them to the right level before enabling those tests. They do run OK locally with the updates from PR #12348. Author: Luciano Resende Closes #12508 from lresende/docker. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a03c5e68 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a03c5e68 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a03c5e68 Branch: refs/heads/master Commit: a03c5e68abd8c066c97ebd33070d59dce1a7 Parents: 157a49a Author: Luciano Resende Authored: Fri May 6 12:25:45 2016 +0100 Committer: Sean Owen Committed: Fri May 6 12:25:45 2016 +0100 -- docs/building-spark.md | 12 .../apache/spark/sql/jdbc/MySQLIntegrationSuite.scala | 3 --- .../apache/spark/sql/jdbc/OracleIntegrationSuite.scala | 5 + .../spark/sql/jdbc/PostgresIntegrationSuite.scala | 3 --- pom.xml | 8 +++- project/SparkBuild.scala| 3 ++- 6 files changed, 22 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a03c5e68/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index fec442a..13c95e4 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -190,6 +190,18 @@ or Java 8 tests are automatically enabled when a Java 8 JDK is detected. If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests. +# Running Docker based Integration Test Suites + +Running only docker based integration tests and nothing else. + +mvn install -DskipTests +mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 + +or + +sbt docker-integration-tests/test + + # Packaging without Hadoop Dependencies for YARN The assembly directory produced by `mvn package` will, by default, include all of Spark's dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this causes multiple versions of these to appear on executor classpaths: the version packaged in the Spark assembly and the version on each node, included with `yarn.application.classpath`. The `hadoop-provided` profile builds the assembly without including Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.
http://git-wip-us.apache.org/repos/asf/spark/blob/a03c5e68/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala -- diff --git a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala index aa47228..a70ed98 100644 --- a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala +++ b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala @@ -21,12 +21,9 @@ import java.math.BigDecimal import java.sql.{Connection, Date, Timestamp} import java.util.Properties -import org.scalatest.Ignore - import org.apache.spark.tags.DockerTest @DockerTest -@Ignore class MySQLIntegrationSuite extends DockerJDBCIntegrationSuite { override val db = new DatabaseOnDocker { override val imageName = "mysql:5.7.9" http://git-wip-us.apache.org/repos/asf/spark/blob/a03c5e68/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala -- diff --git a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala index 357866b..c5e1f86 100644
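On the sbt side, the change amounts to keeping the docker sub-project buildable while leaving it out of the default aggregate. The sketch below illustrates that pattern in a plain `build.sbt` with made-up module names and dependencies; it is not Spark's actual `SparkBuild.scala`.

```scala
// build.sbt sketch (illustrative layout, not Spark's real build definition).
lazy val core = project in file("core")

// The docker-backed JDBC suites live in their own sub-project whose id matches
// the command shown in the docs (`sbt docker-integration-tests/test`).
lazy val dockerIntegrationTests =
  Project("docker-integration-tests", file("external/docker-integration-tests"))
    .dependsOn(core % "compile->compile;test->test")

// The root aggregate deliberately leaves the docker project out, so a plain
// `sbt test` skips it while it can still be invoked explicitly.
lazy val root = Project("root", file("."))
  .aggregate(core)
```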
spark git commit: [SPARK-14915] Fix incorrect resolution of merge conflict in commit bf3c0608f1779b4dd837b8289ec1d4516e145aea
Repository: spark Updated Branches: refs/heads/branch-1.6 bf3c0608f -> a3aa22a59 [SPARK-14915] Fix incorrect resolution of merge conflict in commit bf3c0608f1779b4dd837b8289ec1d4516e145aea ## What changes were proposed in this pull request? I botched the back-port of SPARK-14915 to branch-1.6 in https://github.com/apache/spark/commit/bf3c0608f1779b4dd837b8289ec1d4516e145aea resulting in a code block being added twice. This simply removes it, such that the net change is the intended one. ## How was this patch tested? Jenkins tests. (This in theory has already been tested.) Author: Sean OwenCloses #12950 from srowen/SPARK-14915.2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3aa22a5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3aa22a5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3aa22a5 Branch: refs/heads/branch-1.6 Commit: a3aa22a5915c2cc6bdd6810227a3698c59823009 Parents: bf3c060 Author: Sean Owen Authored: Fri May 6 12:21:25 2016 +0100 Committer: Sean Owen Committed: Fri May 6 12:21:25 2016 +0100 -- .../scala/org/apache/spark/scheduler/TaskSetManager.scala | 9 - 1 file changed, 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a3aa22a5/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala index 3ca701d..77a8a19 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala @@ -730,15 +730,6 @@ private[spark] class TaskSetManager( addPendingTask(index) } -if (successful(index)) { - logInfo( -s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " + -"but another instance of the task has already succeeded, " + -"so not re-queuing the task to be re-executed.") -} else { - addPendingTask(index) -} - if (!isZombie && state != TaskState.KILLED && reason.isInstanceOf[TaskFailedReason] && reason.asInstanceOf[TaskFailedReason].countTowardsTaskFailures) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org