spark git commit: [SPARK-15122] [SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out

2016-05-06 Thread davies
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 4ccc5643f -> 49e666138


[SPARK-15122] [SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out

## What changes were proposed in this pull request?
The official TPC-DS 41 query currently fails because it contains a scalar 
subquery with a disjunctive correlated predicate (the correlated predicates 
were nested in ORs). This makes the `Analyzer` pull out the entire predicate 
which is wrong and causes the following (correct) analysis exception: `The 
correlated scalar subquery can only contain equality predicates`

This PR fixes this by first simplifying (or normalizing) the correlated 
predicates before pulling them out of the subquery.
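
For reference, a minimal sketch of the fixed query shape (the tables `l(a, b)` and `r(c, d)` and their data are illustrative, mirroring the test added below):

```
import org.apache.spark.sql.SparkSession

object DisjunctiveCorrelatedSubqueryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-15122").getOrCreate()
    import spark.implicits._

    Seq((1, 2.0), (2, 1.0), (3, 3.0)).toDF("a", "b").createOrReplaceTempView("l")
    Seq((3, 2.0), (3, 1.0)).toDF("c", "d").createOrReplaceTempView("r")

    // The correlated predicate is hidden inside an OR:
    //   (a = c and d = 2.0) or (a = c and d = 1.0)
    // BooleanSimplification normalizes it to
    //   a = c and (d = 2.0 or d = 1.0)
    // so only the plain equality a = c needs to be pulled out of the subquery.
    spark.sql(
      """select a
        |from   l
        |where  (select count(*)
        |        from   r
        |        where (a = c and d = 2.0) or (a = c and d = 1.0)) > 0
      """.stripMargin).show()

    spark.stop()
  }
}
```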

## How was this patch tested?
Manual testing on TPC-DS 41, and added a test to SubquerySuite.

Author: Herman van Hovell 

Closes #12954 from hvanhovell/SPARK-15122.

(cherry picked from commit df89f1d43d4eaa1dd8a439a8e48bca16b67d5b48)
Signed-off-by: Davies Liu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/49e66613
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/49e66613
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/49e66613

Branch: refs/heads/branch-2.0
Commit: 49e666138b2c74b8145faf7adc6fd090656e5ea0
Parents: 4ccc564
Author: Herman van Hovell 
Authored: Fri May 6 21:06:03 2016 -0700
Committer: Davies Liu 
Committed: Fri May 6 21:06:14 2016 -0700

--
 .../apache/spark/sql/catalyst/analysis/Analyzer.scala   |  4 +++-
 .../test/scala/org/apache/spark/sql/SubquerySuite.scala | 12 
 2 files changed, 15 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/49e66613/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 527d5b6..9e9a856 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -26,6 +26,7 @@ import 
org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
 import org.apache.spark.sql.catalyst.encoders.OuterScopes
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.optimizer.BooleanSimplification
 import org.apache.spark.sql.catalyst.planning.IntegerIndex
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, _}
@@ -958,7 +959,8 @@ class Analyzer(
 localPredicateReferences -- p.outputSet
   }
 
-  val transformed = sub transformUp {
+  // Simplify the predicates before pulling them out.
+  val transformed = BooleanSimplification(sub) transformUp {
 case f @ Filter(cond, child) =>
   // Find all predicates with an outer reference.
   val (correlated, local) = 
splitConjunctivePredicates(cond).partition(containsOuter)

http://git-wip-us.apache.org/repos/asf/spark/blob/49e66613/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
index 80bb4e0..17ac0c8 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
@@ -281,4 +281,16 @@ class SubquerySuite extends QueryTest with 
SharedSQLContext {
 assert(msg1.getMessage.contains(
   "The correlated scalar subquery can only contain equality predicates"))
   }
+
+  test("disjunctive correlated scalar subquery") {
+checkAnswer(
+  sql("""
+|select a
+|from   l
+|where  (select count(*)
+|from   r
+|where (a = c and d = 2.0) or (a = c and d = 1.0)) > 0
+""".stripMargin),
+  Row(3) :: Nil)
+  }
 }





spark git commit: [SPARK-15051][SQL] Create a TypedColumn alias

2016-05-06 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 f6d7292d1 -> 4ccc5643f


[SPARK-15051][SQL] Create a TypedColumn alias

## What changes were proposed in this pull request?

Currently, when we create an alias for a TypedColumn produced by a user-defined 
Aggregator (for example: `agg(aggSum.toColumn as "a")`), Spark uses the alias 
function from Column (`as`). That function returns a plain Column containing a 
TypedAggregateExpression, which is unresolved because the inputDeserializer is 
not defined. Later, the aggregate function (`agg`) injects the inputDeserializer 
back into the TypedAggregateExpression, but only if the aggregate columns are 
TypedColumns. In the case above, the TypedAggregateExpression therefore remains 
unresolved, which causes the problem reported in this JIRA: 
[SPARK-15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK).

This PR proposes to add an alias function to TypedColumn that returns a 
TypedColumn, using a code path similar to Column's alias function.

For Spark's built-in aggregate functions, such as max, aliasing already works, 
for example:

val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil)

Thanks for comments.
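
A minimal sketch of the fixed scenario, using a hypothetical `RowSum` aggregator in place of the suite's `RowAgg` (spark-shell style, assuming Spark 2.0):

```
import org.apache.spark.sql.{Encoder, Encoders, Row, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical aggregator over DataFrame rows that sums the first (Int) column.
object RowSum extends Aggregator[Row, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, r: Row): Int = b + r.getInt(0)
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(reduction: Int): Int = reduction
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")

// With the TypedColumn.name override below, the alias preserves the TypedColumn
// (and its encoder), so agg can still inject the input deserializer and the
// plan resolves; previously this line failed to resolve.
df.agg(RowSum.toColumn as "total").show()  // total = 6
```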

## How was this patch tested?

Added test cases in DatasetAggregatorSuite.scala and ran the SQL-related queries 
against this patch.

Author: Kevin Yu 

Closes #12893 from kevinyu98/spark-15051.

(cherry picked from commit 607a27a0d149be049091bcf274a73b8476b36c90)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ccc5643
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ccc5643
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ccc5643

Branch: refs/heads/branch-2.0
Commit: 4ccc5643f90133f5c80514915fd5616c77837f10
Parents: f6d7292
Author: Kevin Yu 
Authored: Sat May 7 11:13:48 2016 +0800
Committer: Wenchen Fan 
Committed: Sat May 7 11:14:02 2016 +0800

--
 .../main/scala/org/apache/spark/sql/Column.scala | 19 +--
 .../spark/sql/DatasetAggregatorSuite.scala   |  8 
 2 files changed, 21 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4ccc5643/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
index c58adda..9b8334d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
@@ -68,6 +68,18 @@ class TypedColumn[-T, U](
 }
 new TypedColumn[T, U](newExpr, encoder)
   }
+
+  /**
+   * Gives the TypedColumn a name (alias).
+   * If the current TypedColumn has metadata associated with it, this metadata 
will be propagated
+   * to the new column.
+   *
+   * @group expr_ops
+   * @since 2.0.0
+   */
+  override def name(alias: String): TypedColumn[T, U] =
+new TypedColumn[T, U](super.name(alias).expr, encoder)
+
 }
 
 /**
@@ -910,12 +922,7 @@ class Column(protected[sql] val expr: Expression) extends 
Logging {
* @group expr_ops
* @since 1.3.0
*/
-  def as(alias: Symbol): Column = withExpr {
-expr match {
-  case ne: NamedExpression => Alias(expr, alias.name)(explicitMetadata = 
Some(ne.metadata))
-  case other => Alias(other, alias.name)()
-}
-  }
+  def as(alias: Symbol): Column = name(alias.name)
 
   /**
* Gives the column an alias with metadata.

http://git-wip-us.apache.org/repos/asf/spark/blob/4ccc5643/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
index 6eae3ed..b2a0f3d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
@@ -232,4 +232,12 @@ class DatasetAggregatorSuite extends QueryTest with 
SharedSQLContext {
   "a" -> Seq(1, 2)
 )
   }
+
+  test("spark-15051 alias of aggregator in DataFrame/Dataset[Row]") {
+val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
+checkAnswer(df1.agg(RowAgg.toColumn as "b"), Row(6) :: Nil)
+
+val df2 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
+checkAnswer(df2.agg(RowAgg.toColumn as "b").select("b"), 

spark git commit: [SPARK-15051][SQL] Create a TypedColumn alias

2016-05-06 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master a21a3bbe6 -> 607a27a0d


[SPARK-15051][SQL] Create a TypedColumn alias

## What changes were proposed in this pull request?

Currently, when we create an alias for a TypedColumn produced by a user-defined 
Aggregator (for example: `agg(aggSum.toColumn as "a")`), Spark uses the alias 
function from Column (`as`). That function returns a plain Column containing a 
TypedAggregateExpression, which is unresolved because the inputDeserializer is 
not defined. Later, the aggregate function (`agg`) injects the inputDeserializer 
back into the TypedAggregateExpression, but only if the aggregate columns are 
TypedColumns. In the case above, the TypedAggregateExpression therefore remains 
unresolved, which causes the problem reported in this JIRA: 
[SPARK-15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK).

This PR proposes to add an alias function to TypedColumn that returns a 
TypedColumn, using a code path similar to Column's alias function.

For Spark's built-in aggregate functions, such as max, aliasing already works, 
for example:

val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil)

Thanks for comments.

## How was this patch tested?

Added test cases in DatasetAggregatorSuite.scala and ran the SQL-related queries 
against this patch.

Author: Kevin Yu 

Closes #12893 from kevinyu98/spark-15051.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/607a27a0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/607a27a0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/607a27a0

Branch: refs/heads/master
Commit: 607a27a0d149be049091bcf274a73b8476b36c90
Parents: a21a3bb
Author: Kevin Yu 
Authored: Sat May 7 11:13:48 2016 +0800
Committer: Wenchen Fan 
Committed: Sat May 7 11:13:48 2016 +0800

--
 .../main/scala/org/apache/spark/sql/Column.scala | 19 +--
 .../spark/sql/DatasetAggregatorSuite.scala   |  8 
 2 files changed, 21 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/607a27a0/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
index c58adda..9b8334d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Column.scala
@@ -68,6 +68,18 @@ class TypedColumn[-T, U](
 }
 new TypedColumn[T, U](newExpr, encoder)
   }
+
+  /**
+   * Gives the TypedColumn a name (alias).
+   * If the current TypedColumn has metadata associated with it, this metadata 
will be propagated
+   * to the new column.
+   *
+   * @group expr_ops
+   * @since 2.0.0
+   */
+  override def name(alias: String): TypedColumn[T, U] =
+new TypedColumn[T, U](super.name(alias).expr, encoder)
+
 }
 
 /**
@@ -910,12 +922,7 @@ class Column(protected[sql] val expr: Expression) extends 
Logging {
* @group expr_ops
* @since 1.3.0
*/
-  def as(alias: Symbol): Column = withExpr {
-expr match {
-  case ne: NamedExpression => Alias(expr, alias.name)(explicitMetadata = 
Some(ne.metadata))
-  case other => Alias(other, alias.name)()
-}
-  }
+  def as(alias: Symbol): Column = name(alias.name)
 
   /**
* Gives the column an alias with metadata.

http://git-wip-us.apache.org/repos/asf/spark/blob/607a27a0/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
index 6eae3ed..b2a0f3d 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
@@ -232,4 +232,12 @@ class DatasetAggregatorSuite extends QueryTest with 
SharedSQLContext {
   "a" -> Seq(1, 2)
 )
   }
+
+  test("spark-15051 alias of aggregator in DataFrame/Dataset[Row]") {
+val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
+checkAnswer(df1.agg(RowAgg.toColumn as "b"), Row(6) :: Nil)
+
+val df2 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j")
+checkAnswer(df2.agg(RowAgg.toColumn as "b").select("b"), Row(6) :: Nil)
+  }
 }



spark git commit: [SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments

2016-05-06 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 d98dd72e7 -> f6d7292d1


[SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments

## What changes were proposed in this pull request?
Remove the comment, since it no longer applies. See the discussion here: 
https://github.com/apache/spark/pull/12865#discussion-diff-61946906

Author: Sandeep Singh 

Closes #12953 from techaddict/SPARK-15087-FOLLOW-UP.

(cherry picked from commit a21a3bbe6931e162c53a61daff1ef428fb802b8a)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f6d7292d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f6d7292d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f6d7292d

Branch: refs/heads/branch-2.0
Commit: f6d7292d1c46ba04cba72ae798bdffbdfd97aa53
Parents: d98dd72
Author: Sandeep Singh 
Authored: Sat May 7 11:10:14 2016 +0800
Committer: Wenchen Fan 
Committed: Sat May 7 11:10:44 2016 +0800

--
 .../scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala| 5 -
 1 file changed, 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f6d7292d/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala 
b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
index 8ce8fb1..371fb86 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
@@ -389,11 +389,6 @@ private[spark] class TaskSchedulerImpl(
 // (taskId, stageId, stageAttemptId, accumUpdates)
 val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] 
= synchronized {
   accumUpdates.flatMap { case (id, updates) =>
-// We should call `acc.value` here as we are at driver side now.  
However, the RPC framework
-// optimizes local message delivery so that messages do not need to de 
serialized and
-// deserialized.  This brings trouble to the accumulator framework, 
which depends on
-// serialization to set the `atDriverSide` flag.  Here we call 
`acc.localValue` instead to
-// be more robust about this issue.
 val accInfos = updates.map(acc => acc.toInfo(Some(acc.value), None))
 taskIdToTaskSetManager.get(id).map { taskSetMgr =>
   (id, taskSetMgr.stageId, taskSetMgr.taskSet.stageAttemptId, accInfos)





spark git commit: [SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments

2016-05-06 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master cc95f1ed5 -> a21a3bbe6


[SPARK-15087][MINOR][DOC] Follow Up: Fix the Comments

## What changes were proposed in this pull request?
Remove the comment, since it no longer applies. See the discussion here: 
https://github.com/apache/spark/pull/12865#discussion-diff-61946906

Author: Sandeep Singh 

Closes #12953 from techaddict/SPARK-15087-FOLLOW-UP.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a21a3bbe
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a21a3bbe
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a21a3bbe

Branch: refs/heads/master
Commit: a21a3bbe6931e162c53a61daff1ef428fb802b8a
Parents: cc95f1e
Author: Sandeep Singh 
Authored: Sat May 7 11:10:14 2016 +0800
Committer: Wenchen Fan 
Committed: Sat May 7 11:10:14 2016 +0800

--
 .../scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala| 5 -
 1 file changed, 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a21a3bbe/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala 
b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
index 8ce8fb1..371fb86 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
@@ -389,11 +389,6 @@ private[spark] class TaskSchedulerImpl(
 // (taskId, stageId, stageAttemptId, accumUpdates)
 val accumUpdatesWithTaskIds: Array[(Long, Int, Int, Seq[AccumulableInfo])] 
= synchronized {
   accumUpdates.flatMap { case (id, updates) =>
-// We should call `acc.value` here as we are at driver side now.  
However, the RPC framework
-// optimizes local message delivery so that messages do not need to de 
serialized and
-// deserialized.  This brings trouble to the accumulator framework, 
which depends on
-// serialization to set the `atDriverSide` flag.  Here we call 
`acc.localValue` instead to
-// be more robust about this issue.
 val accInfos = updates.map(acc => acc.toInfo(Some(acc.value), None))
 taskIdToTaskSetManager.get(id).map { taskSetMgr =>
   (id, taskSetMgr.stageId, taskSetMgr.taskSet.stageAttemptId, accInfos)





spark git commit: [SPARK-1239] Improve fetching of map output statuses

2016-05-06 Thread davies
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 dc1562e97 -> d98dd72e7


[SPARK-1239] Improve fetching of map output statuses

The main issue we are trying to solve is the memory bloat of the driver when 
tasks request the map output statuses. With a large number of tasks you either 
need a huge amount of memory on the driver or you have to repartition to a 
smaller number of partitions, which makes it really difficult to run jobs with 
tens of thousands of tasks.

The main issues that cause the memory bloat are:
1) There is no flow control on sending the map output status responses. We 
serialize the map output statuses and then hand them off to Netty to send. 
Netty sends asynchronously and can't send them fast enough to keep up with 
incoming requests, so we end up with lots of copies of the serialized map 
output statuses sitting there, which causes huge bloat when you have tens of 
thousands of tasks and the map output statuses are in the tens of MB.
2) When the initial reduce tasks are started up, they all request the map 
output statuses from the driver. These requests are handled by multiple threads 
in parallel, so even though we check for a cached version, initially, when we 
don't have a cached version yet, many of the initial requests can all end up 
serializing the exact same map output statuses.

This patch does a couple of things:
- When the map output status size is over a threshold (default 512K), it uses a 
broadcast to send the map statuses. This means we no longer send a large 
serialized map output status per response, so we don't have issues with memory 
bloat: the message sizes are now in the 300-400 byte range and the map output 
statuses are broadcast. If the size is under the threshold it is sent as 
before; the message now contains a DIRECT indicator.
- Synchronize the incoming requests so that one thread caches the serialized 
output and broadcasts the map output statuses, which can then be reused by 
everyone else. This ensures we don't create multiple broadcast variables when 
we don't need to. To make this work I added a second thread pool that the 
Dispatcher hands the requests to, so that those threads can block without 
blocking the main dispatcher threads (which would cause things like heartbeats 
not to come through). A minimal sketch of this pattern follows the note below.

Note that some of the design and code was contributed by mridulm.
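
A minimal, hypothetical sketch of the serialize-once / broadcast-over-a-threshold pattern described above (names are illustrative; the real logic lives in MapOutputTracker and uses Spark's Broadcast machinery):

```
import scala.collection.mutable

class MapStatusServer(minSizeForBroadcast: Int = 512 * 1024) {

  sealed trait Reply
  case class Direct(bytes: Array[Byte]) extends Reply        // small payload, sent inline
  case class ViaBroadcast(broadcastId: Long) extends Reply   // stand-in for a Broadcast handle

  private val cache = mutable.Map.empty[Int, Reply]
  private var nextBroadcastId = 0L

  // Only one thread serializes per shuffle; concurrent requests reuse the
  // cached reply, so we never build many copies of the same multi-MB payload.
  def replyFor(shuffleId: Int, statuses: Array[String]): Reply = synchronized {
    cache.getOrElseUpdate(shuffleId, {
      val bytes = statuses.mkString("\n").getBytes("UTF-8")  // stand-in for real serialization
      if (bytes.length < minSizeForBroadcast) {
        Direct(bytes)                                        // the DIRECT path, as before
      } else {
        nextBroadcastId += 1                                 // real code creates a Broadcast and
        ViaBroadcast(nextBroadcastId)                        // sends only its tiny serialized handle
      }
    })
  }
}
```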

## How was this patch tested?

Unit tests and a lot of manual testing. Ran with Akka and Netty RPC, and with 
dynamic allocation both on and off.

One of the large jobs I used to test this was a join of 15 TB of data. It had 
200,000 map tasks and 20,000 reduce tasks, with executors ranging from 200 to 
2000. This job ran successfully with 5 GB of memory on the driver with these 
changes. Without these changes I was using 20 GB and only had 500 reduce tasks. 
The job had 50 MB of serialized map output statuses and took roughly the same 
amount of time for the executors to get the map output statuses as before.

Ran a variety of other jobs, from large wordcounts to small ones not using 
broadcasts.

Author: Thomas Graves 

Closes #12113 from tgravescs/SPARK-1239.

(cherry picked from commit cc95f1ed5fdf2566bcefe8d10116eee544cf9184)
Signed-off-by: Davies Liu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d98dd72e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d98dd72e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d98dd72e

Branch: refs/heads/branch-2.0
Commit: d98dd72e7baeb59eacec4fefd66397513a607b2f
Parents: dc1562e
Author: Thomas Graves 
Authored: Fri May 6 19:31:26 2016 -0700
Committer: Davies Liu 
Committed: Fri May 6 19:31:35 2016 -0700

--
 .../org/apache/spark/MapOutputTracker.scala | 250 +++
 .../main/scala/org/apache/spark/SparkEnv.scala  |   6 +-
 .../apache/spark/MapOutputTrackerSuite.scala|  99 +---
 .../spark/scheduler/DAGSchedulerSuite.scala |   7 +-
 .../storage/BlockManagerReplicationSuite.scala  |   4 +-
 .../spark/storage/BlockManagerSuite.scala   |   4 +-
 .../streaming/ReceivedBlockHandlerSuite.scala   |   4 +-
 7 files changed, 290 insertions(+), 84 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d98dd72e/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
--
diff --git a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala 
b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
index 3a5caa3..6bd9502 100644
--- a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
+++ b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
@@ -18,13 +18,15 @@
 package org.apache.spark
 
 import java.io._
-import 

spark git commit: [SPARK-1239] Improve fetching of map output statuses

2016-05-06 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master f7b7ef416 -> cc95f1ed5


[SPARK-1239] Improve fetching of map output statuses

The main issue we are trying to solve is the memory bloat of the driver when 
tasks request the map output statuses. With a large number of tasks you either 
need a huge amount of memory on the driver or you have to repartition to a 
smaller number of partitions, which makes it really difficult to run jobs with 
tens of thousands of tasks.

The main issues that cause the memory bloat are:
1) There is no flow control on sending the map output status responses. We 
serialize the map output statuses and then hand them off to Netty to send. 
Netty sends asynchronously and can't send them fast enough to keep up with 
incoming requests, so we end up with lots of copies of the serialized map 
output statuses sitting there, which causes huge bloat when you have tens of 
thousands of tasks and the map output statuses are in the tens of MB.
2) When the initial reduce tasks are started up, they all request the map 
output statuses from the driver. These requests are handled by multiple threads 
in parallel, so even though we check for a cached version, initially, when we 
don't have a cached version yet, many of the initial requests can all end up 
serializing the exact same map output statuses.

This patch does a couple of things:
- When the map output status size is over a threshold (default 512K), it uses a 
broadcast to send the map statuses. This means we no longer send a large 
serialized map output status per response, so we don't have issues with memory 
bloat: the message sizes are now in the 300-400 byte range and the map output 
statuses are broadcast. If the size is under the threshold it is sent as 
before; the message now contains a DIRECT indicator.
- Synchronize the incoming requests so that one thread caches the serialized 
output and broadcasts the map output statuses, which can then be reused by 
everyone else. This ensures we don't create multiple broadcast variables when 
we don't need to. To make this work I added a second thread pool that the 
Dispatcher hands the requests to, so that those threads can block without 
blocking the main dispatcher threads (which would cause things like heartbeats 
not to come through).

Note that some of the design and code was contributed by mridulm.

## How was this patch tested?

Unit tests and a lot of manual testing. Ran with Akka and Netty RPC, and with 
dynamic allocation both on and off.

One of the large jobs I used to test this was a join of 15 TB of data. It had 
200,000 map tasks and 20,000 reduce tasks, with executors ranging from 200 to 
2000. This job ran successfully with 5 GB of memory on the driver with these 
changes. Without these changes I was using 20 GB and only had 500 reduce tasks. 
The job had 50 MB of serialized map output statuses and took roughly the same 
amount of time for the executors to get the map output statuses as before.

Ran a variety of other jobs, from large wordcounts to small ones not using 
broadcasts.

Author: Thomas Graves 

Closes #12113 from tgravescs/SPARK-1239.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc95f1ed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc95f1ed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc95f1ed

Branch: refs/heads/master
Commit: cc95f1ed5fdf2566bcefe8d10116eee544cf9184
Parents: f7b7ef4
Author: Thomas Graves 
Authored: Fri May 6 19:31:26 2016 -0700
Committer: Davies Liu 
Committed: Fri May 6 19:31:26 2016 -0700

--
 .../org/apache/spark/MapOutputTracker.scala | 250 +++
 .../main/scala/org/apache/spark/SparkEnv.scala  |   6 +-
 .../apache/spark/MapOutputTrackerSuite.scala|  99 +---
 .../spark/scheduler/DAGSchedulerSuite.scala |   7 +-
 .../storage/BlockManagerReplicationSuite.scala  |   4 +-
 .../spark/storage/BlockManagerSuite.scala   |   4 +-
 .../streaming/ReceivedBlockHandlerSuite.scala   |   4 +-
 7 files changed, 290 insertions(+), 84 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cc95f1ed/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
--
diff --git a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala 
b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
index 3a5caa3..6bd9502 100644
--- a/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
+++ b/core/src/main/scala/org/apache/spark/MapOutputTracker.scala
@@ -18,13 +18,15 @@
 package org.apache.spark
 
 import java.io._
-import java.util.concurrent.ConcurrentHashMap
+import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue, 
ThreadPoolExecutor}
 

spark git commit: [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths

2016-05-06 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 22f9f5f97 -> dc1562e97


[SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there 
is no partitioning scheme in the given paths

## What changes were proposed in this pull request?
Let's say there are JSON files in the following directory structure:
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json according to the 
behavior in Spark 1.6.1. However, in the current master, all four files are read.

The fix is to make FileCatalog return only the child files of the given path 
when no partitioning is detected (instead of the full recursive list of files).
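
For illustration, the expected behavior with the layout above (assuming a Spark 2.0 `SparkSession` named `spark`; paths are hypothetical):

```
// With no partition columns detected, only the direct children of the given
// path are listed, matching the Spark 1.6.1 behavior this patch restores.
val top = spark.read.json("xyz")           // reads only xyz/file0.json
val sub = spark.read.json("xyz/subdir1")   // reads only xyz/subdir1/file1.json

// A partitioned layout (e.g. xyz/date=2016-05-06/file.json) is still listed
// recursively, because partition discovery applies and allFiles() returns all
// leaf files.
```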

Closes #12774

## How was this patch tested?

Unit tests.

Author: Tathagata Das 

Closes #12856 from tdas/SPARK-14997.

(cherry picked from commit f7b7ef41662d7d02fc4f834f3c6c4ee8802e949c)
Signed-off-by: Yin Huai 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc1562e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc1562e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc1562e9

Branch: refs/heads/branch-2.0
Commit: dc1562e97d570238f8532b3f8051e8df90722732
Parents: 22f9f5f
Author: Tathagata Das 
Authored: Fri May 6 15:04:16 2016 -0700
Committer: Yin Huai 
Committed: Fri May 6 15:04:27 2016 -0700

--
 .../PartitioningAwareFileCatalog.scala  |  24 +-
 .../datasources/FileCatalogSuite.scala  |  68 ++
 .../ParquetPartitionDiscoverySuite.scala|  47 
 .../sql/streaming/FileStreamSourceSuite.scala   |  15 +-
 .../sql/sources/HadoopFsRelationTest.scala  | 232 +--
 5 files changed, 356 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dc1562e9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
index 2c44b39..5f04a6c 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
@@ -61,7 +61,29 @@ abstract class PartitioningAwareFileCatalog(
 }
   }
 
-  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
+  override def allFiles(): Seq[FileStatus] = {
+if (partitionSpec().partitionColumns.isEmpty) {
+  // For each of the input paths, get the list of files inside them
+  paths.flatMap { path =>
+// Make the path qualified (consistent with listLeafFiles and 
listLeafFilesInParallel).
+val fs = path.getFileSystem(hadoopConf)
+val qualifiedPath = fs.makeQualified(path)
+
+// There are three cases possible with each path
+// 1. The path is a directory and has children files in it. Then it 
must be present in
+//leafDirToChildrenFiles as those children files will have been 
found as leaf files.
+//Find its children files from leafDirToChildrenFiles and include 
them.
+// 2. The path is a file, then it will be present in leafFiles. 
Include this path.
+// 3. The path is a directory, but has no children files. Do not 
include this path.
+
+leafDirToChildrenFiles.get(qualifiedPath)
+  .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
+  .getOrElse(Array.empty)
+  }
+} else {
+  leafFiles.values.toSeq
+}
+  }
 
   protected def inferPartitioning(): PartitionSpec = {
 // We use leaf dirs containing data files to discover the schema.

http://git-wip-us.apache.org/repos/asf/spark/blob/dc1562e9/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
new file mode 100644
index 000..dab5c76
--- /dev/null
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright 

spark git commit: [SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths

2016-05-06 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master e20cd9f4c -> f7b7ef416


[SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there 
is no partitioning scheme in the given paths

## What changes were proposed in this pull request?
Let's say there are JSON files in the following directory structure:
```
xyz/file0.json
xyz/subdir1/file1.json
xyz/subdir2/file2.json
xyz/subdir1/subsubdir1/file3.json
```
`sqlContext.read.json("xyz")` should read only file0.json according to the 
behavior in Spark 1.6.1. However, in the current master, all four files are read.

The fix is to make FileCatalog return only the child files of the given path 
when no partitioning is detected (instead of the full recursive list of files).

Closes #12774

## How was this patch tested?

Unit tests.

Author: Tathagata Das 

Closes #12856 from tdas/SPARK-14997.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7b7ef41
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7b7ef41
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7b7ef41

Branch: refs/heads/master
Commit: f7b7ef41662d7d02fc4f834f3c6c4ee8802e949c
Parents: e20cd9f
Author: Tathagata Das 
Authored: Fri May 6 15:04:16 2016 -0700
Committer: Yin Huai 
Committed: Fri May 6 15:04:16 2016 -0700

--
 .../PartitioningAwareFileCatalog.scala  |  24 +-
 .../datasources/FileCatalogSuite.scala  |  68 ++
 .../ParquetPartitionDiscoverySuite.scala|  47 
 .../sql/streaming/FileStreamSourceSuite.scala   |  15 +-
 .../sql/sources/HadoopFsRelationTest.scala  | 232 +--
 5 files changed, 356 insertions(+), 30 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f7b7ef41/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
index 2c44b39..5f04a6c 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala
@@ -61,7 +61,29 @@ abstract class PartitioningAwareFileCatalog(
 }
   }
 
-  override def allFiles(): Seq[FileStatus] = leafFiles.values.toSeq
+  override def allFiles(): Seq[FileStatus] = {
+if (partitionSpec().partitionColumns.isEmpty) {
+  // For each of the input paths, get the list of files inside them
+  paths.flatMap { path =>
+// Make the path qualified (consistent with listLeafFiles and 
listLeafFilesInParallel).
+val fs = path.getFileSystem(hadoopConf)
+val qualifiedPath = fs.makeQualified(path)
+
+// There are three cases possible with each path
+// 1. The path is a directory and has children files in it. Then it 
must be present in
+//leafDirToChildrenFiles as those children files will have been 
found as leaf files.
+//Find its children files from leafDirToChildrenFiles and include 
them.
+// 2. The path is a file, then it will be present in leafFiles. 
Include this path.
+// 3. The path is a directory, but has no children files. Do not 
include this path.
+
+leafDirToChildrenFiles.get(qualifiedPath)
+  .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
+  .getOrElse(Array.empty)
+  }
+} else {
+  leafFiles.values.toSeq
+}
+  }
 
   protected def inferPartitioning(): PartitionSpec = {
 // We use leaf dirs containing data files to discover the schema.

http://git-wip-us.apache.org/repos/asf/spark/blob/f7b7ef41/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
new file mode 100644
index 000..dab5c76
--- /dev/null
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileCatalogSuite.scala
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this 

spark git commit: [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover

2016-05-06 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 1e6b158b1 -> 22f9f5f97


[SPARK-14050][ML] Add multiple languages support and additional methods for 
Stop Words Remover

## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default (see the usage sketch below)
* convert stopwords to a list in Python
* update some tests and docs
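
A short usage sketch of the resulting API (illustrative data; assumes a Spark 2.0 `SparkSession` named `spark` with `spark.implicits._` imported):

```
import org.apache.spark.ml.feature.StopWordsRemover

// English stop words are loaded by default.
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

// Other bundled languages can be loaded explicitly.
val french = new StopWordsRemover()
  .setStopWords(StopWordsRemover.loadDefaultStopWords("french"))
  .setInputCol("raw")
  .setOutputCol("filtered")

val df = Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
).toDF("id", "raw")

remover.transform(df).show(truncate = false)  // stop words such as "the" and "a" are dropped
```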

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse 
Author: Xiangrui Meng 
Author: Burak KOSE 

Closes #12843 from mengxr/SPARK-14050.

(cherry picked from commit e20cd9f4ce977739ce80a2c39f8ebae5e53f72f6)
Signed-off-by: Xiangrui Meng 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/22f9f5f9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/22f9f5f9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/22f9f5f9

Branch: refs/heads/branch-2.0
Commit: 22f9f5f97221a128f7a91d347fa2ace7de7045aa
Parents: 1e6b158
Author: Burak Köse 
Authored: Fri May 6 13:58:12 2016 -0700
Committer: Xiangrui Meng 
Committed: Fri May 6 13:58:24 2016 -0700

--
 licenses/LICENSE-postgresql.txt |  24 ++
 .../apache/spark/ml/feature/stopwords/README|  12 +
 .../spark/ml/feature/stopwords/danish.txt   |  94 ++
 .../apache/spark/ml/feature/stopwords/dutch.txt | 101 ++
 .../spark/ml/feature/stopwords/english.txt  | 153 +
 .../spark/ml/feature/stopwords/finnish.txt  | 235 ++
 .../spark/ml/feature/stopwords/french.txt   | 155 +
 .../spark/ml/feature/stopwords/german.txt   | 231 ++
 .../spark/ml/feature/stopwords/hungarian.txt| 199 
 .../spark/ml/feature/stopwords/italian.txt  | 279 +
 .../spark/ml/feature/stopwords/norwegian.txt| 176 +++
 .../spark/ml/feature/stopwords/portuguese.txt   | 203 
 .../spark/ml/feature/stopwords/russian.txt  | 151 +
 .../spark/ml/feature/stopwords/spanish.txt  | 313 +++
 .../spark/ml/feature/stopwords/swedish.txt  | 114 +++
 .../spark/ml/feature/stopwords/turkish.txt  |  53 
 .../spark/ml/feature/StopWordsRemover.scala | 106 +++
 .../ml/feature/StopWordsRemoverSuite.scala  |  57 +++-
 python/pyspark/ml/feature.py|  38 ++-
 python/pyspark/ml/tests.py  |   7 +
 20 files changed, 2614 insertions(+), 87 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/22f9f5f9/licenses/LICENSE-postgresql.txt
--
diff --git a/licenses/LICENSE-postgresql.txt b/licenses/LICENSE-postgresql.txt
new file mode 100644
index 000..515bf9a
--- /dev/null
+++ b/licenses/LICENSE-postgresql.txt
@@ -0,0 +1,24 @@
+PostgreSQL Database Management System
+(formerly known as Postgres, then as Postgres95)
+
+Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+
+Portions Copyright (c) 1994, The Regents of the University of California
+
+Permission to use, copy, modify, and distribute this software and its
+documentation for any purpose, without fee, and without a written agreement
+is hereby granted, provided that the above copyright notice and this
+paragraph and the following two paragraphs appear in all copies.
+
+IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
+DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+
+THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES,
+INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO
+PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+

http://git-wip-us.apache.org/repos/asf/spark/blob/22f9f5f9/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README
--
diff --git 
a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README 
b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README
new file mode 100755
index 000..ec08a50
--- /dev/null
+++ b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README
@@ -0,0 +1,12 @@
+Stopwords Corpus
+
+This corpus contains lists of stop words for several languages.  These
+are high-frequency grammatical words 

spark git commit: [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover

2016-05-06 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master 5c8fad7b9 -> e20cd9f4c


[SPARK-14050][ML] Add multiple languages support and additional methods for 
Stop Words Remover

## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* convert stopwords to a list in Python
* update some tests and docs

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse 
Author: Xiangrui Meng 
Author: Burak KOSE 

Closes #12843 from mengxr/SPARK-14050.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e20cd9f4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e20cd9f4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e20cd9f4

Branch: refs/heads/master
Commit: e20cd9f4ce977739ce80a2c39f8ebae5e53f72f6
Parents: 5c8fad7
Author: Burak Köse 
Authored: Fri May 6 13:58:12 2016 -0700
Committer: Xiangrui Meng 
Committed: Fri May 6 13:58:12 2016 -0700

--
 licenses/LICENSE-postgresql.txt |  24 ++
 .../apache/spark/ml/feature/stopwords/README|  12 +
 .../spark/ml/feature/stopwords/danish.txt   |  94 ++
 .../apache/spark/ml/feature/stopwords/dutch.txt | 101 ++
 .../spark/ml/feature/stopwords/english.txt  | 153 +
 .../spark/ml/feature/stopwords/finnish.txt  | 235 ++
 .../spark/ml/feature/stopwords/french.txt   | 155 +
 .../spark/ml/feature/stopwords/german.txt   | 231 ++
 .../spark/ml/feature/stopwords/hungarian.txt| 199 
 .../spark/ml/feature/stopwords/italian.txt  | 279 +
 .../spark/ml/feature/stopwords/norwegian.txt| 176 +++
 .../spark/ml/feature/stopwords/portuguese.txt   | 203 
 .../spark/ml/feature/stopwords/russian.txt  | 151 +
 .../spark/ml/feature/stopwords/spanish.txt  | 313 +++
 .../spark/ml/feature/stopwords/swedish.txt  | 114 +++
 .../spark/ml/feature/stopwords/turkish.txt  |  53 
 .../spark/ml/feature/StopWordsRemover.scala | 106 +++
 .../ml/feature/StopWordsRemoverSuite.scala  |  57 +++-
 python/pyspark/ml/feature.py|  38 ++-
 python/pyspark/ml/tests.py  |   7 +
 20 files changed, 2614 insertions(+), 87 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e20cd9f4/licenses/LICENSE-postgresql.txt
--
diff --git a/licenses/LICENSE-postgresql.txt b/licenses/LICENSE-postgresql.txt
new file mode 100644
index 000..515bf9a
--- /dev/null
+++ b/licenses/LICENSE-postgresql.txt
@@ -0,0 +1,24 @@
+PostgreSQL Database Management System
+(formerly known as Postgres, then as Postgres95)
+
+Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+
+Portions Copyright (c) 1994, The Regents of the University of California
+
+Permission to use, copy, modify, and distribute this software and its
+documentation for any purpose, without fee, and without a written agreement
+is hereby granted, provided that the above copyright notice and this
+paragraph and the following two paragraphs appear in all copies.
+
+IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
+DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+
+THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES,
+INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO
+PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+

http://git-wip-us.apache.org/repos/asf/spark/blob/e20cd9f4/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README
--
diff --git 
a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README 
b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README
new file mode 100755
index 000..ec08a50
--- /dev/null
+++ b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/README
@@ -0,0 +1,12 @@
+Stopwords Corpus
+
+This corpus contains lists of stop words for several languages.  These
+are high-frequency grammatical words which are usually ignored in text
+retrieval applications.
+
+They were obtained from:

spark git commit: [SPARK-13566][CORE] Avoid deadlock between BlockManager and Executor Thread

2016-05-06 Thread andrewor14
Repository: spark
Updated Branches:
  refs/heads/branch-1.6 a3aa22a59 -> ab006523b


[SPARK-13566][CORE] Avoid deadlock between BlockManager and Executor Thread

Temporary patch for branch-1.6 to avoid a deadlock between the BlockManager 
and executor threads.

Author: cenyuhai 

Closes #11546 from cenyuhai/SPARK-13566.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab006523
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab006523
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab006523

Branch: refs/heads/branch-1.6
Commit: ab006523b840b1d2dbf3f5ff0a238558e7665a1e
Parents: a3aa22a
Author: cenyuhai 
Authored: Fri May 6 13:50:49 2016 -0700
Committer: Andrew Or 
Committed: Fri May 6 13:50:49 2016 -0700

--
 .../org/apache/spark/executor/Executor.scala|  12 ++
 .../org/apache/spark/storage/BlockManager.scala | 192 ---
 .../spark/storage/BlockManagerSuite.scala   |  38 
 3 files changed, 170 insertions(+), 72 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ab006523/core/src/main/scala/org/apache/spark/executor/Executor.scala
--
diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala 
b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index ab5bde5..b248e12 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -218,6 +218,7 @@ private[spark] class Executor(
   threwException = false
   res
 } finally {
+  val releasedLocks = env.blockManager.releaseAllLocksForTask(taskId)
   val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory()
   if (freedMemory > 0) {
 val errMsg = s"Managed memory leak detected; size = $freedMemory 
bytes, TID = $taskId"
@@ -227,6 +228,17 @@ private[spark] class Executor(
   logError(errMsg)
 }
   }
+
+  if (releasedLocks.nonEmpty) {
+val errMsg =
+  s"${releasedLocks.size} block locks were not released by TID = 
$taskId:\n" +
+  releasedLocks.mkString("[", ", ", "]")
+if (conf.getBoolean("spark.storage.exceptionOnPinLeak", false) && 
!threwException) {
+  throw new SparkException(errMsg)
+} else {
+  logError(errMsg)
+}
+  }
 }
 val taskFinish = System.currentTimeMillis()
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ab006523/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
--
diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
index 538272d..288f756 100644
--- a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
+++ b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
@@ -19,12 +19,14 @@ package org.apache.spark.storage
 
 import java.io._
 import java.nio.{ByteBuffer, MappedByteBuffer}
+import java.util.concurrent.ConcurrentHashMap
 
 import scala.collection.mutable.{ArrayBuffer, HashMap}
 import scala.concurrent.duration._
 import scala.concurrent.{Await, ExecutionContext, Future}
 import scala.util.Random
 import scala.util.control.NonFatal
+import scala.collection.JavaConverters._
 
 import sun.nio.ch.DirectBuffer
 
@@ -65,7 +67,7 @@ private[spark] class BlockManager(
 val master: BlockManagerMaster,
 defaultSerializer: Serializer,
 val conf: SparkConf,
-memoryManager: MemoryManager,
+val memoryManager: MemoryManager,
 mapOutputTracker: MapOutputTracker,
 shuffleManager: ShuffleManager,
 blockTransferService: BlockTransferService,
@@ -164,6 +166,11 @@ private[spark] class BlockManager(
* loaded yet. */
   private lazy val compressionCodec: CompressionCodec = 
CompressionCodec.createCodec(conf)
 
+  // Blocks are removing by another thread
+  val pendingToRemove = new ConcurrentHashMap[BlockId, Long]()
+
+  private val NON_TASK_WRITER = -1024L
+
   /**
* Initializes the BlockManager with the given appId. This is not performed 
in the constructor as
* the appId may not be known at BlockManager instantiation time (in 
particular for the driver,
@@ -1025,54 +1032,58 @@ private[spark] class BlockManager(
 val info = blockInfo.get(blockId).orNull
 
 // If the block has not already been dropped
-if (info != null) {
-  info.synchronized {
-// required ? As of now, this will be invoked only for blocks which 
are ready
-// But in case this changes in future, adding for 

spark git commit: [SPARK-15108][SQL] Describe Permanent UDTF

2016-05-06 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master 76ad04d9a -> 5c8fad7b9


[SPARK-15108][SQL] Describe Permanent UDTF

## What changes were proposed in this pull request?
When describing a UDTF, the command returns a wrong result: it is unable to 
find the function, which has been created and cataloged in the catalog but is 
not in the functionRegistry.

This PR corrects this: if the function is not in the functionRegistry, we 
check the catalog to collect the information about the UDTF function.
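
An illustrative sequence (the function name, UDTF class, and jar path are hypothetical; assumes a SparkSession with Hive support):

```
// Create a permanent function: it is recorded in the catalog, but it is not
// added to the in-memory FunctionRegistry until it is first used.
spark.sql(
  """CREATE FUNCTION udtf_explode2
    |AS 'com.example.hive.GenericUDTFExplode2'
    |USING JAR '/path/to/example-udtf.jar'
  """.stripMargin)

// With this patch, DESCRIBE FUNCTION falls back to the catalog when the
// function is missing from the FunctionRegistry, instead of reporting that
// the function cannot be found.
spark.sql("DESCRIBE FUNCTION udtf_explode2").show(truncate = false)
```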

## How was this patch tested?
Added test cases to verify the results

Author: gatorsmile 

Closes #12885 from gatorsmile/showFunction.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c8fad7b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c8fad7b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c8fad7b

Branch: refs/heads/master
Commit: 5c8fad7b9bfd6677111a8e27e2574f82b04ec479
Parents: 76ad04d
Author: gatorsmile 
Authored: Fri May 6 11:43:07 2016 -0700
Committer: Yin Huai 
Committed: Fri May 6 11:43:07 2016 -0700

--
 .../catalyst/analysis/NoSuchItemException.scala |  7 ++-
 .../sql/catalyst/catalog/SessionCatalog.scala   |  8 ++-
 .../spark/sql/catalyst/parser/AstBuilder.scala  |  6 +--
 .../sql/catalyst/plans/logical/commands.scala   |  4 +-
 .../analysis/UnsupportedOperationsSuite.scala   |  3 +-
 .../sql/catalyst/parser/PlanParserSuite.scala   | 12 +++--
 .../spark/sql/execution/command/functions.scala | 20 
 .../org/apache/spark/sql/SQLQuerySuite.scala|  2 +-
 .../thriftserver/HiveThriftServer2Suites.scala  |  4 +-
 .../spark/sql/hive/client/HiveClient.scala  |  2 +-
 .../sql/hive/execution/SQLQuerySuite.scala  | 54 ++--
 11 files changed, 91 insertions(+), 31 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5c8fad7b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
index 2412ec4..ff13bce 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
@@ -37,5 +37,10 @@ class NoSuchPartitionException(
   extends AnalysisException(
 s"Partition not found in table '$table' database '$db':\n" + 
spec.mkString("\n"))
 
-class NoSuchFunctionException(db: String, func: String)
+class NoSuchPermanentFunctionException(db: String, func: String)
   extends AnalysisException(s"Function '$func' not found in database '$db'")
+
+class NoSuchFunctionException(db: String, func: String)
+  extends AnalysisException(
+s"Undefined function: '$func'. This function is neither a registered 
temporary function nor " +
+s"a permanent function registered in the database '$db'.")

http://git-wip-us.apache.org/repos/asf/spark/blob/5c8fad7b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
index 7127707..9918bce 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
@@ -28,7 +28,7 @@ import org.apache.spark.internal.Logging
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.{CatalystConf, SimpleCatalystConf}
 import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
-import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, 
NoSuchFunctionException, SimpleFunctionRegistry}
+import org.apache.spark.sql.catalyst.analysis.{FunctionRegistry, 
NoSuchFunctionException, NoSuchPermanentFunctionException, 
SimpleFunctionRegistry}
 import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.FunctionBuilder
 import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo}
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias}
@@ -644,9 +644,7 @@ class SessionCatalog(
   }
 
   protected def failFunctionLookup(name: String): Nothing = {
-throw new AnalysisException(s"Undefined function: '$name'. This function 
is " +
-  s"neither a registered temporary function nor " +
-  s"a 

spark git commit: [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC

2016-05-06 Thread lian
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 3f6a13c8a -> d7c755561


[SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and 
`IsNotNull`.

This is apparently OK because both `IsNull` and `IsNotNull` do not take a type 
as an argument (Hive 1.2.x) while building filters (`SearchArgument`) on the 
Spark side, but they do not filter correctly because the stored statistics 
always produce `null` for unsupported types (e.g. `ArrayType`) on the ORC side. 
So `IsNull` always evaluates to `true`, which in turn makes `IsNotNull` always 
`false`. (Please see 
[RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) 
and 
[RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) 
in Hive 1.2.)

This appears to be prevented in Hive 1.3.x and later, which forces a type 
([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56)) 
to be given when building a filter 
([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)), 
but Hive 1.2.x does not seem to do this.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on 
unsupported types. `OrcFilters` resembles `ParquetFilters`.
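
For illustration only, a minimal Scala sketch of the kind of guard this introduces: build a null-check predicate only for columns whose type ORC statistics can answer. The object and method names here (`NullPushdownGuard`, `isSearchableType`, `checkNullPushdown`) and the exact type whitelist are assumptions for the sketch, not the actual `OrcFilters` code.

import org.apache.spark.sql.sources.{Filter, IsNotNull, IsNull}
import org.apache.spark.sql.types._

object NullPushdownGuard {
  // Assumed whitelist: atomic types whose ORC column statistics are reliable.
  // Complex types such as ArrayType, MapType and StructType are rejected.
  private def isSearchableType(dataType: DataType): Boolean = dataType match {
    case ByteType | ShortType | IntegerType | LongType |
         FloatType | DoubleType | BooleanType |
         StringType | DateType | TimestampType => true
    case _ => false
  }

  // Keep an IsNull/IsNotNull filter only when its column has a searchable type.
  def checkNullPushdown(filter: Filter, schema: StructType): Option[Filter] = {
    def searchable(name: String): Boolean =
      schema.find(_.name == name).exists(f => isSearchableType(f.dataType))

    filter match {
      case IsNull(attr) if searchable(attr) => Some(filter)
      case IsNotNull(attr) if searchable(attr) => Some(filter)
      case _ => None
    }
  }
}

With a guard like this, an `IS NOT NULL` predicate on an `ArrayType` column is never turned into an ORC `SearchArgument`, so the misleading row-group statistics can no longer drop rows incorrectly.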

## How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`.
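
As a rough sketch of the kind of regression test this enables (the suite wiring, the `spark` session and helper availability are assumptions, not the exact suite code): write an ORC file with an `ArrayType` column, filter it with a null check, strip the Spark-side filter using `stripSparkFilter`, and verify that no rows were dropped by an ORC-side pushdown.

test("IsNotNull is not pushed down for array columns (sketch)") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Ten rows, each with a non-null array column.
    spark.range(10).selectExpr("array(id, id + 1) AS a").write.orc(path)
    val df = spark.read.orc(path).where("a IS NOT NULL")
    // If IsNotNull were (incorrectly) pushed down, ORC statistics would discard every row.
    assert(stripSparkFilter(df).count() == 10)
  }
}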

Author: hyukjinkwon 
Author: Hyukjin Kwon 

Closes #12777 from HyukjinKwon/SPARK-14962.

(cherry picked from commit fa928ff9a3c1de5d5aff9d14e6bc1bd03fcca087)
Signed-off-by: Cheng Lian 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7c75556
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7c75556
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7c75556

Branch: refs/heads/branch-2.0
Commit: d7c755561270ee8ec1c44df2e10a8bcb4985c3de
Parents: 3f6a13c
Author: hyukjinkwon 
Authored: Sat May 7 01:46:45 2016 +0800
Committer: Cheng Lian 
Committed: Sat May 7 01:53:08 2016 +0800

--
 .../apache/spark/sql/test/SQLTestUtils.scala|  2 +-
 .../apache/spark/sql/hive/orc/OrcFilters.scala  | 63 
 .../apache/spark/sql/hive/orc/OrcRelation.scala | 19 ++---
 .../spark/sql/hive/orc/OrcFilterSuite.scala | 75 
 .../spark/sql/hive/orc/OrcQuerySuite.scala  | 14 
 .../spark/sql/hive/orc/OrcSourceSuite.scala |  9 ++-
 6 files changed, 126 insertions(+), 56 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d7c75556/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
index ffb206a..6d2b95e 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
@@ -213,7 +213,7 @@ private[sql] trait SQLTestUtils
*/
   protected def stripSparkFilter(df: DataFrame): DataFrame = {
 val schema = df.schema
-val withoutFilters = df.queryExecution.sparkPlan transform {
+val withoutFilters = df.queryExecution.sparkPlan.transform {
   case FilterExec(_, child) => child
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d7c75556/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
index c025c12..c463bc8 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
@@ -17,13 +17,12 @@
 
 package org.apache.spark.sql.hive.orc
 
-import org.apache.hadoop.hive.common.`type`.{HiveChar, HiveDecimal, HiveVarchar}
 import org.apache.hadoop.hive.ql.io.sarg.{SearchArgument, SearchArgumentFactory}
 import org.apache.hadoop.hive.ql.io.sarg.SearchArgument.Builder
-import org.apache.hadoop.hive.serde2.io.DateWritable
 
 import org.apache.spark.internal.Logging
 import 

spark git commit: [SPARK-14512] [DOC] Add python example for QuantileDiscretizer

2016-05-06 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master fa928ff9a -> 76ad04d9a


[SPARK-14512] [DOC] Add python example for QuantileDiscretizer

## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer

## How was this patch tested?
manual tests
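
For reference, the new example can be run locally with `bin/spark-submit examples/src/main/python/ml/quantile_discretizer_example.py` against a Spark build that includes the examples module; this invocation is an assumption based on how the other Python examples are typically run, not something stated in the commit.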

Author: Zheng RuiFeng 

Closes #12281 from zhengruifeng/discret_pe.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/76ad04d9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/76ad04d9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/76ad04d9

Branch: refs/heads/master
Commit: 76ad04d9a0a7d4dfb762318d9c7be0d7720f4e1a
Parents: fa928ff
Author: Zheng RuiFeng 
Authored: Fri May 6 10:47:13 2016 -0700
Committer: Davies Liu 
Committed: Fri May 6 10:47:13 2016 -0700

--
 docs/ml-features.md |  9 +
 .../python/ml/quantile_discretizer_example.py   | 39 
 2 files changed, 48 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/76ad04d9/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 0b8f2d7..237e93a 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1118,6 +1118,15 @@ for more details on the API.
 
 {% include_example 
java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java %}
 
+
+
+
+Refer to the [QuantileDiscretizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.QuantileDiscretizer)
+for more details on the API.
+
+{% include_example python/ml/quantile_discretizer_example.py %}
+
+
 
 
 # Feature Selectors

http://git-wip-us.apache.org/repos/asf/spark/blob/76ad04d9/examples/src/main/python/ml/quantile_discretizer_example.py
--
diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py 
b/examples/src/main/python/ml/quantile_discretizer_example.py
new file mode 100644
index 000..6ae7bb1
--- /dev/null
+++ b/examples/src/main/python/ml/quantile_discretizer_example.py
@@ -0,0 +1,39 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import QuantileDiscretizer
+# $example off$
+from pyspark.sql import SparkSession
+
+
+if __name__ == "__main__":
+    spark = SparkSession.builder.appName("PythonQuantileDiscretizerExample").getOrCreate()
+
+    # $example on$
+    data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)]
+    dataFrame = spark.createDataFrame(data, ["id", "hour"])
+
+    discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
+
+    result = discretizer.fit(dataFrame).transform(dataFrame)
+    result.show()
+    # $example off$
+
+    spark.stop()





spark git commit: [SPARK-14512] [DOC] Add python example for QuantileDiscretizer

2016-05-06 Thread davies
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 1ee621b1d -> 3f6a13c8a


[SPARK-14512] [DOC] Add python example for QuantileDiscretizer

## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer

## How was this patch tested?
manual tests

Author: Zheng RuiFeng 

Closes #12281 from zhengruifeng/discret_pe.

(cherry picked from commit 76ad04d9a0a7d4dfb762318d9c7be0d7720f4e1a)
Signed-off-by: Davies Liu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3f6a13c8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3f6a13c8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3f6a13c8

Branch: refs/heads/branch-2.0
Commit: 3f6a13c8a49e15c5f88415837f49f8e81092177b
Parents: 1ee621b
Author: Zheng RuiFeng 
Authored: Fri May 6 10:47:13 2016 -0700
Committer: Davies Liu 
Committed: Fri May 6 10:47:36 2016 -0700

--
 docs/ml-features.md |  9 +
 .../python/ml/quantile_discretizer_example.py   | 39 
 2 files changed, 48 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3f6a13c8/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 0b8f2d7..237e93a 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1118,6 +1118,15 @@ for more details on the API.
 
 {% include_example 
java/org/apache/spark/examples/ml/JavaQuantileDiscretizerExample.java %}
 
+
+
+
+Refer to the [QuantileDiscretizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.QuantileDiscretizer)
+for more details on the API.
+
+{% include_example python/ml/quantile_discretizer_example.py %}
+
+
 
 
 # Feature Selectors

http://git-wip-us.apache.org/repos/asf/spark/blob/3f6a13c8/examples/src/main/python/ml/quantile_discretizer_example.py
--
diff --git a/examples/src/main/python/ml/quantile_discretizer_example.py 
b/examples/src/main/python/ml/quantile_discretizer_example.py
new file mode 100644
index 000..6ae7bb1
--- /dev/null
+++ b/examples/src/main/python/ml/quantile_discretizer_example.py
@@ -0,0 +1,39 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import QuantileDiscretizer
+# $example off$
+from pyspark.sql import SparkSession
+
+
+if __name__ == "__main__":
+    spark = SparkSession.builder.appName("PythonQuantileDiscretizerExample").getOrCreate()
+
+    # $example on$
+    data = [(0, 18.0,), (1, 19.0,), (2, 8.0,), (3, 5.0,), (4, 2.2,)]
+    dataFrame = spark.createDataFrame(data, ["id", "hour"])
+
+    discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
+
+    result = discretizer.fit(dataFrame).transform(dataFrame)
+    result.show()
+    # $example off$
+
+    spark.stop()





spark git commit: [SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC

2016-05-06 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master a03c5e68a -> fa928ff9a


[SPARK-14962][SQL] Do not push down isnotnull/isnull on unsupported types in ORC

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14962

ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`.

Building these filters (`SearchArgument`s) appears to work on the Spark side because `IsNull` and `IsNotNull` do not take a type as an argument in Hive 1.2.x, but the filters do not evaluate correctly on the ORC side: the stored column statistics always produce `null` for unsupported types (e.g. `ArrayType`), so `IsNull` always evaluates to `true` and `IsNotNull` therefore always evaluates to `false`. (Please see 
[RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318)
 and 
[RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365)
 in Hive 1.2.)

Hive 1.3.x and later prevent this by requiring a type 
([`PredicateLeaf.Type`](https://github.com/apache/hive/blob/e085b7e9bd059d91aaf013df0db4d71dca90ec6f/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java#L50-L56))
 when building a filter 
([`SearchArgument`](https://github.com/apache/hive/blob/26b5c7b56a4f28ce3eabc0207566cce46b29b558/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java#L260)),
 but Hive 1.2.x does not.

This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types, so that `OrcFilters` resembles `ParquetFilters`.

## How was this patch tested?

Unit tests in `OrcQuerySuite` and `OrcFilterSuite`, plus `sbt scalastyle`.

Author: hyukjinkwon 
Author: Hyukjin Kwon 

Closes #12777 from HyukjinKwon/SPARK-14962.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa928ff9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa928ff9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa928ff9

Branch: refs/heads/master
Commit: fa928ff9a3c1de5d5aff9d14e6bc1bd03fcca087
Parents: a03c5e6
Author: hyukjinkwon 
Authored: Sat May 7 01:46:45 2016 +0800
Committer: Cheng Lian 
Committed: Sat May 7 01:46:45 2016 +0800

--
 .../apache/spark/sql/test/SQLTestUtils.scala|  2 +-
 .../apache/spark/sql/hive/orc/OrcFilters.scala  | 63 
 .../apache/spark/sql/hive/orc/OrcRelation.scala | 19 ++---
 .../spark/sql/hive/orc/OrcFilterSuite.scala | 75 
 .../spark/sql/hive/orc/OrcQuerySuite.scala  | 14 
 .../spark/sql/hive/orc/OrcSourceSuite.scala |  9 ++-
 6 files changed, 126 insertions(+), 56 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fa928ff9/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
index ffb206a..6d2b95e 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala
@@ -213,7 +213,7 @@ private[sql] trait SQLTestUtils
*/
   protected def stripSparkFilter(df: DataFrame): DataFrame = {
 val schema = df.schema
-val withoutFilters = df.queryExecution.sparkPlan transform {
+val withoutFilters = df.queryExecution.sparkPlan.transform {
   case FilterExec(_, child) => child
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/fa928ff9/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
index c025c12..c463bc8 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
@@ -17,13 +17,12 @@
 
 package org.apache.spark.sql.hive.orc
 
-import org.apache.hadoop.hive.common.`type`.{HiveChar, HiveDecimal, HiveVarchar}
 import org.apache.hadoop.hive.ql.io.sarg.{SearchArgument, SearchArgumentFactory}
 import org.apache.hadoop.hive.ql.io.sarg.SearchArgument.Builder
-import org.apache.hadoop.hive.serde2.io.DateWritable
 
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.sources._
+import org.apache.spark.sql.types._
 
 /**
  * Helper object for building ORC `SearchArgument`s, 

spark git commit: [SPARK-14738][BUILD] Separate docker integration tests from main build

2016-05-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 42f2ee6c5 -> 1ee621b1d


[SPARK-14738][BUILD] Separate docker integration tests from main build

## What changes were proposed in this pull request?

- Create a Maven profile for executing the Docker integration tests with Maven
- Remove the Docker integration tests from the main sbt build
- Update the documentation on how to run the Docker integration tests from sbt

## How was this patch tested?

Manual test of the Docker integration tests, for example:

mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test

## Other comments

Note that the DB2 Docker tests are still disabled: there is a kernel version issue on the AMPLab Jenkins slaves, and we would need to get them to the right level before enabling those tests. They do run OK locally with the updates from PR #12348.

Author: Luciano Resende 

Closes #12508 from lresende/docker.

(cherry picked from commit a03c5e68abd8c066c97ebd33070d59dce1a7)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ee621b1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ee621b1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ee621b1

Branch: refs/heads/branch-2.0
Commit: 1ee621b1d949ce8e1bb41ef3fe19dfaad4a90ab1
Parents: 42f2ee6
Author: Luciano Resende 
Authored: Fri May 6 12:25:45 2016 +0100
Committer: Sean Owen 
Committed: Fri May 6 14:24:06 2016 +0100

--
 docs/building-spark.md  | 12 
 .../apache/spark/sql/jdbc/MySQLIntegrationSuite.scala   |  3 ---
 .../apache/spark/sql/jdbc/OracleIntegrationSuite.scala  |  5 +
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala   |  3 ---
 pom.xml |  8 +++-
 project/SparkBuild.scala|  3 ++-
 6 files changed, 22 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1ee621b1/docs/building-spark.md
--
diff --git a/docs/building-spark.md b/docs/building-spark.md
index fec442a..13c95e4 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -190,6 +190,18 @@ or
 Java 8 tests are automatically enabled when a Java 8 JDK is detected.
 If you have JDK 8 installed but it is not the system default, you can set 
JAVA_HOME to point to JDK 8 before running the tests.
 
+# Running Docker based Integration Test Suites
+
+Running only docker based integration tests and nothing else.
+
+mvn install -DskipTests
+mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11
+
+or
+
+sbt docker-integration-tests/test
+
+
 # Packaging without Hadoop Dependencies for YARN
 
 The assembly directory produced by `mvn package` will, by default, include all 
of Spark's dependencies, including Hadoop and some of its ecosystem projects. 
On YARN deployments, this causes multiple versions of these to appear on 
executor classpaths: the version packaged in the Spark assembly and the version 
on each node, included with `yarn.application.classpath`.  The 
`hadoop-provided` profile builds the assembly without including 
Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.

http://git-wip-us.apache.org/repos/asf/spark/blob/1ee621b1/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
--
diff --git 
a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
 
b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
index aa47228..a70ed98 100644
--- 
a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
+++ 
b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
@@ -21,12 +21,9 @@ import java.math.BigDecimal
 import java.sql.{Connection, Date, Timestamp}
 import java.util.Properties
 
-import org.scalatest.Ignore
-
 import org.apache.spark.tags.DockerTest
 
 @DockerTest
-@Ignore
 class MySQLIntegrationSuite extends DockerJDBCIntegrationSuite {
   override val db = new DatabaseOnDocker {
 override val imageName = "mysql:5.7.9"

http://git-wip-us.apache.org/repos/asf/spark/blob/1ee621b1/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
--
diff --git 
a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 

spark git commit: [SPARK-14738][BUILD] Separate docker integration tests from main build

2016-05-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 157a49aa4 -> a03c5e68a


[SPARK-14738][BUILD] Separate docker integration tests from main build

## What changes were proposed in this pull request?

- Create a Maven profile for executing the Docker integration tests with Maven
- Remove the Docker integration tests from the main sbt build
- Update the documentation on how to run the Docker integration tests from sbt

## How was this patch tested?

Manual test of the Docker integration tests, for example:

mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 compile test

## Other comments

Note that the DB2 Docker tests are still disabled: there is a kernel version issue on the AMPLab Jenkins slaves, and we would need to get them to the right level before enabling those tests. They do run OK locally with the updates from PR #12348.

Author: Luciano Resende 

Closes #12508 from lresende/docker.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a03c5e68
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a03c5e68
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a03c5e68

Branch: refs/heads/master
Commit: a03c5e68abd8c066c97ebd33070d59dce1a7
Parents: 157a49a
Author: Luciano Resende 
Authored: Fri May 6 12:25:45 2016 +0100
Committer: Sean Owen 
Committed: Fri May 6 12:25:45 2016 +0100

--
 docs/building-spark.md  | 12 
 .../apache/spark/sql/jdbc/MySQLIntegrationSuite.scala   |  3 ---
 .../apache/spark/sql/jdbc/OracleIntegrationSuite.scala  |  5 +
 .../spark/sql/jdbc/PostgresIntegrationSuite.scala   |  3 ---
 pom.xml |  8 +++-
 project/SparkBuild.scala|  3 ++-
 6 files changed, 22 insertions(+), 12 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a03c5e68/docs/building-spark.md
--
diff --git a/docs/building-spark.md b/docs/building-spark.md
index fec442a..13c95e4 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -190,6 +190,18 @@ or
 Java 8 tests are automatically enabled when a Java 8 JDK is detected.
 If you have JDK 8 installed but it is not the system default, you can set 
JAVA_HOME to point to JDK 8 before running the tests.
 
+# Running Docker based Integration Test Suites
+
+Running only docker based integration tests and nothing else.
+
+mvn install -DskipTests
+mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11
+
+or
+
+sbt docker-integration-tests/test
+
+
 # Packaging without Hadoop Dependencies for YARN
 
 The assembly directory produced by `mvn package` will, by default, include all 
of Spark's dependencies, including Hadoop and some of its ecosystem projects. 
On YARN deployments, this causes multiple versions of these to appear on 
executor classpaths: the version packaged in the Spark assembly and the version 
on each node, included with `yarn.application.classpath`.  The 
`hadoop-provided` profile builds the assembly without including 
Hadoop-ecosystem projects, like ZooKeeper and Hadoop itself.

http://git-wip-us.apache.org/repos/asf/spark/blob/a03c5e68/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
--
diff --git 
a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
 
b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
index aa47228..a70ed98 100644
--- 
a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
+++ 
b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala
@@ -21,12 +21,9 @@ import java.math.BigDecimal
 import java.sql.{Connection, Date, Timestamp}
 import java.util.Properties
 
-import org.scalatest.Ignore
-
 import org.apache.spark.tags.DockerTest
 
 @DockerTest
-@Ignore
 class MySQLIntegrationSuite extends DockerJDBCIntegrationSuite {
   override val db = new DatabaseOnDocker {
 override val imageName = "mysql:5.7.9"

http://git-wip-us.apache.org/repos/asf/spark/blob/a03c5e68/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
--
diff --git 
a/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 
b/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
index 357866b..c5e1f86 100644

spark git commit: [SPARK-14915] Fix incorrect resolution of merge conflict in commit bf3c0608f1779b4dd837b8289ec1d4516e145aea

2016-05-06 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-1.6 bf3c0608f -> a3aa22a59


[SPARK-14915] Fix incorrect resolution of merge conflict in commit 
bf3c0608f1779b4dd837b8289ec1d4516e145aea

## What changes were proposed in this pull request?

I botched the back-port of SPARK-14915 to branch-1.6 in 
https://github.com/apache/spark/commit/bf3c0608f1779b4dd837b8289ec1d4516e145aea, 
which resulted in a code block being added twice. This change simply removes the 
duplicate block, so that the net change is the intended one.

## How was this patch tested?

Jenkins tests. (In theory, this has already been tested.)

Author: Sean Owen 

Closes #12950 from srowen/SPARK-14915.2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a3aa22a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a3aa22a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a3aa22a5

Branch: refs/heads/branch-1.6
Commit: a3aa22a5915c2cc6bdd6810227a3698c59823009
Parents: bf3c060
Author: Sean Owen 
Authored: Fri May 6 12:21:25 2016 +0100
Committer: Sean Owen 
Committed: Fri May 6 12:21:25 2016 +0100

--
 .../scala/org/apache/spark/scheduler/TaskSetManager.scala   | 9 -
 1 file changed, 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a3aa22a5/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala 
b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
index 3ca701d..77a8a19 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
@@ -730,15 +730,6 @@ private[spark] class TaskSetManager(
   addPendingTask(index)
 }
 
-if (successful(index)) {
-  logInfo(
-s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " +
-"but another instance of the task has already succeeded, " +
-"so not re-queuing the task to be re-executed.")
-} else {
-  addPendingTask(index)
-}
-
 if (!isZombie && state != TaskState.KILLED
 && reason.isInstanceOf[TaskFailedReason]
 && reason.asInstanceOf[TaskFailedReason].countTowardsTaskFailures) {

