[GitHub] spark pull request #19512: [SPARK-21551][Python] Increase timeout for Python...
Github user FRosner closed the pull request at: https://github.com/apache/spark/pull/19512 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 LGTM, merging to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18747: [SPARK-20822][SQL] Generate code to directly get value f...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18747 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145611544 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) + +if schema is None or isinstance(schema, list): +batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False) + for pdf_slice in pdf_slices] + +# There will be at least 1 batch after slicing the pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) + +# If passed schema as a list of names then rename fields +if isinstance(schema, list): +fields = [] +for i, field in enumerate(schema_from_arrow): +field.name = schema[i] +fields.append(field) +schema = StructType(fields) +else: +schema = schema_from_arrow +else: +batches = [] +for i, pdf_slice in enumerate(pdf_slices): + +# convert to series to pyarrow.Arrays to use mask when creating Arrow batches +arrs = [] +names = [] +for c, (_, series) in enumerate(pdf_slice.iteritems()): +field = schema[c] +names.append(field.name) +t = to_arrow_type(field.dataType) +try: +# NOTE: casting is not necessary with Arrow >= 0.7 + arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t), + mask=series.isnull(), type=t)) +except ValueError as e: +warnings.warn("Arrow will not be used in createDataFrame: %s" % str(e)) +return None +batches.append(pa.RecordBatch.from_arrays(arrs, names)) + +# Verify schema of first batch, return None if not equal and fallback without Arrow +if i == 0: +schema_from_arrow = from_arrow_schema(batches[i].schema) +if schema != schema_from_arrow: +warnings.warn("Arrow will not be used in createDataFrame.\n" + --- End diff -- If we won't reach the block, I think we can simplify `_createFromPandasWithArrow()` like https://github.com/BryanCutler/spark/pull/28. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
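As a side note on the slicing logic quoted in the diff above: the `-(-len(pdf) // parallelism)` expression is integer ceiling division. A minimal standalone sketch (plain pandas, with an illustrative helper name — not the pyspark code itself):

```python
import pandas as pd

def slice_into_batches(pdf, parallelism):
    # -(-a // b) is ceiling division, so every row is covered even when
    # len(pdf) is not an exact multiple of `parallelism`.
    step = -(-len(pdf) // parallelism)
    return [pdf[start:start + step] for start in range(0, len(pdf), step)]

pdf = pd.DataFrame({"a": range(10)})
print([len(s) for s in slice_into_batches(pdf, 4)])  # [3, 3, 3, 1]
```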
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19508 ping @cloud-fan for final check. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82900/testReport)** for PR 19524 at commit [`c56793b`](https://github.com/apache/spark/commit/c56793bac07493372f9bb360f6964032641c4867). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19531 **[Test build #82899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82899/testReport)** for PR 19531 at commit [`b30de47`](https://github.com/apache/spark/commit/b30de470a11ca3f360260a8a36bc1e5eb4f355e8). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class JoinEstimation(join: Join) extends Logging` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19514 **[Test build #82901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82901/consoleFull)** for PR 19514 at commit [`7ec2dc7`](https://github.com/apache/spark/commit/7ec2dc70ce26ff754c0ea38f3cf9964d67bc62f7). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...
Github user jomach closed the pull request at: https://github.com/apache/spark/pull/19485 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19534 **[Test build #82903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82903/testReport)** for PR 19534 at commit [`f8fcc35`](https://github.com/apache/spark/commit/f8fcc3560e087440c7618b33cc892f3feafd4a3a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145603645 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) + +if schema is None or isinstance(schema, list): +batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False) + for pdf_slice in pdf_slices] + +# There will be at least 1 batch after slicing the pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) + +# If passed schema as a list of names then rename fields +if isinstance(schema, list): +fields = [] +for i, field in enumerate(schema_from_arrow): +field.name = schema[i] +fields.append(field) +schema = StructType(fields) +else: +schema = schema_from_arrow +else: +batches = [] +for i, pdf_slice in enumerate(pdf_slices): + +# convert to series to pyarrow.Arrays to use mask when creating Arrow batches +arrs = [] +names = [] +for c, (_, series) in enumerate(pdf_slice.iteritems()): +field = schema[c] +names.append(field.name) +t = to_arrow_type(field.dataType) +try: +# NOTE: casting is not necessary with Arrow >= 0.7 + arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t), + mask=series.isnull(), type=t)) +except ValueError as e: +warnings.warn("Arrow will not be used in createDataFrame: %s" % str(e)) +return None +batches.append(pa.RecordBatch.from_arrays(arrs, names)) + +# Verify schema of first batch, return None if not equal and fallback without Arrow +if i == 0: +schema_from_arrow = from_arrow_schema(batches[i].schema) +if schema != schema_from_arrow: +warnings.warn("Arrow will not be used in createDataFrame.\n" + --- End diff -- Will we reach this block? I guess not, because all data types are cast to the types specified by the schema; otherwise an exception like `ValueError` is raised and we fall back to without-Arrow. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19508 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82897/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19508 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19521 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19521 Thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145602871 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. + * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- I see. Let us make a follow-up PR in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/19534 cc - @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19534: [SPARK-22312][CORE] Fix bug in Executor allocatio...
GitHub user sitalkedia opened a pull request: https://github.com/apache/spark/pull/19534 [SPARK-22312][CORE] Fix bug in Executor allocation manager in running… ## What changes were proposed in this pull request? We often see Spark jobs get stuck because, when dynamic allocation is turned on, the Executor Allocation Manager does not ask for any executors even though there are pending tasks. Looking at the logic in the Executor Allocation Manager that calculates the running tasks, the calculation can go wrong and the number of running tasks can become negative. ## How was this patch tested? Added a unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sitalkedia/spark skedia/fix_stuck_job Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19534.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19534 commit 4f0cffa42c828d3f49e983dae8b2188b78036fcc Author: Sital Kedia Date: 2017-10-19T05:24:38Z [SPARK-22312][CORE] Fix bug in Executor allocation manager in running tasks calculation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
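For intuition only (the PR description above does not quote the faulty code, and this is not the actual ExecutorAllocationManager logic): a running-task counter can go negative when task-end events are processed after the per-stage bookkeeping has already been cleared, for example:

```python
# Hypothetical illustration of the failure mode, not Spark code.
running = {}  # stage id -> number of running tasks

def on_task_start(stage):
    running[stage] = running.get(stage, 0) + 1

def on_stage_completed(stage):
    running.pop(stage, None)  # per-stage bookkeeping is dropped here

def on_task_end(stage):
    running[stage] = running.get(stage, 0) - 1  # a late event still decrements

on_task_start(1)
on_stage_completed(1)
on_task_end(1)
print(sum(running.values()))  # -1: the total running-task count is now negative
```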
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82902/testReport)** for PR 19269 at commit [`b065003`](https://github.com/apache/spark/commit/b065003f2b68e05218fe4bf4079bd6a8c6afcc22). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527 Benchmark against the multi-column one-hot encoder. Multi-Col, Multiple Run: the first commit; runs a separate `treeAggregate` per column. Multi-Col, Single Run: runs one `treeAggregate` over all columns, see the suggestion at https://github.com/apache/spark/pull/19527#discussion_r145457081.

Fitting (seconds, average of 10 runs):

numColumns | Multi-Col, Multiple Run | Multi-Col, Single Run
-- | -- | --
1 | 0.1100363843003 | 0.1296882409998
100 | 3.687933463507 | 0.3643889783995
1000 | 90.3695017947 | 2.4687475008

Transforming (seconds, average of 10 runs):

numColumns | Multi-Col, Multiple Run | Multi-Col, Single Run
-- | -- | --
1 | 0.1408046101999 | 0.1434849307
100 | 0.3636357813 | 0.4145960696996
1000 | 3.1933874685 | 2.8026313985

Benchmark code:

```scala
import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 1
val m = 1000
val rows = sc.parallelize(1 to n).map(i => Row(Array.fill(m)(random.nextInt(1000)): _*))
val struct = new StructType(Array.range(0, m, 1).map(i => StructField(s"c$i", IntegerType, true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()

val inputCols = Array.range(0, m, 1).map(i => s"c$i")
val outputCols = Array.range(0, m, 1).map(i => s"c${i}_encoded")
val encoder = new OneHotEncoderEstimator().setInputCols(inputCols).setOutputCols(outputCols)

var durationFitting = 0.0
var durationTransforming = 0.0
for (i <- 0 until 10) {
  val startFitting = System.nanoTime()
  val model = encoder.fit(df)
  val endFitting = System.nanoTime()
  durationFitting += (endFitting - startFitting) / 1e9

  val startTransforming = System.nanoTime()
  model.transform(df).count
  val endTransforming = System.nanoTime()
  durationTransforming += (endTransforming - startTransforming) / 1e9
}
println(s"fitting: ${durationFitting / 10}")
println(s"transforming: ${durationTransforming / 10}")
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145601384 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. + * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- ok maybe leave it as a followup for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19508 **[Test build #82897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82897/testReport)** for PR 19508 at commit [`25003cf`](https://github.com/apache/spark/commit/25003cfd324071fed5eacc0fde6420f81516bcea). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145597733 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -42,6 +43,13 @@ object ArrowUtils { case StringType => ArrowType.Utf8.INSTANCE case BinaryType => ArrowType.Binary.INSTANCE case DecimalType.Fixed(precision, scale) => new ArrowType.Decimal(precision, scale) +case DateType => new ArrowType.Date(DateUnit.DAY) +case TimestampType => + timeZoneId match { +case Some(id) => new ArrowType.Timestamp(TimeUnit.MICROSECOND, id) --- End diff -- Looking at the read side, the timezone id here is not used? We always convert to the Python local timezone. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145597432 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -42,6 +43,13 @@ object ArrowUtils { case StringType => ArrowType.Utf8.INSTANCE case BinaryType => ArrowType.Binary.INSTANCE case DecimalType.Fixed(precision, scale) => new ArrowType.Decimal(precision, scale) +case DateType => new ArrowType.Date(DateUnit.DAY) +case TimestampType => + timeZoneId match { +case Some(id) => new ArrowType.Timestamp(TimeUnit.MICROSECOND, id) --- End diff -- Here we are mapping Spark SQL timestamp type to Arrow timestamp with timezone type. Does Arrow have tz-naive timestamp type? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
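For reference on the question above: Arrow models the timezone as an optional property of the timestamp type, so a tz-naive timestamp does exist — in the Java API quoted in the diff this would presumably correspond to passing a null timezone string. A quick pyarrow sketch of the two variants:

```python
import pyarrow as pa

# The timezone is an optional field of the Arrow timestamp type.
naive = pa.timestamp("us")            # timestamp[us]         -> tz-naive
aware = pa.timestamp("us", tz="UTC")  # timestamp[us, tz=UTC] -> tz-aware
print(naive, aware, naive.equals(aware))  # the two types compare as different
```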
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19530 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82898/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19512 @FRosner, this was backported into branch-2.2, but it can't be automatically closed for some reason. Could you close this one manually, please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19530 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19512 Thanks. Merged to branch-2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19530 **[Test build #82898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82898/testReport)** for PR 19530 at commit [`5046240`](https://github.com/apache/spark/commit/504624017260f5f170502462316055f1c4709e1f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19512 Seems fine to backport into 2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19514 **[Test build #82901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82901/consoleFull)** for PR 19514 at commit [`7ec2dc7`](https://github.com/apache/spark/commit/7ec2dc70ce26ff754c0ea38f3cf9964d67bc62f7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19512 @rxin, it looks like the PR was merged into master by you. Do you think it's okay to backport it to other branches too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19514 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19533: Merge pull request #1 from apache/master
Github user BiggerBrain closed the pull request at: https://github.com/apache/spark/pull/19533 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19533: Merge pull request #1 from apache/master
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19533 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19533: Merge pull request #1 from apache/master
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19533 @BiggerBrain, it looks like this was opened by mistake. Could you close this, please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19533: Merge pull request #1 from apache/master
Github user BiggerBrain commented on the issue: https://github.com/apache/spark/pull/19533 get one commit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19533: Merge pull request #1 from apache/master
GitHub user BiggerBrain opened a pull request: https://github.com/apache/spark/pull/19533 Merge pull request #1 from apache/master get ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/BiggerBrain/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19533.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19533 commit 40294b363d31a3075a4fb1c1905dee757333c654 Author: æä¸é Date: 2017-10-18T08:29:29Z Merge pull request #1 from apache/master get --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 nope, using lazy val initialization won't work - at the very least, UnsafeKryoSerializerSuite modifies conf before context construction --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19532: [CORE]stage api modify the description format, add versi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19532 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19524: [SPARK-22302][INFRA] Remove manual backports for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19524#discussion_r145593044 --- Diff: dev/run-tests --- @@ -20,4 +20,10 @@ FWDIR="$(cd "`dirname $0`"/..; pwd)" cd "$FWDIR" +PYTHON_VERSION_CHECK=$(python -c 'import sys; print(sys.version_info < (2, 7, 0))') --- End diff -- @holdenk and @shaneknapp, it looks like I can't just check this within the Python scripts. As we know from the previous issue, `dev/run-tests.py` is not compatible with Python 2.6.x due to dictionary comprehension syntax:

```
python2.6 dev/run-tests.py
```
```
  File "dev/run-tests.py", line 128
    {m: set(m.dependencies).intersection(modules_to_test) for m in modules_to_test}, sort=True)
                                                          ^
SyntaxError: invalid syntax
```

I tried to change this as below:

```
git diff dev/run-tests.py
```
```diff
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 72d148d7ea0..dcec912ae23 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -27,6 +27,10 @@ import sys
 import subprocess
 from collections import namedtuple
 
+if sys.version_info < (2, 7):
+    print("[error] Python versions prior to 2.7 are not supported.")
+    sys.exit(-1)
+
 from sparktestsupport import SPARK_HOME, USER_HOME, ERROR_CODES
 from sparktestsupport.shellutils import exit_from_command_with_retcode, run_cmd, rm_r, which
 from sparktestsupport.toposort import toposort_flatten, toposort
```

but it still fails with the same syntax error, because the whole file is parsed before the version check runs:

```
python2.6 dev/run-tests.py
```
```
  File "dev/run-tests.py", line 128
    {m: set(m.dependencies).intersection(modules_to_test) for m in modules_to_test}, sort=True)
                                                          ^
SyntaxError: invalid syntax
```

I don't think we should work around the Python 2.6 syntax error, since Python 2.6 support was dropped. So, I just tried to fix both `dev/run-tests` and `dev/run-tests-jenkins` as proposed currently. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19529 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82895/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19529 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19532: [CORE]stage api modify the description format, ad...
GitHub user guoxiaolongzte opened a pull request: https://github.com/apache/spark/pull/19532 [CORE]stage api modify the description format, add version api, modify the duration real-time calculation ## What changes were proposed in this pull request? Stage API: fix the description format, so that "A list of all stages for a given application" also includes the note "?status=[active|complete|pending|failed] list only stages in the state". Add an API doc for the version endpoint '/api/v1/version'. Calculate the duration in real time for running applications. ## How was this patch tested? Manual tests. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/guoxiaolongzte/spark SPARK-22311 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19532.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19532 commit 8f53eceb9ed3c33388cef09f628dfb7e4f6de70d Author: guoxiaolong Date: 2017-10-19T03:15:13Z [CORE]stage api modify the description format, add version api, modify the duration real-time calculation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82900/testReport)** for PR 19524 at commit [`c56793b`](https://github.com/apache/spark/commit/c56793bac07493372f9bb360f6964032641c4867). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19529 **[Test build #82895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82895/testReport)** for PR 19529 at commit [`802a958`](https://github.com/apache/spark/commit/802a958b640067b99fda0b2c8587dea5b8000495). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19531 **[Test build #82899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82899/testReport)** for PR 19531 at commit [`b30de47`](https://github.com/apache/spark/commit/b30de470a11ca3f360260a8a36bc1e5eb4f355e8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19531 cc @cloud-fan @gatorsmile @ron8hu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19531: [SPARK-22310] [SQL] Refactor join estimation to i...
GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/19531 [SPARK-22310] [SQL] Refactor join estimation to incorporate estimation logic for different kinds of statistics ## What changes were proposed in this pull request? The current join estimation logic is only based on basic column statistics (such as ndv, etc). If we want to add estimation for other kinds of statistics (such as histograms), it's not easy to incorporate into the current algorithm: 1. When we have multiple pairs of join keys, the current algorithm computes cardinality in a single formula. But if different join keys have different kinds of stats, the computation logic for each pair of join keys become different, so the previous formula does not apply. 2. Currently it computes cardinality and updates join keys' column stats separately. It's better to do these two steps together, since both computation and update logic are different for different kinds of stats. ## How was this patch tested? Only refactor, covered by existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wzhfy/spark join_est_refactor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19531.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19531 commit b30de470a11ca3f360260a8a36bc1e5eb4f355e8 Author: Zhenhua Wang Date: 2017-10-19T02:45:53Z refactor --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
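To make the motivation above concrete, here is an illustrative pseudocode sketch (not Spark's `JoinEstimation` code; the names and stats layout are made up) of the shape such a refactor enables — for each pair of join keys, dispatch to an estimator based on the statistics available instead of plugging everything into one global formula:

```python
# Illustrative sketch only -- not Spark's JoinEstimation implementation.
def estimate_one_key(left, right):
    """Return (selectivity, joined ndv) for a single pair of join keys."""
    if "histogram" in left and "histogram" in right:
        # A different kind of stats gets a different estimator
        # (e.g. overlap the equi-height buckets of both sides).
        return estimate_from_histograms(left["histogram"], right["histogram"])
    # Basic column stats: matching rows scale by 1 / max(ndv_left, ndv_right).
    selectivity = 1.0 / max(left["ndv"], right["ndv"])
    return selectivity, min(left["ndv"], right["ndv"])

def estimate_from_histograms(left_hist, right_hist):
    raise NotImplementedError  # placeholder for a histogram-based estimator

print(estimate_one_key({"ndv": 100}, {"ndv": 40}))  # (0.01, 40)
```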
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145591347 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19530 **[Test build #82898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82898/testReport)** for PR 19530 at commit [`5046240`](https://github.com/apache/spark/commit/504624017260f5f170502462316055f1c4709e1f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19530: [SPARK-22309][ML] Remove unused param in `LDAMode...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19530 [SPARK-22309][ML] Remove unused param in `LDAModel.getTopicDistributionMethod` & destory `nodeToFeaturesBc` in RandomForest ## What changes were proposed in this pull request? Remove unused param in `LDAModel.getTopicDistributionMethod` destory `nodeToFeaturesBc` in RandomForest ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhengruifeng/spark lda_bc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19530.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19530 commit d4d2cad7d2ab7778f995790d69510d9f5425b1a4 Author: Zheng RuiFeng Date: 2017-10-19T02:42:04Z create pr commit 504624017260f5f170502462316055f1c4709e1f Author: Zheng RuiFeng Date: 2017-10-19T02:49:23Z update pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19508 **[Test build #82897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82897/testReport)** for PR 19508 at commit [`25003cf`](https://github.com/apache/spark/commit/25003cfd324071fed5eacc0fde6420f81516bcea). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 @MLnick Any more comments or thoughts on this I need to address? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19508#discussion_r145588024 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala --- @@ -495,6 +474,8 @@ private[columnar] case object DictionaryEncoding extends CompressionScheme { columnType.dataType match { case _: IntegerType => val dictionaryIds = columnVector.reserveDictionaryIds(capacity) + val intDictionary = dictionary.map(_.asInstanceOf[Int]) --- End diff -- Thanks @kiszk. I checked the comment and code and reverted this change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82894/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19459 **[Test build #82894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82894/testReport)** for PR 19459 at commit [`3052f30`](https://github.com/apache/spark/commit/3052f3063e965d3636dd172a6981d93155b77fd2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145586905 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +/** + * The logical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2(writer: DataSourceV2Writer, query: LogicalPlan) extends LogicalPlan { + override def children: Seq[LogicalPlan] = Seq(query) + override def output: Seq[Attribute] = Nil +} + +/** + * The physical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2Exec(writer: DataSourceV2Writer, query: SparkPlan) extends SparkPlan { + override def children: Seq[SparkPlan] = Seq(query) + override def output: Seq[Attribute] = Nil + + override protected def doExecute(): RDD[InternalRow] = { +val writeTask = writer match { + case w: SupportsWriteInternalRow => w.createInternalRowWriterFactory() + case _ => new RowToInternalRowDataWriterFactory(writer.createWriterFactory(), query.schema) +} + +val rdd = query.execute() +val messages = new Array[WriterCommitMessage](rdd.partitions.length) + +logInfo(s"Start processing data source writer: $writer. 
" + + s"The input RDD has ${messages.length} partitions.") + +try { + sparkContext.runJob( +rdd, +(context: TaskContext, iter: Iterator[InternalRow]) => + DataWritingSparkTask.run(writeTask, context, iter), +rdd.partitions.indices, +(index, message: WriterCommitMessage) => messages(index) = message + ) + + logInfo(s"Data source writer $writer is committing.") + writer.commit(messages) + logInfo(s"Data source writer $writer committed.") +} catch { + case cause: Throwable => +logError(s"Data source writer $writer is aborting.") +try { + writer.abort(messages) +} catch { + case t: Throwable => +logError(s"Data source writer $writer failed to abort.") +cause.addSuppressed(t) +throw new SparkException("Writing job failed.", cause) +} +logError(s"Data source writer $writer aborted.") +throw new SparkException("Writing job aborted.", cause) +} + +sparkContext.emptyRDD + } +} + +object DataWritingSparkTask extends Logging { --- End diff -- This is mainly for log. If we inline these codes to `WriteToDataSourceV2Exec` and make `WriteToDataSourceV2Exec` extends `Logging`, then we have to serialize and send `WriteToDataSourceV2Exec` to executor side for the logging. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19357 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19357 **[Test build #82896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82896/testReport)** for PR 19357 at commit [`4995e3c`](https://github.com/apache/spark/commit/4995e3c703f65788a997f88df10427aca291d70a). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram` * `case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19357 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82896/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19357 **[Test build #82896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82896/testReport)** for PR 19357 at commit [`4995e3c`](https://github.com/apache/spark/commit/4995e3c703f65788a997f88df10427aca291d70a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145579513 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java --- @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import java.io.Serializable; + +import org.apache.spark.annotation.InterfaceStability; + +/** + * A factory of {@link DataWriter} returned by {@link DataSourceV2Writer#createWriterFactory()}, + * which is responsible for creating and initializing the actual data writer at executor side. + * + * Note that, the writer factory will be serialized and sent to executors, then the data writer + * will be created on executors and do the actual writing. So {@link DataWriterFactory} must be + * serializable and {@link DataWriter} doesn't need to be. + */ +@InterfaceStability.Evolving +public interface DataWriterFactory extends Serializable { + + /** + * Returns a data writer to do the actual writing work. + * + * @param stageId The id of the Spark stage that runs the returned writer. + * @param partitionId The id of the RDD partition that the returned writer will process. + * @param attemptNumber The attempt number of the Spark task that runs the returned writer, which + * is usually 0 if the task is not a retried task or a speculative task. + */ + DataWriter createWriter(int stageId, int partitionId, int attemptNumber); --- End diff -- not sure why we have stageId here. I'd make it more generic, e.g. a string for some job id, and then some numeric value (64 bit long) for epoch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145577058 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
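The scaladoc quoted above fully determines the encoding, so a small standalone sketch may help: with dropLast enabled, category index i maps to a sparse vector of size numCategories - 1 with a single 1.0, and the last category maps to all zeros. This mirrors the documented behaviour only; it is not the implementation in the PR, and the helper name is made up here.

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Encode a single category index the way the scaladoc describes.
def encodeCategory(categoryIndex: Int, numCategories: Int, dropLast: Boolean = true): Vector = {
  val size = if (dropLast) numCategories - 1 else numCategories
  if (categoryIndex < size) {
    Vectors.sparse(size, Array(categoryIndex), Array(1.0))        // single one-value at the index
  } else {
    Vectors.sparse(size, Array.empty[Int], Array.empty[Double])   // dropped last category: all zeros
  }
}

// encodeCategory(2, 5) gives [0.0, 0.0, 1.0, 0.0]; encodeCategory(4, 5) gives [0.0, 0.0, 0.0, 0.0].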
[GitHub] spark issue #13252: [SPARK-15473][SQL] CSV data source writes header for emp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13252 It's not resolved yet. I am not working on this for now. Please take it over if you are willing to. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 There is one small hack in the way this was done, which is documented; see the comments and documentation on SharedSparkSession.initializeSession and SharedSparkContext.initializeContext. I would rather just have the session and context be lazy transient vals, which would work fine without this initialize call, but I didn't want to change the way tests currently run without input. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
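For context, a minimal sketch of the lazy transient val alternative mentioned in the comment above, assuming a plain scalatest mixin; the trait name and local-mode configuration are assumptions for illustration, not the code in the PR.

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

trait LazySharedSparkSession extends BeforeAndAfterAll { self: Suite =>

  // Lazy: the session is created the first time a test touches it, so no explicit
  // initializeSession() call is needed. Transient: the field is skipped if the suite
  // is ever serialized.
  @transient protected lazy val spark: SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName(getClass.getSimpleName)
      .getOrCreate()

  override protected def afterAll(): Unit = {
    try {
      spark.stop()
    } finally {
      super.afterAll()
    }
  }
}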
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19505 I meant to ask if others agree with the current change as I could not see the ongoing discussion at that time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 I made my original changes here by using `git mv PlanTest.scala PlanTestBase.scala`, `git mv SQLTestUnit.scala SQLTestUnitBase.scala`, and `git mv SharedSQLContext.scala SharedSparkSession.scala`, and then created new PlanTest, SQLTestUnit, and SharedSQLContext files, under the assumption that most of the code would go in the base, and this would help git provide better history and continuity. I'm not sure if that was the right decision or not - probably it is with PlanTest and SQLTestUnit, possibly not with SharedSQLContext, but since the diff in the PR doesn't reflect the git mv properly, I'm not sure if it will make a difference either way. If reviewers wish me to redo this PR without the initial `git mv`, I'll be happy to. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19529 **[Test build #82895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82895/testReport)** for PR 19529 at commit [`802a958`](https://github.com/apache/spark/commit/802a958b640067b99fda0b2c8587dea5b8000495). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19529: Support alternative unit testing styles in extern...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/19529 Support alternative unit testing styles in external applications ## What changes were proposed in this pull request? Support unit testing of external code (i.e., applications that use Spark) with scalatest styles other than FunSuite. SharedSparkContext already supports this, but SharedSQLContext does not. I've introduced SharedSparkSession as a parent to SharedSQLContext, written in a way that supports all scalatest styles. ## How was this patch tested? Three new unit test suites are added that test using FunSpec, FlatSpec, and WordSpec. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nkronenfeld/spark alternative-style-tests-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19529.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19529 commit b9d41cd79f05f6c420d070ad07cdfa8f853fd461 Author: Nathan Kronenfeld Date: 2017-10-15T03:04:16Z Separate out the portion of SharedSQLContext that requires a FunSuite from the part that works with just any old test suite. commit 0d4bd97247a2d083c7de55663703b38a34298c9c Author: Nathan Kronenfeld Date: 2017-10-15T15:57:09Z Fix typo in trait name commit 83c44f1c24619e906af48180d0aace38587aa88d Author: Nathan Kronenfeld Date: 2017-10-15T15:57:42Z Add simple tests for each non-FunSuite test style commit e460612ec6f36e62d8d21d88c2344378ecba581a Author: Nathan Kronenfeld Date: 2017-10-15T16:20:44Z Document testing possibilities commit 0ee2aadf29b681b23bed356b14038525574204a5 Author: Nathan Kronenfeld Date: 2017-10-18T23:46:44Z Better documentation of testing procedures commit 802a958b640067b99fda0b2c8587dea5b8000495 Author: Nathan Kronenfeld Date: 2017-10-18T23:46:58Z Same initialization issue in SharedSparkContext as is in SharedSparkSession --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
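To make the goal concrete, here is a hedged sketch of the kind of suite this PR is meant to enable: a non-FunSuite scalatest style (WordSpec) mixing in SharedSparkSession. It assumes the shared trait exposes a local SparkSession named `spark` and lives under org.apache.spark.sql.test; the suite itself is illustrative and not part of the PR.

import org.scalatest.WordSpec
import org.apache.spark.sql.test.SharedSparkSession

class WordSpecStyleSuite extends WordSpec with SharedSparkSession {
  "a DataFrame" should {
    "count its rows" in {
      val df = spark.range(3).toDF("value")   // spark comes from SharedSparkSession
      assert(df.count() === 3L)
    }
  }
}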
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/18805 Created https://github.com/luben/zstd-jni/issues/47. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145572702 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +/** + * The logical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2(writer: DataSourceV2Writer, query: LogicalPlan) extends LogicalPlan { + override def children: Seq[LogicalPlan] = Seq(query) + override def output: Seq[Attribute] = Nil +} + +/** + * The physical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2Exec(writer: DataSourceV2Writer, query: SparkPlan) extends SparkPlan { + override def children: Seq[SparkPlan] = Seq(query) + override def output: Seq[Attribute] = Nil + + override protected def doExecute(): RDD[InternalRow] = { +val writeTask = writer match { + case w: SupportsWriteInternalRow => w.createInternalRowWriterFactory() + case _ => new RowToInternalRowDataWriterFactory(writer.createWriterFactory(), query.schema) +} + +val rdd = query.execute() +val messages = new Array[WriterCommitMessage](rdd.partitions.length) + +logInfo(s"Start processing data source writer: $writer. 
" + + s"The input RDD has ${messages.length} partitions.") + +try { + sparkContext.runJob( +rdd, +(context: TaskContext, iter: Iterator[InternalRow]) => + DataWritingSparkTask.run(writeTask, context, iter), +rdd.partitions.indices, +(index, message: WriterCommitMessage) => messages(index) = message + ) + + logInfo(s"Data source writer $writer is committing.") + writer.commit(messages) + logInfo(s"Data source writer $writer committed.") +} catch { + case cause: Throwable => +logError(s"Data source writer $writer is aborting.") +try { + writer.abort(messages) +} catch { + case t: Throwable => +logError(s"Data source writer $writer failed to abort.") +cause.addSuppressed(t) +throw new SparkException("Writing job failed.", cause) +} +logError(s"Data source writer $writer aborted.") +throw new SparkException("Writing job aborted.", cause) +} + +sparkContext.emptyRDD + } +} + +object DataWritingSparkTask extends Logging { --- End diff -- What is the reason we need to create a separate object for this function `run`? Why not moving it to `WriteToDataSourceV2Exec `? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19485 Could it be an option to leave a link in the API doc back to the new page to refer to the options, and remove the option list from the API doc, @gatorsmile and @liancheng? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145569192 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19459 **[Test build #82894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82894/testReport)** for PR 19459 at commit [`3052f30`](https://github.com/apache/spark/commit/3052f3063e965d3636dd172a6981d93155b77fd2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145569379 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark pull request #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19505#discussion_r145568265 --- Diff: python/pyspark/sql/functions.py --- @@ -2192,67 +2208,82 @@ def pandas_udf(f=None, returnType=StringType()): :param f: user-defined function. A python function if used as a standalone function :param returnType: a :class:`pyspark.sql.types.DataType` object -The user-defined function can define one of the following transformations: - -1. One or more `pandas.Series` -> A `pandas.Series` - - This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and - :meth:`pyspark.sql.DataFrame.select`. - The returnType should be a primitive data type, e.g., `DoubleType()`. - The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. - - >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(returnType=StringType()) - ... def to_upper(s): - ... return s.str.upper() - ... - >>> @pandas_udf(returnType="integer") - ... def add_one(x): - ... return x + 1 - ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) - >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ - ... .show() # doctest: +SKIP - +--+--++ - |slen(name)|to_upper(name)|add_one(age)| - +--+--++ - | 8| JOHN DOE| 22| - +--+--++ - -2. A `pandas.DataFrame` -> A `pandas.DataFrame` - - This udf is only used with :meth:`pyspark.sql.GroupedData.apply`. - The returnType should be a :class:`StructType` describing the schema of the returned - `pandas.DataFrame`. - - >>> df = spark.createDataFrame( - ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], - ... ("id", "v")) - >>> @pandas_udf(returnType=df.schema) - ... def normalize(pdf): - ... v = pdf.v - ... return pdf.assign(v=(v - v.mean()) / v.std()) - >>> df.groupby('id').apply(normalize).show() # doctest: +SKIP - +---+---+ - | id| v| - +---+---+ - | 1|-0.7071067811865475| - | 1| 0.7071067811865475| - | 2|-0.8320502943378437| - | 2|-0.2773500981126146| - | 2| 1.1094003924504583| - +---+---+ - - .. note:: This type of udf cannot be used with functions such as `withColumn` or `select` - because it defines a `DataFrame` transformation rather than a `Column` - transformation. - - .. seealso:: :meth:`pyspark.sql.GroupedData.apply` +The user-defined function can define the following transformation: + +One or more `pandas.Series` -> A `pandas.Series` + +This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and +:meth:`pyspark.sql.DataFrame.select`. +The returnType should be a primitive data type, e.g., `DoubleType()`. +The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. + +>>> from pyspark.sql.types import IntegerType, StringType +>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) +>>> @pandas_udf(returnType=StringType()) +... def to_upper(s): +... return s.str.upper() +... +>>> @pandas_udf(returnType="integer") +... def add_one(x): +... return x + 1 +... +>>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) +>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ +... .show() # doctest: +SKIP ++--+--++ +|slen(name)|to_upper(name)|add_one(age)| ++--+--++ +| 8| JOHN DOE| 22| ++--+--++ + +.. note:: The user-defined function must be deterministic. 
+""" +return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.PANDAS_UDF) + + +@since(2.3) +def pandas_grouped_udf(f=None, returnType=StructType()): --- End diff -- Post it in another PR https://github.com/apache/spark/pull/19517? This discussion thread will be collapsed when Takuya made a code change. --- ---
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145567644 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19524 Sure, thanks. Let me update soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19485 My only worry is duplication, since we would have another place to update the doc for options. Otherwise it sounds okay to me too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145563004 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java --- @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.annotation.InterfaceStability; + +/** + * A data writer returned by {@link DataWriterFactory#createWriter(int, int, int)} and is + * responsible for writing data for an input RDD partition. + * + * One Spark task has one exclusive data writer, so there is no thread-safe concern. + * + * {@link #write(Object)} is called for each record in the input RDD partition. If one record fails + * the {@link #write(Object)}, {@link #abort()} is called afterwards and the remaining records will + * not be processed. If all records are successfully written, {@link #commit()} is called. + * + * If this data writer successes(all records are successfully written and {@link #commit()} --- End diff -- `successes ` -> `succeeds` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19505#discussion_r145552397 --- Diff: python/pyspark/sql/functions.py --- @@ -2192,67 +2208,82 @@ def pandas_udf(f=None, returnType=StringType()): :param f: user-defined function. A python function if used as a standalone function :param returnType: a :class:`pyspark.sql.types.DataType` object -The user-defined function can define one of the following transformations: - -1. One or more `pandas.Series` -> A `pandas.Series` - - This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and - :meth:`pyspark.sql.DataFrame.select`. - The returnType should be a primitive data type, e.g., `DoubleType()`. - The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. - - >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(returnType=StringType()) - ... def to_upper(s): - ... return s.str.upper() - ... - >>> @pandas_udf(returnType="integer") - ... def add_one(x): - ... return x + 1 - ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) - >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ - ... .show() # doctest: +SKIP - +--+--++ - |slen(name)|to_upper(name)|add_one(age)| - +--+--++ - | 8| JOHN DOE| 22| - +--+--++ - -2. A `pandas.DataFrame` -> A `pandas.DataFrame` - - This udf is only used with :meth:`pyspark.sql.GroupedData.apply`. - The returnType should be a :class:`StructType` describing the schema of the returned - `pandas.DataFrame`. - - >>> df = spark.createDataFrame( - ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], - ... ("id", "v")) - >>> @pandas_udf(returnType=df.schema) - ... def normalize(pdf): - ... v = pdf.v - ... return pdf.assign(v=(v - v.mean()) / v.std()) - >>> df.groupby('id').apply(normalize).show() # doctest: +SKIP - +---+---+ - | id| v| - +---+---+ - | 1|-0.7071067811865475| - | 1| 0.7071067811865475| - | 2|-0.8320502943378437| - | 2|-0.2773500981126146| - | 2| 1.1094003924504583| - +---+---+ - - .. note:: This type of udf cannot be used with functions such as `withColumn` or `select` - because it defines a `DataFrame` transformation rather than a `Column` - transformation. - - .. seealso:: :meth:`pyspark.sql.GroupedData.apply` +The user-defined function can define the following transformation: + +One or more `pandas.Series` -> A `pandas.Series` + +This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and +:meth:`pyspark.sql.DataFrame.select`. +The returnType should be a primitive data type, e.g., `DoubleType()`. +The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. + +>>> from pyspark.sql.types import IntegerType, StringType +>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) +>>> @pandas_udf(returnType=StringType()) +... def to_upper(s): +... return s.str.upper() +... +>>> @pandas_udf(returnType="integer") +... def add_one(x): +... return x + 1 +... +>>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) +>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ +... .show() # doctest: +SKIP ++--+--++ +|slen(name)|to_upper(name)|add_one(age)| ++--+--++ +| 8| JOHN DOE| 22| ++--+--++ + +.. note:: The user-defined function must be deterministic. 
+""" +return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.PANDAS_UDF) + + +@since(2.3) +def pandas_grouped_udf(f=None, returnType=StructType()): --- End diff -- Here is a summary of the current proposal: I. Use only `pandas_udf` -- The main issues with this approach as a few people comment out is that it is hard to know
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82893/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19439 **[Test build #82893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82893/testReport)** for PR 19439 at commit [`d114e40`](https://github.com/apache/spark/commit/d114e4017c06b7a8c1408ef8d289883c990c5046). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19383: [SPARK-20643][core] Add listener implementation to colle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19383 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user imaxxs commented on the issue: https://github.com/apache/spark/pull/18784 I work with @ArtRand and @susanxhuynh. Fine-grained mode has been deprecated for a while. If it is standard procedure to wait until the next release, and if that is 3.0, we should wait until the Spark 3.0 release. @skonto @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19383: [SPARK-20643][core] Add listener implementation to colle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19383 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82890/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19383: [SPARK-20643][core] Add listener implementation to colle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19383 **[Test build #82890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82890/testReport)** for PR 19383 at commit [`ca43746`](https://github.com/apache/spark/commit/ca4374612abdaab7cd1c449e65d87878c68e15d2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19528: [SPARK-20393] [Core] Existing patch applied to 1.6 branc...
Github user ambauma commented on the issue: https://github.com/apache/spark/pull/19528 Understood. Working on porting to 2.0... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19509 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82888/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19509 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19509 **[Test build #82888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82888/testReport)** for PR 19509 at commit [`94223be`](https://github.com/apache/spark/commit/94223beaeffa9793fc1529bafd8a65b4b3185d7a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you, @rxin ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19521 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82892/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18692 **[Test build #82892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82892/testReport)** for PR 18692 at commit [`b69185c`](https://github.com/apache/spark/commit/b69185c2a7466b69a3f244a257449dbf1dd0ee21). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org