[GitHub] spark pull request #19512: [SPARK-21551][Python] Increase timeout for Python...
Github user FRosner closed the pull request at: https://github.com/apache/spark/pull/19512 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 LGTM, merging to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18747: [SPARK-20822][SQL] Generate code to directly get value f...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18747 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145611544 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) + +if schema is None or isinstance(schema, list): +batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False) + for pdf_slice in pdf_slices] + +# There will be at least 1 batch after slicing the pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) + +# If passed schema as a list of names then rename fields +if isinstance(schema, list): +fields = [] +for i, field in enumerate(schema_from_arrow): +field.name = schema[i] +fields.append(field) +schema = StructType(fields) +else: +schema = schema_from_arrow +else: +batches = [] +for i, pdf_slice in enumerate(pdf_slices): + +# convert to series to pyarrow.Arrays to use mask when creating Arrow batches +arrs = [] +names = [] +for c, (_, series) in enumerate(pdf_slice.iteritems()): +field = schema[c] +names.append(field.name) +t = to_arrow_type(field.dataType) +try: +# NOTE: casting is not necessary with Arrow >= 0.7 + arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t), + mask=series.isnull(), type=t)) +except ValueError as e: +warnings.warn("Arrow will not be used in createDataFrame: %s" % str(e)) +return None +batches.append(pa.RecordBatch.from_arrays(arrs, names)) + +# Verify schema of first batch, return None if not equal and fallback without Arrow +if i == 0: +schema_from_arrow = from_arrow_schema(batches[i].schema) +if schema != schema_from_arrow: +warnings.warn("Arrow will not be used in createDataFrame.\n" + --- End diff -- If we won't reach the block, I think we can simplify `_createFromPandasWithArrow()` like https://github.com/BryanCutler/spark/pull/28. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
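As a side note on the slicing logic quoted in the diff above: the `-(-len(pdf) // parallelism)` expression is integer ceiling division. A minimal standalone sketch (plain pandas, with an illustrative helper name — not the pyspark code itself):

```python
import pandas as pd

def slice_into_batches(pdf, parallelism):
    # -(-a // b) is ceiling division, so every row is covered even when
    # len(pdf) is not an exact multiple of `parallelism`.
    step = -(-len(pdf) // parallelism)
    return [pdf[start:start + step] for start in range(0, len(pdf), step)]

pdf = pd.DataFrame({"a": range(10)})
print([len(s) for s in slice_into_batches(pdf, 4)])  # [3, 3, 3, 1]
```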
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19508 ping @cloud-fan for final check. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82900/testReport)** for PR 19524 at commit [`c56793b`](https://github.com/apache/spark/commit/c56793bac07493372f9bb360f6964032641c4867). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19531 **[Test build #82899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82899/testReport)** for PR 19531 at commit [`b30de47`](https://github.com/apache/spark/commit/b30de470a11ca3f360260a8a36bc1e5eb4f355e8). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class JoinEstimation(join: Join) extends Logging` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19514 **[Test build #82901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82901/consoleFull)** for PR 19514 at commit [`7ec2dc7`](https://github.com/apache/spark/commit/7ec2dc70ce26ff754c0ea38f3cf9964d67bc62f7). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19485: [SPARK-20055] [Docs] Added documentation for load...
Github user jomach closed the pull request at: https://github.com/apache/spark/pull/19485 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19534 **[Test build #82903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82903/testReport)** for PR 19534 at commit [`f8fcc35`](https://github.com/apache/spark/commit/f8fcc3560e087440c7618b33cc892f3feafd4a3a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145603645 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,73 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_type, _cast_pandas_series_type +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) + +if schema is None or isinstance(schema, list): +batches = [pa.RecordBatch.from_pandas(pdf_slice, preserve_index=False) + for pdf_slice in pdf_slices] + +# There will be at least 1 batch after slicing the pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) + +# If passed schema as a list of names then rename fields +if isinstance(schema, list): +fields = [] +for i, field in enumerate(schema_from_arrow): +field.name = schema[i] +fields.append(field) +schema = StructType(fields) +else: +schema = schema_from_arrow +else: +batches = [] +for i, pdf_slice in enumerate(pdf_slices): + +# convert to series to pyarrow.Arrays to use mask when creating Arrow batches +arrs = [] +names = [] +for c, (_, series) in enumerate(pdf_slice.iteritems()): +field = schema[c] +names.append(field.name) +t = to_arrow_type(field.dataType) +try: +# NOTE: casting is not necessary with Arrow >= 0.7 + arrs.append(pa.Array.from_pandas(_cast_pandas_series_type(series, t), + mask=series.isnull(), type=t)) +except ValueError as e: +warnings.warn("Arrow will not be used in createDataFrame: %s" % str(e)) +return None +batches.append(pa.RecordBatch.from_arrays(arrs, names)) + +# Verify schema of first batch, return None if not equal and fallback without Arrow +if i == 0: +schema_from_arrow = from_arrow_schema(batches[i].schema) +if schema != schema_from_arrow: +warnings.warn("Arrow will not be used in createDataFrame.\n" + --- End diff -- Will we reach this block? I guess not, because all data types are cast to the types specified by the schema; otherwise an exception like `ValueError` is raised and we fall back to without-Arrow. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19508 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82897/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19508 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19521 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19521 Thanks, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145602871 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. + * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- I see. Let us make a follow-up PR in the future. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19534: [SPARK-22312][CORE] Fix bug in Executor allocation manag...
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/19534 cc - @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19534: [SPARK-22312][CORE] Fix bug in Executor allocatio...
GitHub user sitalkedia opened a pull request: https://github.com/apache/spark/pull/19534 [SPARK-22312][CORE] Fix bug in Executor allocation manager in running… ## What changes were proposed in this pull request? We often see Spark jobs get stuck because, when dynamic allocation is turned on, the Executor Allocation Manager does not ask for any executors even though there are pending tasks. Looking at the logic in the Executor Allocation Manager that calculates the running tasks, the calculation can go wrong and the number of running tasks can become negative. ## How was this patch tested? Added a unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sitalkedia/spark skedia/fix_stuck_job Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19534.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19534 commit 4f0cffa42c828d3f49e983dae8b2188b78036fcc Author: Sital Kedia Date: 2017-10-19T05:24:38Z [SPARK-22312][CORE] Fix bug in Executor allocation manager in running tasks calculation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
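For intuition only (the PR description above does not quote the faulty code, and this is not the actual ExecutorAllocationManager logic): a running-task counter can go negative when task-end events are processed after the per-stage bookkeeping has already been cleared, for example:

```python
# Hypothetical illustration of the failure mode, not Spark code.
running = {}  # stage id -> number of running tasks

def on_task_start(stage):
    running[stage] = running.get(stage, 0) + 1

def on_stage_completed(stage):
    running.pop(stage, None)  # per-stage bookkeeping is dropped here

def on_task_end(stage):
    running[stage] = running.get(stage, 0) - 1  # a late event still decrements

on_task_start(1)
on_stage_completed(1)
on_task_end(1)
print(sum(running.values()))  # -1: the total running-task count is now negative
```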
[GitHub] spark issue #19269: [SPARK-22026][SQL] data source v2 write path
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19269 **[Test build #82902 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82902/testReport)** for PR 19269 at commit [`b065003`](https://github.com/apache/spark/commit/b065003f2b68e05218fe4bf4079bd6a8c6afcc22). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527 Benchmark against the multi-column one-hot encoder. Multi-Col, Multiple Run: the first commit; runs a separate `treeAggregate` per column. Multi-Col, Single Run: runs one `treeAggregate` over all columns, see the suggestion at https://github.com/apache/spark/pull/19527#discussion_r145457081.

Fitting (seconds, average of 10 runs):

numColumns | Multi-Col, Multiple Run | Multi-Col, Single Run
-- | -- | --
1 | 0.1100363843003 | 0.1296882409998
100 | 3.687933463507 | 0.3643889783995
1000 | 90.3695017947 | 2.4687475008

Transforming (seconds, average of 10 runs):

numColumns | Multi-Col, Multiple Run | Multi-Col, Single Run
-- | -- | --
1 | 0.1408046101999 | 0.1434849307
100 | 0.3636357813 | 0.4145960696996
1000 | 3.1933874685 | 2.8026313985

Benchmark code:

```scala
import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 1
val m = 1000
val rows = sc.parallelize(1 to n).map(i => Row(Array.fill(m)(random.nextInt(1000)): _*))
val struct = new StructType(Array.range(0, m, 1).map(i => StructField(s"c$i", IntegerType, true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()

val inputCols = Array.range(0, m, 1).map(i => s"c$i")
val outputCols = Array.range(0, m, 1).map(i => s"c${i}_encoded")
val encoder = new OneHotEncoderEstimator().setInputCols(inputCols).setOutputCols(outputCols)

var durationFitting = 0.0
var durationTransforming = 0.0
for (i <- 0 until 10) {
  val startFitting = System.nanoTime()
  val model = encoder.fit(df)
  val endFitting = System.nanoTime()
  durationFitting += (endFitting - startFitting) / 1e9

  val startTransforming = System.nanoTime()
  model.transform(df).count
  val endTransforming = System.nanoTime()
  durationTransforming += (endTransforming - startTransforming) / 1e9
}
println(s"fitting: ${durationFitting / 10}")
println(s"transforming: ${durationTransforming / 10}")
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145601384 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. + * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- ok maybe leave it as a followup for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19508 **[Test build #82897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82897/testReport)** for PR 19508 at commit [`25003cf`](https://github.com/apache/spark/commit/25003cfd324071fed5eacc0fde6420f81516bcea). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145597733 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -42,6 +43,13 @@ object ArrowUtils { case StringType => ArrowType.Utf8.INSTANCE case BinaryType => ArrowType.Binary.INSTANCE case DecimalType.Fixed(precision, scale) => new ArrowType.Decimal(precision, scale) +case DateType => new ArrowType.Date(DateUnit.DAY) +case TimestampType => + timeZoneId match { +case Some(id) => new ArrowType.Timestamp(TimeUnit.MICROSECOND, id) --- End diff -- Looking at the read side, the timezone id here is not used? We always convert to the Python local timezone. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145597432 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -42,6 +43,13 @@ object ArrowUtils { case StringType => ArrowType.Utf8.INSTANCE case BinaryType => ArrowType.Binary.INSTANCE case DecimalType.Fixed(precision, scale) => new ArrowType.Decimal(precision, scale) +case DateType => new ArrowType.Date(DateUnit.DAY) +case TimestampType => + timeZoneId match { +case Some(id) => new ArrowType.Timestamp(TimeUnit.MICROSECOND, id) --- End diff -- Here we are mapping Spark SQL timestamp type to Arrow timestamp with timezone type. Does Arrow have tz-naive timestamp type? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
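For reference on the question above: Arrow models the timezone as an optional property of the timestamp type, so a tz-naive timestamp does exist — in the Java API quoted in the diff this would presumably correspond to passing a null timezone string. A quick pyarrow sketch of the two variants:

```python
import pyarrow as pa

# The timezone is an optional field of the Arrow timestamp type.
naive = pa.timestamp("us")            # timestamp[us]         -> tz-naive
aware = pa.timestamp("us", tz="UTC")  # timestamp[us, tz=UTC] -> tz-aware
print(naive, aware, naive.equals(aware))  # the two types compare as different
```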
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19530 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82898/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19512 @FRosner, this was backported into branch-2.2, but it can't be automatically closed for some reason. Could you close this one manually, please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19530 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19512 Thanks. Merged to branch-2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19530 **[Test build #82898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82898/testReport)** for PR 19530 at commit [`5046240`](https://github.com/apache/spark/commit/504624017260f5f170502462316055f1c4709e1f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19512 Seems fine to backport into 2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19514 **[Test build #82901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82901/consoleFull)** for PR 19514 at commit [`7ec2dc7`](https://github.com/apache/spark/commit/7ec2dc70ce26ff754c0ea38f3cf9964d67bc62f7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19512: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19512 @rxin, it looks like the PR was merged into master by you. Do you think it's okay to backport it to other branches too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19514 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19533: Merge pull request #1 from apache/master
Github user BiggerBrain closed the pull request at: https://github.com/apache/spark/pull/19533 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19533: Merge pull request #1 from apache/master
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19533 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19533: Merge pull request #1 from apache/master
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19533 @BiggerBrain, it looks like this was opened by mistake. Could you close this, please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19533: Merge pull request #1 from apache/master
Github user BiggerBrain commented on the issue: https://github.com/apache/spark/pull/19533 get one commit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19533: Merge pull request #1 from apache/master
GitHub user BiggerBrain opened a pull request: https://github.com/apache/spark/pull/19533 Merge pull request #1 from apache/master get ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/BiggerBrain/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19533.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19533 commit 40294b363d31a3075a4fb1c1905dee757333c654 Author: æä¸é Date: 2017-10-18T08:29:29Z Merge pull request #1 from apache/master get --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 nope, using lazy val initialization won't work - at the very least, UnsafeKryoSerializerSuite modifies conf before context construction --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19532: [CORE]stage api modify the description format, add versi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19532 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19524: [SPARK-22302][INFRA] Remove manual backports for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19524#discussion_r145593044 --- Diff: dev/run-tests --- @@ -20,4 +20,10 @@ FWDIR="$(cd "`dirname $0`"/..; pwd)" cd "$FWDIR" +PYTHON_VERSION_CHECK=$(python -c 'import sys; print(sys.version_info < (2, 7, 0))') --- End diff -- @holdenk and @shaneknapp, it looks like I can't just check this within the Python scripts. As we know from the previous issue, `dev/run-tests.py` is not compatible with Python 2.6.x due to dictionary comprehension syntax:

```
python2.6 dev/run-tests.py
```
```
  File "dev/run-tests.py", line 128
    {m: set(m.dependencies).intersection(modules_to_test) for m in modules_to_test}, sort=True)
                                                          ^
SyntaxError: invalid syntax
```

I tried to change this as below:

```
git diff dev/run-tests.py
```
```diff
diff --git a/dev/run-tests.py b/dev/run-tests.py
index 72d148d7ea0..dcec912ae23 100755
--- a/dev/run-tests.py
+++ b/dev/run-tests.py
@@ -27,6 +27,10 @@ import sys
 import subprocess
 from collections import namedtuple
 
+if sys.version_info < (2, 7):
+    print("[error] Python versions prior to 2.7 are not supported.")
+    sys.exit(-1)
+
 from sparktestsupport import SPARK_HOME, USER_HOME, ERROR_CODES
 from sparktestsupport.shellutils import exit_from_command_with_retcode, run_cmd, rm_r, which
 from sparktestsupport.toposort import toposort_flatten, toposort
```

but it still fails with the same syntax error, because the whole file is parsed before the version check runs:

```
python2.6 dev/run-tests.py
```
```
  File "dev/run-tests.py", line 128
    {m: set(m.dependencies).intersection(modules_to_test) for m in modules_to_test}, sort=True)
                                                          ^
SyntaxError: invalid syntax
```

I don't think we should work around the Python 2.6 syntax error, since Python 2.6 support was dropped. So, I just tried to fix both `dev/run-tests` and `dev/run-tests-jenkins` as proposed currently. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19529 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82895/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19529 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19532: [CORE]stage api modify the description format, ad...
GitHub user guoxiaolongzte opened a pull request: https://github.com/apache/spark/pull/19532 [CORE]stage api modify the description format, add version api, modify the duration real-time calculation ## What changes were proposed in this pull request? Stage API: fix the description format, so that "A list of all stages for a given application" also includes the note "?status=[active|complete|pending|failed] list only stages in the state". Add an API doc for the version endpoint '/api/v1/version'. Calculate the duration in real time for running applications. ## How was this patch tested? Manual tests. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/guoxiaolongzte/spark SPARK-22311 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19532.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19532 commit 8f53eceb9ed3c33388cef09f628dfb7e4f6de70d Author: guoxiaolong Date: 2017-10-19T03:15:13Z [CORE]stage api modify the description format, add version api, modify the duration real-time calculation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19524 **[Test build #82900 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82900/testReport)** for PR 19524 at commit [`c56793b`](https://github.com/apache/spark/commit/c56793bac07493372f9bb360f6964032641c4867). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19529 **[Test build #82895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82895/testReport)** for PR 19529 at commit [`802a958`](https://github.com/apache/spark/commit/802a958b640067b99fda0b2c8587dea5b8000495). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19531 **[Test build #82899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82899/testReport)** for PR 19531 at commit [`b30de47`](https://github.com/apache/spark/commit/b30de470a11ca3f360260a8a36bc1e5eb4f355e8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19531: [SPARK-22310] [SQL] Refactor join estimation to incorpor...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19531 cc @cloud-fan @gatorsmile @ron8hu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19531: [SPARK-22310] [SQL] Refactor join estimation to i...
GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/19531 [SPARK-22310] [SQL] Refactor join estimation to incorporate estimation logic for different kinds of statistics ## What changes were proposed in this pull request? The current join estimation logic is only based on basic column statistics (such as ndv, etc). If we want to add estimation for other kinds of statistics (such as histograms), it's not easy to incorporate into the current algorithm: 1. When we have multiple pairs of join keys, the current algorithm computes cardinality in a single formula. But if different join keys have different kinds of stats, the computation logic for each pair of join keys become different, so the previous formula does not apply. 2. Currently it computes cardinality and updates join keys' column stats separately. It's better to do these two steps together, since both computation and update logic are different for different kinds of stats. ## How was this patch tested? Only refactor, covered by existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wzhfy/spark join_est_refactor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19531.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19531 commit b30de470a11ca3f360260a8a36bc1e5eb4f355e8 Author: Zhenhua Wang Date: 2017-10-19T02:45:53Z refactor --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
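To make the motivation above concrete, here is an illustrative pseudocode sketch (not Spark's `JoinEstimation` code; the names and stats layout are made up) of the shape such a refactor enables — for each pair of join keys, dispatch to an estimator based on the statistics available instead of plugging everything into one global formula:

```python
# Illustrative sketch only -- not Spark's JoinEstimation implementation.
def estimate_one_key(left, right):
    """Return (selectivity, joined ndv) for a single pair of join keys."""
    if "histogram" in left and "histogram" in right:
        # A different kind of stats gets a different estimator
        # (e.g. overlap the equi-height buckets of both sides).
        return estimate_from_histograms(left["histogram"], right["histogram"])
    # Basic column stats: matching rows scale by 1 / max(ndv_left, ndv_right).
    selectivity = 1.0 / max(left["ndv"], right["ndv"])
    return selectivity, min(left["ndv"], right["ndv"])

def estimate_from_histograms(left_hist, right_hist):
    raise NotImplementedError  # placeholder for a histogram-based estimator

print(estimate_one_key({"ndv": 100}, {"ndv": 40}))  # (0.01, 40)
```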
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145591347 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark issue #19530: [SPARK-22309][ML] Remove unused param in `LDAModel.getTo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19530 **[Test build #82898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82898/testReport)** for PR 19530 at commit [`5046240`](https://github.com/apache/spark/commit/504624017260f5f170502462316055f1c4709e1f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19530: [SPARK-22309][ML] Remove unused param in `LDAMode...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/19530 [SPARK-22309][ML] Remove unused param in `LDAModel.getTopicDistributionMethod` & destory `nodeToFeaturesBc` in RandomForest ## What changes were proposed in this pull request? Remove unused param in `LDAModel.getTopicDistributionMethod` destory `nodeToFeaturesBc` in RandomForest ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/zhengruifeng/spark lda_bc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19530.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19530 commit d4d2cad7d2ab7778f995790d69510d9f5425b1a4 Author: Zheng RuiFeng Date: 2017-10-19T02:42:04Z create pr commit 504624017260f5f170502462316055f1c4709e1f Author: Zheng RuiFeng Date: 2017-10-19T02:49:23Z update pr --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector to abs...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19508 **[Test build #82897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82897/testReport)** for PR 19508 at commit [`25003cf`](https://github.com/apache/spark/commit/25003cfd324071fed5eacc0fde6420f81516bcea). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 @MLnick Any more comments or thoughts on this I need to address? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19508: [SPARK-20783][SQL][Follow-up] Create ColumnVector...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19508#discussion_r145588024 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala --- @@ -495,6 +474,8 @@ private[columnar] case object DictionaryEncoding extends CompressionScheme { columnType.dataType match { case _: IntegerType => val dictionaryIds = columnVector.reserveDictionaryIds(capacity) + val intDictionary = dictionary.map(_.asInstanceOf[Int]) --- End diff -- Thanks @kiszk. I checked the comment and code and reverted this change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19459 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82894/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19459 **[Test build #82894 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82894/testReport)** for PR 19459 at commit [`3052f30`](https://github.com/apache/spark/commit/3052f3063e965d3636dd172a6981d93155b77fd2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145586905 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +/** + * The logical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2(writer: DataSourceV2Writer, query: LogicalPlan) extends LogicalPlan { + override def children: Seq[LogicalPlan] = Seq(query) + override def output: Seq[Attribute] = Nil +} + +/** + * The physical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2Exec(writer: DataSourceV2Writer, query: SparkPlan) extends SparkPlan { + override def children: Seq[SparkPlan] = Seq(query) + override def output: Seq[Attribute] = Nil + + override protected def doExecute(): RDD[InternalRow] = { +val writeTask = writer match { + case w: SupportsWriteInternalRow => w.createInternalRowWriterFactory() + case _ => new RowToInternalRowDataWriterFactory(writer.createWriterFactory(), query.schema) +} + +val rdd = query.execute() +val messages = new Array[WriterCommitMessage](rdd.partitions.length) + +logInfo(s"Start processing data source writer: $writer. 
" + + s"The input RDD has ${messages.length} partitions.") + +try { + sparkContext.runJob( +rdd, +(context: TaskContext, iter: Iterator[InternalRow]) => + DataWritingSparkTask.run(writeTask, context, iter), +rdd.partitions.indices, +(index, message: WriterCommitMessage) => messages(index) = message + ) + + logInfo(s"Data source writer $writer is committing.") + writer.commit(messages) + logInfo(s"Data source writer $writer committed.") +} catch { + case cause: Throwable => +logError(s"Data source writer $writer is aborting.") +try { + writer.abort(messages) +} catch { + case t: Throwable => +logError(s"Data source writer $writer failed to abort.") +cause.addSuppressed(t) +throw new SparkException("Writing job failed.", cause) +} +logError(s"Data source writer $writer aborted.") +throw new SparkException("Writing job aborted.", cause) +} + +sparkContext.emptyRDD + } +} + +object DataWritingSparkTask extends Logging { --- End diff -- This is mainly for log. If we inline these codes to `WriteToDataSourceV2Exec` and make `WriteToDataSourceV2Exec` extends `Logging`, then we have to serialize and send `WriteToDataSourceV2Exec` to executor side for the logging. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19357 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19357 **[Test build #82896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82896/testReport)** for PR 19357 at commit [`4995e3c`](https://github.com/apache/spark/commit/4995e3c703f65788a997f88df10427aca291d70a). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class EquiHeightHistogram(height: Double, ehBuckets: Seq[EquiHeightBucket]) extends Histogram` * `case class EquiHeightBucket(lo: Double, hi: Double, ndv: Long)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19357 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82896/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19357 **[Test build #82896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82896/testReport)** for PR 19357 at commit [`4995e3c`](https://github.com/apache/spark/commit/4995e3c703f65788a997f88df10427aca291d70a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145579513 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java --- @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import java.io.Serializable; + +import org.apache.spark.annotation.InterfaceStability; + +/** + * A factory of {@link DataWriter} returned by {@link DataSourceV2Writer#createWriterFactory()}, + * which is responsible for creating and initializing the actual data writer at executor side. + * + * Note that, the writer factory will be serialized and sent to executors, then the data writer + * will be created on executors and do the actual writing. So {@link DataWriterFactory} must be + * serializable and {@link DataWriter} doesn't need to be. + */ +@InterfaceStability.Evolving +public interface DataWriterFactory extends Serializable { + + /** + * Returns a data writer to do the actual writing work. + * + * @param stageId The id of the Spark stage that runs the returned writer. + * @param partitionId The id of the RDD partition that the returned writer will process. + * @param attemptNumber The attempt number of the Spark task that runs the returned writer, which + * is usually 0 if the task is not a retried task or a speculative task. + */ + DataWriter createWriter(int stageId, int partitionId, int attemptNumber); --- End diff -- not sure why we have stageId here. I'd make it more generic, e.g. a string for some job id, and then some numeric value (64 bit long) for epoch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145577058 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
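The scaladoc quoted above fully determines the encoding, so a small standalone sketch may help: with dropLast enabled, category index i maps to a sparse vector of size numCategories - 1 with a single 1.0, and the last category maps to all zeros. This mirrors the documented behaviour only; it is not the implementation in the PR, and the helper name is made up here.

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Encode a single category index the way the scaladoc describes.
def encodeCategory(categoryIndex: Int, numCategories: Int, dropLast: Boolean = true): Vector = {
  val size = if (dropLast) numCategories - 1 else numCategories
  if (categoryIndex < size) {
    Vectors.sparse(size, Array(categoryIndex), Array(1.0))        // single one-value at the index
  } else {
    Vectors.sparse(size, Array.empty[Int], Array.empty[Double])   // dropped last category: all zeros
  }
}

// encodeCategory(2, 5) gives [0.0, 0.0, 1.0, 0.0]; encodeCategory(4, 5) gives [0.0, 0.0, 0.0, 0.0].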
[GitHub] spark issue #13252: [SPARK-15473][SQL] CSV data source writes header for emp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/13252 It's not resolved yet. I am not working on this for now. Please take it over if you are willing to. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 There is one small hack in the way this was done, which is documented; see the comments and documentation on SharedSparkSession.initializeSession and SharedSparkContext.initializeContext. I would rather just have the session and context be lazy transient vals, which would work fine without this initialize call, but I didn't want to change the way tests currently run without input. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
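For context, a minimal sketch of the lazy transient val alternative mentioned in the comment above, assuming a plain scalatest mixin; the trait name and local-mode configuration are assumptions for illustration, not the code in the PR.

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

trait LazySharedSparkSession extends BeforeAndAfterAll { self: Suite =>

  // Lazy: the session is created the first time a test touches it, so no explicit
  // initializeSession() call is needed. Transient: the field is skipped if the suite
  // is ever serialized.
  @transient protected lazy val spark: SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName(getClass.getSimpleName)
      .getOrCreate()

  override protected def afterAll(): Unit = {
    try {
      spark.stop()
    } finally {
      super.afterAll()
    }
  }
}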
[GitHub] spark issue #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().ap...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19505 I meant to ask if others agree with the current change as I could not see the ongoing discussion at that time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user nkronenfeld commented on the issue: https://github.com/apache/spark/pull/19529 I made my original changes here by using `git mv PlanTest.scala PlanTestBase.scala`, `git mv SQLTestUnit.scala SQLTestUnitBase.scala`, and `git mv SharedSQLContext.scala SharedSparkSession.scala`, and then created new PlanTest, SQLTestUnit, and SharedSQLContext files, under the assumption that most of the code would go in the base, and this would help git provide better history and continuity. I'm not sure if that was the right decision or not - probably it is with PlanTest and SQLTestUnit, possibly not with SharedSQLContext, but since the diff in the PR doesn't reflect the git mv properly, I'm not sure if it will make a difference either way. If reviewers wish me to redo this PR without the initial `git mv`, I'll be happy to. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19529 **[Test build #82895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82895/testReport)** for PR 19529 at commit [`802a958`](https://github.com/apache/spark/commit/802a958b640067b99fda0b2c8587dea5b8000495). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19529: Support alternative unit testing styles in extern...
GitHub user nkronenfeld opened a pull request: https://github.com/apache/spark/pull/19529 Support alternative unit testing styles in external applications ## What changes were proposed in this pull request? Support unit testing of external code (i.e., applications that use Spark) with scalatest styles other than FunSuite. SharedSparkContext already supports this, but SharedSQLContext does not. I've introduced SharedSparkSession as a parent to SharedSQLContext, written in a way that supports all scalatest styles. ## How was this patch tested? Three new unit test suites are added that test using FunSpec, FlatSpec, and WordSpec. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nkronenfeld/spark alternative-style-tests-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19529.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19529 commit b9d41cd79f05f6c420d070ad07cdfa8f853fd461 Author: Nathan Kronenfeld Date: 2017-10-15T03:04:16Z Separate out the portion of SharedSQLContext that requires a FunSuite from the part that works with just any old test suite. commit 0d4bd97247a2d083c7de55663703b38a34298c9c Author: Nathan Kronenfeld Date: 2017-10-15T15:57:09Z Fix typo in trait name commit 83c44f1c24619e906af48180d0aace38587aa88d Author: Nathan Kronenfeld Date: 2017-10-15T15:57:42Z Add simple tests for each non-FunSuite test style commit e460612ec6f36e62d8d21d88c2344378ecba581a Author: Nathan Kronenfeld Date: 2017-10-15T16:20:44Z Document testing possibilities commit 0ee2aadf29b681b23bed356b14038525574204a5 Author: Nathan Kronenfeld Date: 2017-10-18T23:46:44Z Better documentation of testing procedures commit 802a958b640067b99fda0b2c8587dea5b8000495 Author: Nathan Kronenfeld Date: 2017-10-18T23:46:58Z Same initialization issue in SharedSparkContext as is in SharedSparkSession --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
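To make the goal concrete, here is a hedged sketch of the kind of suite this PR is meant to enable: a non-FunSuite scalatest style (WordSpec) mixing in SharedSparkSession. It assumes the shared trait exposes a local SparkSession named `spark` and lives under org.apache.spark.sql.test; the suite itself is illustrative and not part of the PR.

import org.scalatest.WordSpec
import org.apache.spark.sql.test.SharedSparkSession

class WordSpecStyleSuite extends WordSpec with SharedSparkSession {
  "a DataFrame" should {
    "count its rows" in {
      val df = spark.range(3).toDF("value")   // spark comes from SharedSparkSession
      assert(df.count() === 3L)
    }
  }
}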
[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec
Github user sitalkedia commented on the issue: https://github.com/apache/spark/pull/18805 Created https://github.com/luben/zstd-jni/issues/47. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145572702 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -0,0 +1,137 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.v2 + +import org.apache.spark.{SparkException, TaskContext} +import org.apache.spark.internal.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.Row +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} +import org.apache.spark.sql.catalyst.expressions.Attribute +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.execution.SparkPlan +import org.apache.spark.sql.sources.v2.writer._ +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +/** + * The logical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2(writer: DataSourceV2Writer, query: LogicalPlan) extends LogicalPlan { + override def children: Seq[LogicalPlan] = Seq(query) + override def output: Seq[Attribute] = Nil +} + +/** + * The physical plan for writing data into data source v2. + */ +case class WriteToDataSourceV2Exec(writer: DataSourceV2Writer, query: SparkPlan) extends SparkPlan { + override def children: Seq[SparkPlan] = Seq(query) + override def output: Seq[Attribute] = Nil + + override protected def doExecute(): RDD[InternalRow] = { +val writeTask = writer match { + case w: SupportsWriteInternalRow => w.createInternalRowWriterFactory() + case _ => new RowToInternalRowDataWriterFactory(writer.createWriterFactory(), query.schema) +} + +val rdd = query.execute() +val messages = new Array[WriterCommitMessage](rdd.partitions.length) + +logInfo(s"Start processing data source writer: $writer. 
" + + s"The input RDD has ${messages.length} partitions.") + +try { + sparkContext.runJob( +rdd, +(context: TaskContext, iter: Iterator[InternalRow]) => + DataWritingSparkTask.run(writeTask, context, iter), +rdd.partitions.indices, +(index, message: WriterCommitMessage) => messages(index) = message + ) + + logInfo(s"Data source writer $writer is committing.") + writer.commit(messages) + logInfo(s"Data source writer $writer committed.") +} catch { + case cause: Throwable => +logError(s"Data source writer $writer is aborting.") +try { + writer.abort(messages) +} catch { + case t: Throwable => +logError(s"Data source writer $writer failed to abort.") +cause.addSuppressed(t) +throw new SparkException("Writing job failed.", cause) +} +logError(s"Data source writer $writer aborted.") +throw new SparkException("Writing job aborted.", cause) +} + +sparkContext.emptyRDD + } +} + +object DataWritingSparkTask extends Logging { --- End diff -- What is the reason we need to create a separate object for this function `run`? Why not moving it to `WriteToDataSourceV2Exec `? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19485 Could it be an option to leave a link in the API doc back to the new page to refer to the options, and remove the option list from the API doc, @gatorsmile and @liancheng? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145569192 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19459 **[Test build #82894 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82894/testReport)** for PR 19459 at commit [`3052f30`](https://github.com/apache/spark/commit/3052f3063e965d3636dd172a6981d93155b77fd2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145569379 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark pull request #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19505#discussion_r145568265 --- Diff: python/pyspark/sql/functions.py --- @@ -2192,67 +2208,82 @@ def pandas_udf(f=None, returnType=StringType()): :param f: user-defined function. A python function if used as a standalone function :param returnType: a :class:`pyspark.sql.types.DataType` object -The user-defined function can define one of the following transformations: - -1. One or more `pandas.Series` -> A `pandas.Series` - - This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and - :meth:`pyspark.sql.DataFrame.select`. - The returnType should be a primitive data type, e.g., `DoubleType()`. - The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. - - >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(returnType=StringType()) - ... def to_upper(s): - ... return s.str.upper() - ... - >>> @pandas_udf(returnType="integer") - ... def add_one(x): - ... return x + 1 - ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) - >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ - ... .show() # doctest: +SKIP - +--+--++ - |slen(name)|to_upper(name)|add_one(age)| - +--+--++ - | 8| JOHN DOE| 22| - +--+--++ - -2. A `pandas.DataFrame` -> A `pandas.DataFrame` - - This udf is only used with :meth:`pyspark.sql.GroupedData.apply`. - The returnType should be a :class:`StructType` describing the schema of the returned - `pandas.DataFrame`. - - >>> df = spark.createDataFrame( - ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], - ... ("id", "v")) - >>> @pandas_udf(returnType=df.schema) - ... def normalize(pdf): - ... v = pdf.v - ... return pdf.assign(v=(v - v.mean()) / v.std()) - >>> df.groupby('id').apply(normalize).show() # doctest: +SKIP - +---+---+ - | id| v| - +---+---+ - | 1|-0.7071067811865475| - | 1| 0.7071067811865475| - | 2|-0.8320502943378437| - | 2|-0.2773500981126146| - | 2| 1.1094003924504583| - +---+---+ - - .. note:: This type of udf cannot be used with functions such as `withColumn` or `select` - because it defines a `DataFrame` transformation rather than a `Column` - transformation. - - .. seealso:: :meth:`pyspark.sql.GroupedData.apply` +The user-defined function can define the following transformation: + +One or more `pandas.Series` -> A `pandas.Series` + +This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and +:meth:`pyspark.sql.DataFrame.select`. +The returnType should be a primitive data type, e.g., `DoubleType()`. +The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. + +>>> from pyspark.sql.types import IntegerType, StringType +>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) +>>> @pandas_udf(returnType=StringType()) +... def to_upper(s): +... return s.str.upper() +... +>>> @pandas_udf(returnType="integer") +... def add_one(x): +... return x + 1 +... +>>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) +>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ +... .show() # doctest: +SKIP ++--+--++ +|slen(name)|to_upper(name)|add_one(age)| ++--+--++ +| 8| JOHN DOE| 22| ++--+--++ + +.. note:: The user-defined function must be deterministic. 
+""" +return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.PANDAS_UDF) + + +@since(2.3) +def pandas_grouped_udf(f=None, returnType=StructType()): --- End diff -- Post it in another PR https://github.com/apache/spark/pull/19517? This discussion thread will be collapsed when Takuya made a code change. --- ---
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r145567644 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala --- @@ -0,0 +1,439 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model, Transformer} +import org.apache.spark.ml.attribute._ +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{DoubleType, NumericType, StructField, StructType} + +/** Private trait for params for OneHotEncoderEstimator and OneHotEncoderModel */ +private[ml] trait OneHotEncoderParams extends Params with HasHandleInvalid +with HasInputCols with HasOutputCols { + + /** + * Param for how to handle invalid data. + * Options are 'skip' (filter out rows with invalid data) or 'error' (throw an error). + * Default: "error" + * @group param + */ + @Since("2.3.0") + override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", +"How to handle invalid data " + +"Options are 'skip' (filter out rows with invalid data) or error (throw an error).", + ParamValidators.inArray(OneHotEncoderEstimator.supportedHandleInvalids)) + + setDefault(handleInvalid, OneHotEncoderEstimator.ERROR_INVALID) + + /** + * Whether to drop the last category in the encoded vector (default: true) + * @group param + */ + @Since("2.3.0") + final val dropLast: BooleanParam = +new BooleanParam(this, "dropLast", "whether to drop the last category") + setDefault(dropLast -> true) + + /** @group getParam */ + @Since("2.3.0") + def getDropLast: Boolean = $(dropLast) +} + +/** + * A one-hot encoder that maps a column of category indices to a column of binary vectors, with + * at most a single one-value per row that indicates the input category index. + * For example with 5 categories, an input value of 2.0 would map to an output vector of + * `[0.0, 0.0, 1.0, 0.0]`. + * The last category is not included by default (configurable via `dropLast`), + * because it makes the vector entries sum up to one, and hence linearly dependent. + * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`. + * + * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories. + * The output vectors are sparse. 
+ * + * @see `StringIndexer` for converting categorical values into category indices + */ +@Since("2.3.0") +class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String) +extends Estimator[OneHotEncoderModel] with OneHotEncoderParams with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("oneHotEncoder")) + + /** @group setParam */ + @Since("2.3.0") + def setInputCols(values: Array[String]): this.type = set(inputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setOutputCols(values: Array[String]): this.type = set(outputCols, values) + + /** @group setParam */ + @Since("2.3.0") + def setDropLast(value: Boolean): this.type = set(dropLast, value) + + /** @group setParam */ + @Since("2.3.0") + def setHandleInvalid(value: String): this.type = set(handleInvalid, value) + + @Since("2.3.0") + override def transformSchema(schema: StructType): S
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19524 Sure, thanks. Let me update soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19485: [SPARK-20055] [Docs] Added documentation for loading csv...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19485 My only worry is duplication, since we would have another place to update the doc for options. Otherwise it sounds okay to me too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19269: [SPARK-22026][SQL] data source v2 write path
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19269#discussion_r145563004 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriter.java --- @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2.writer; + +import org.apache.spark.annotation.InterfaceStability; + +/** + * A data writer returned by {@link DataWriterFactory#createWriter(int, int, int)} and is + * responsible for writing data for an input RDD partition. + * + * One Spark task has one exclusive data writer, so there is no thread-safe concern. + * + * {@link #write(Object)} is called for each record in the input RDD partition. If one record fails + * the {@link #write(Object)}, {@link #abort()} is called afterwards and the remaining records will + * not be processed. If all records are successfully written, {@link #commit()} is called. + * + * If this data writer successes(all records are successfully written and {@link #commit()} --- End diff -- `successes ` -> `succeeds` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19505: [WIP][SPARK-20396][SQL][PySpark][FOLLOW-UP] group...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19505#discussion_r145552397 --- Diff: python/pyspark/sql/functions.py --- @@ -2192,67 +2208,82 @@ def pandas_udf(f=None, returnType=StringType()): :param f: user-defined function. A python function if used as a standalone function :param returnType: a :class:`pyspark.sql.types.DataType` object -The user-defined function can define one of the following transformations: - -1. One or more `pandas.Series` -> A `pandas.Series` - - This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and - :meth:`pyspark.sql.DataFrame.select`. - The returnType should be a primitive data type, e.g., `DoubleType()`. - The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. - - >>> from pyspark.sql.types import IntegerType, StringType - >>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) - >>> @pandas_udf(returnType=StringType()) - ... def to_upper(s): - ... return s.str.upper() - ... - >>> @pandas_udf(returnType="integer") - ... def add_one(x): - ... return x + 1 - ... - >>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) - >>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ - ... .show() # doctest: +SKIP - +--+--++ - |slen(name)|to_upper(name)|add_one(age)| - +--+--++ - | 8| JOHN DOE| 22| - +--+--++ - -2. A `pandas.DataFrame` -> A `pandas.DataFrame` - - This udf is only used with :meth:`pyspark.sql.GroupedData.apply`. - The returnType should be a :class:`StructType` describing the schema of the returned - `pandas.DataFrame`. - - >>> df = spark.createDataFrame( - ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], - ... ("id", "v")) - >>> @pandas_udf(returnType=df.schema) - ... def normalize(pdf): - ... v = pdf.v - ... return pdf.assign(v=(v - v.mean()) / v.std()) - >>> df.groupby('id').apply(normalize).show() # doctest: +SKIP - +---+---+ - | id| v| - +---+---+ - | 1|-0.7071067811865475| - | 1| 0.7071067811865475| - | 2|-0.8320502943378437| - | 2|-0.2773500981126146| - | 2| 1.1094003924504583| - +---+---+ - - .. note:: This type of udf cannot be used with functions such as `withColumn` or `select` - because it defines a `DataFrame` transformation rather than a `Column` - transformation. - - .. seealso:: :meth:`pyspark.sql.GroupedData.apply` +The user-defined function can define the following transformation: + +One or more `pandas.Series` -> A `pandas.Series` + +This udf is used with :meth:`pyspark.sql.DataFrame.withColumn` and +:meth:`pyspark.sql.DataFrame.select`. +The returnType should be a primitive data type, e.g., `DoubleType()`. +The length of the returned `pandas.Series` must be of the same as the input `pandas.Series`. + +>>> from pyspark.sql.types import IntegerType, StringType +>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) +>>> @pandas_udf(returnType=StringType()) +... def to_upper(s): +... return s.str.upper() +... +>>> @pandas_udf(returnType="integer") +... def add_one(x): +... return x + 1 +... +>>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age")) +>>> df.select(slen("name").alias("slen(name)"), to_upper("name"), add_one("age")) \\ +... .show() # doctest: +SKIP ++--+--++ +|slen(name)|to_upper(name)|add_one(age)| ++--+--++ +| 8| JOHN DOE| 22| ++--+--++ + +.. note:: The user-defined function must be deterministic. 
+""" +return _create_udf(f, returnType=returnType, pythonUdfType=PythonUdfType.PANDAS_UDF) + + +@since(2.3) +def pandas_grouped_udf(f=None, returnType=StructType()): --- End diff -- Here is a summary of the current proposal: I. Use only `pandas_udf` -- The main issues with this approach as a few people comment out is that it is hard to know
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82893/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19439 **[Test build #82893 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82893/testReport)** for PR 19439 at commit [`d114e40`](https://github.com/apache/spark/commit/d114e4017c06b7a8c1408ef8d289883c990c5046). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19383: [SPARK-20643][core] Add listener implementation to colle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19383 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user imaxxs commented on the issue: https://github.com/apache/spark/pull/18784 I work with @ArtRand and @susanxhuynh. Fine-grained mode has been deprecated for a while. If it is standard procedure to wait until the next release, and if that is 3.0, we should wait until the Spark 3.0 release. @skonto @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19383: [SPARK-20643][core] Add listener implementation to colle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19383 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82890/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19383: [SPARK-20643][core] Add listener implementation to colle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19383 **[Test build #82890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82890/testReport)** for PR 19383 at commit [`ca43746`](https://github.com/apache/spark/commit/ca4374612abdaab7cd1c449e65d87878c68e15d2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19528: [SPARK-20393] [Core] Existing patch applied to 1.6 branc...
Github user ambauma commented on the issue: https://github.com/apache/spark/pull/19528 Understood. Working on porting to 2.0... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19509 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82888/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19509 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19509 **[Test build #82888 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82888/testReport)** for PR 19509 at commit [`94223be`](https://github.com/apache/spark/commit/94223beaeffa9793fc1529bafd8a65b4b3185d7a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you, @rxin ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19521 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82892/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18692 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18692: [SPARK-21417][SQL] Infer join conditions using propagate...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18692 **[Test build #82892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82892/testReport)** for PR 18692 at commit [`b69185c`](https://github.com/apache/spark/commit/b69185c2a7466b69a3f244a257449dbf1dd0ee21). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org