[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19820 Would it be possible to add which tests caused this exception in the description?
[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19820 LGTM except one comment.
[GitHub] spark pull request #19814: [SPARK-22484][DOC] Document PySpark DataFrame csv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19814#discussion_r153062402 --- Diff: python/pyspark/sql/readwriter.py --- @@ -828,8 +828,7 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No set, it uses the default value, ``,``. :param quote: sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default - value, ``"``. If you would like to turn off quotations, you need to set an - empty string. + value, ``"``. If empty string is set, it uses ``u``. --- End diff -- If there are doc changes to be done for options here, let's make sure to change all of them. A quick and easy way will be just to `grep` and replace.
[GitHub] spark issue #19814: [SPARK-22484][DOC] Document PySpark DataFrame csv writer...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19814 Can we turn it off as documented? We could try to open a JIRA in Univocity if this functionality is not there and incorporate the change in Spark.
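For context, a minimal Scala sketch of the option under discussion (the PR itself only touches the PySpark docs); the data and path are illustrative, and whether the empty-string setting truly disables quoting is exactly what this thread questions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// Sketch only: the pre-change docs say an empty quote string turns quoting
// off on write; whether that actually works is the question in this thread.
val df = Seq(("a,b", 1)).toDF("key", "value") // "key" contains the separator
df.write.option("quote", "").csv("/tmp/quote-off-sketch") // illustrative path
```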
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19815 LGTM except one comment
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19815#discussion_r153061379 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java --- @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) { @Override public UTF8String getUTF8String(int rowId) { --- End diff -- +1
[GitHub] spark pull request #19817: [SPARK-22603][SQL] Fix 64KB JVM bytecode limit pr...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19817#discussion_r153060800 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -1372,19 +1372,30 @@ case class FormatString(children: Expression*) extends Expression with ImplicitC val pattern = children.head.genCode(ctx) val argListGen = children.tail.map(x => (x.dataType, x.genCode(ctx))) -val argListCode = argListGen.map(_._2.code + "\n") - -val argListString = argListGen.foldLeft("")((s, v) => { - val nullSafeString = +val argList = ctx.freshName("argLists") +val numArgLists = argListGen.length +val argListCode = argListGen.zipWithIndex.map { case(v, index) => + val value = if (ctx.boxedType(v._1) != ctx.javaType(v._1)) { // Java primitives get boxed in order to allow null values. s"(${v._2.isNull}) ? (${ctx.boxedType(v._1)}) null : " + s"new ${ctx.boxedType(v._1)}(${v._2.value})" } else { s"(${v._2.isNull}) ? null : ${v._2.value}" } - s + "," + nullSafeString -}) + s""" + ${v._2.code} + $argList[$index] = $value; + """ +} +val argListCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == null) { + ctx.splitExpressions( +expressions = argListCode, +funcName = "valueFormatString", +arguments = ("InternalRow", ctx.INPUT_ROW) :: ("Object[]", argList) :: Nil) +} else { + argListCode.mkString("\n") +} --- End diff -- Sure
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r153060560 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType)) } } + +/** + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints. + * + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join + * conditions would potentially shuffle children as child node's partitioning won't satisfy the JOIN + * node's requirements which otherwise could have. + * + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right + * relation has 'b = 1', the rule infers 'a = b' as a join predicate. + */ +object EliminateCrossJoin extends Rule[LogicalPlan] with PredicateHelper { + + def apply(plan: LogicalPlan): LogicalPlan = { +if (SQLConf.get.constraintPropagationEnabled) { + eliminateCrossJoin(plan) +} else { + plan +} + } + + private def eliminateCrossJoin(plan: LogicalPlan): LogicalPlan = plan transform { +case join@Join(leftPlan, rightPlan, Cross, None) => --- End diff -- Nit: `join@Join` -> `join @ Join`
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r153060551 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType)) } } + +/** + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints. + * + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join + * conditions would potentially shuffle children as child node's partitioning won't satisfy the JOIN + * node's requirements which otherwise could have. + * + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right + * relation has 'b = 1', the rule infers 'a = b' as a join predicate. --- End diff -- > For instance, given a CROSS join with the constraint 'a = 1' from the left child and the constraint 'b = 1' from the right child, this rule infers a new join predicate 'a = b' and converts it to an Inner join.
[GitHub] spark pull request #18692: [SPARK-21417][SQL] Infer join conditions using pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18692#discussion_r153060595 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -152,3 +152,99 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { if (j.joinType == newJoinType) f else Filter(condition, j.copy(joinType = newJoinType)) } } + +/** + * A rule that eliminates CROSS joins by inferring join conditions from propagated constraints. + * + * The optimization is applicable only to CROSS joins. For other join types, adding inferred join + * conditions would potentially shuffle children as child node's partitioning won't satisfy the JOIN + * node's requirements which otherwise could have. + * + * For instance, if there is a CROSS join, where the left relation has 'a = 1' and the right + * relation has 'b = 1', the rule infers 'a = b' as a join predicate. + */ +object EliminateCrossJoin extends Rule[LogicalPlan] with PredicateHelper { + + def apply(plan: LogicalPlan): LogicalPlan = { +if (SQLConf.get.constraintPropagationEnabled) { + eliminateCrossJoin(plan) +} else { + plan +} + } + + private def eliminateCrossJoin(plan: LogicalPlan): LogicalPlan = plan transform { +case join@Join(leftPlan, rightPlan, Cross, None) => + val leftConstraints = join.constraints.filter(_.references.subsetOf(leftPlan.outputSet)) + val rightConstraints = join.constraints.filter(_.references.subsetOf(rightPlan.outputSet)) + val inferredJoinPredicates = inferJoinPredicates(leftConstraints, rightConstraints) + val joinConditionOpt = inferredJoinPredicates.reduceOption(And) + if (joinConditionOpt.isDefined) Join(leftPlan, rightPlan, Inner, joinConditionOpt) else join + } + + private def inferJoinPredicates( + leftConstraints: Set[Expression], + rightConstraints: Set[Expression]): Set[EqualTo] = { + +// iterate through the left constraints and build a hash map that points semantically +// equivalent expressions into attributes +val emptyEquivalenceMap = Map.empty[SemanticExpression, Set[Attribute]] +val equivalenceMap = leftConstraints.foldLeft(emptyEquivalenceMap) { case (map, constraint) => + constraint match { +case EqualTo(attr: Attribute, expr: Expression) => + updateEquivalenceMap(map, attr, expr) +case EqualTo(expr: Expression, attr: Attribute) => + updateEquivalenceMap(map, attr, expr) +case _ => map + } +} + +// iterate through the right constraints and infer join conditions using the equivalence map +rightConstraints.foldLeft(Set.empty[EqualTo]) { case (joinConditions, constraint) => + constraint match { +case EqualTo(attr: Attribute, expr: Expression) => + appendJoinConditions(attr, expr, equivalenceMap, joinConditions) +case EqualTo(expr: Expression, attr: Attribute) => + appendJoinConditions(attr, expr, equivalenceMap, joinConditions) +case _ => joinConditions + } +} + } + + private def updateEquivalenceMap( + equivalenceMap: Map[SemanticExpression, Set[Attribute]], + attr: Attribute, + expr: Expression): Map[SemanticExpression, Set[Attribute]] = { + +val equivalentAttrs = equivalenceMap.getOrElse(expr, Set.empty[Attribute]) +if (equivalentAttrs.contains(attr)) { + equivalenceMap +} else { + equivalenceMap.updated(expr, equivalentAttrs + attr) +} + } + + private def appendJoinConditions( + attr: Attribute, + expr: Expression, + equivalenceMap: Map[SemanticExpression, Set[Attribute]], + joinConditions: Set[EqualTo]): Set[EqualTo] = { + +equivalenceMap.get(expr) match { + case Some(equivalentAttrs) => joinConditions ++ 
equivalentAttrs.map(EqualTo(attr, _)) + case None => joinConditions +} + } + + // the purpose of this class is to treat 'a === 1 and 1 === 'a as the same expressions + implicit class SemanticExpression(private val expr: Expression) { --- End diff -- Can we reuse `EquivalentExpressions`? You can search the code base and see how the others use it.
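To make the rule under review concrete, a minimal illustration (not the PR's test code) of the rewrite it targets, assuming constraint propagation is enabled:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// Left child carries the constraint a = 1, right child carries b = 1.
val left  = Seq(1, 2, 3).toDF("a").filter($"a" === 1)
val right = Seq(1, 2, 3).toDF("b").filter($"b" === 1)

// With constraint propagation enabled, the proposed rule would infer the
// predicate a = b and rewrite this CROSS join as an Inner join.
val joined = left.crossJoin(right)
joined.explain(true) // inspect whether the optimized plan shows an inner join
```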
[GitHub] spark pull request #19817: [SPARK-22603][SQL] Fix 64KB JVM bytecode limit pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19817#discussion_r153060489 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala --- @@ -1372,19 +1372,30 @@ case class FormatString(children: Expression*) extends Expression with ImplicitC val pattern = children.head.genCode(ctx) val argListGen = children.tail.map(x => (x.dataType, x.genCode(ctx))) -val argListCode = argListGen.map(_._2.code + "\n") - -val argListString = argListGen.foldLeft("")((s, v) => { - val nullSafeString = +val argList = ctx.freshName("argLists") +val numArgLists = argListGen.length +val argListCode = argListGen.zipWithIndex.map { case(v, index) => + val value = if (ctx.boxedType(v._1) != ctx.javaType(v._1)) { // Java primitives get boxed in order to allow null values. s"(${v._2.isNull}) ? (${ctx.boxedType(v._1)}) null : " + s"new ${ctx.boxedType(v._1)}(${v._2.value})" } else { s"(${v._2.isNull}) ? null : ${v._2.value}" } - s + "," + nullSafeString -}) + s""" + ${v._2.code} + $argList[$index] = $value; + """ +} +val argListCodes = if (ctx.INPUT_ROW != null && ctx.currentVars == null) { + ctx.splitExpressions( +expressions = argListCode, +funcName = "valueFormatString", +arguments = ("InternalRow", ctx.INPUT_ROW) :: ("Object[]", argList) :: Nil) +} else { + argListCode.mkString("\n") +} --- End diff -- Could you create a `splitExpressions` in `CodegenContext` to avoid the duplicated code, like

```Scala
if (ctx.INPUT_ROW != null && ctx.currentVars == null) {
  ctx.splitExpressions(...)
} else {
  inputs.mkString("\n")
}
```
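A rough sketch of what such a helper could look like inside `CodegenContext`; the method name and default argument are assumptions, and it merely wraps the `splitExpressions` overload already used in the diff above:

```scala
// Hypothetical helper for CodegenContext; the name is illustrative only.
def splitExpressionsWithCurrentInputs(
    expressions: Seq[String],
    funcName: String,
    extraArguments: Seq[(String, String)] = Nil): String = {
  if (INPUT_ROW != null && currentVars == null) {
    // Row-based codegen: every input is reachable from INPUT_ROW, so the
    // snippets can safely be split into named helper functions.
    splitExpressions(expressions, funcName,
      ("InternalRow", INPUT_ROW) +: extraArguments)
  } else {
    // Whole-stage codegen: inputs may be local variables, so keep them inline.
    expressions.mkString("\n")
  }
}
```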
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19815#discussion_r153060413 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java --- @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) { @Override public UTF8String getUTF8String(int rowId) { if (dictionary == null) { - ColumnarArray a = getByteArray(rowId); - return UTF8String.fromBytes(a.byteArray, a.byteArrayOffset, a.length); + return childColumns[0].getUTF8String0(getArrayOffset(rowId), getArrayLength(rowId)); } else { byte[] bytes = dictionary.decodeToBinary(dictionaryIds.getDictId(rowId)); return UTF8String.fromBytes(bytes); --- End diff -- hmm, but it looks like `decodeToBinary` will copy the byte data?
[GitHub] spark pull request #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19815#discussion_r153060336 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java --- @@ -324,24 +315,27 @@ public void putDecimal(int rowId, Decimal value, int precision) { @Override public UTF8String getUTF8String(int rowId) { --- End diff -- Shall we add a comment that `getUTF8String` reuses the data in the column vector? It seems different from the other getXXX APIs.
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19816 Let me maybe try to deal with this separately ..
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19816 I tested 16, 32 and 64 too:

16:
```
Failed -
1. Failure: group by, agg functions (@test_sparkSQL.R#1839)
30 not equal to collect(max(gd))[2, 2].
1/1 mismatches
[1] 30 - NA == NA

2. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) ---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004964011
Component “Python”: Mean relative difference: 0.2857143
Component “R”: Mean relative difference: 0.08695652

3. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) ---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004964011
Component “Python”: Mean relative difference: 0.2857143
Component “R”: Mean relative difference: 0.08695652

4. Failure: pivot GroupedData column (@test_sparkSQL.R#1923) ---
`sum3` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004964011
Component “Python”: Mean relative difference: 0.2857143
Component “R”: Mean relative difference: 0.08695652

5. Failure: pivot GroupedData column (@test_sparkSQL.R#1924) ---
`sum4` not equal to correct_answer[, c("year", "R")].
Component “year”: Mean relative difference: 0.0004964011
Component “R”: Mean relative difference: 0.08695652

DONE ===
```

32:
```
Failed -
1. Failure: spark.als (@test_mllib_recommendation.R#36)
predictions$prediction not equal to c(-0.1380762, 2.6258414, -1.5018409).
3/3 mismatches (average diff: 2.75)
[1] 2.626 - -0.138 == 2.76
[2] -1.502 - 2.626 == -4.13
[3] -0.138 - -1.502 == 1.36

2. Failure: group by, agg functions (@test_sparkSQL.R#1839)
30 not equal to collect(max(gd))[2, 2].
1/1 mismatches
[1] 30 - 19 == 11

3. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) ---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625

4. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) ---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625

5. Failure: pivot GroupedData column (@test_sparkSQL.R#1923) ---
`sum3` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625

6. Failure: pivot GroupedData column (@test_sparkSQL.R#1924) ---
`sum4` not equal to correct_answer[, c("year", "R")].
Component “year”: Mean relative difference: 0.0009925558
Component “R”: Mean relative difference: 0.625

DONE ===
```

64:
```
Failed -
1. Failure: spark.als (@test_mllib_recommendation.R#36)
predictions$prediction not equal to c(-0.1380762, 2.6258414, -1.5018409).
2/3 mismatches (average diff: 1.36)
[1] -1.502 - -0.138 == -1.36
[3] -0.138 - -1.502 == 1.36

2. Failure: group by, agg functions (@test_sparkSQL.R#1839)
30 not equal to collect(max(gd))[2, 2].
1/1 mismatches
[1] 30 - 19 == 11

3. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) ---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625

4. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) ---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0009925558
Component “Python”: Mean relative difference: 0.3783784
Component “R”: Mean relative difference: 0.625

5. Failure: pivot GroupedData column (@test_sparkSQL.R#
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84190/ Test PASSed.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19783 Merged build finished. Test PASSed.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84190/testReport)** for PR 19783 at commit [`8e5d04e`](https://github.com/apache/spark/commit/8e5d04ef4687917b2487e5b9267bc40ba3ef33a1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19788: [SPARK-9853][Core] Optimize shuffle fetch of contiguous ...
Github user yucai commented on the issue: https://github.com/apache/spark/pull/19788 Currently users need to update their external shuffle service for this feature, because we change the format of `ShuffleBlockId`, which is supposed to be parsed by the external shuffle service. I am trying to introduce a new configuration like `spark.shuffle.continuousFetch`. By default, it is `false`, and Spark will still use `ShuffleBlockId` as always; when it is set to `true` intentionally, Spark will use `ContinuousShuffleBlockIds`. In this way, users do not need to update their external shuffle service if they only want to work with `ShuffleBlockId`.
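A minimal sketch of the opt-in flag described above; the config name comes straight from this comment and may change before merge:

```scala
import org.apache.spark.SparkConf

// Off by default: external shuffle services that only understand the classic
// ShuffleBlockId format keep working. Opting in enables continuous fetch.
val conf = new SparkConf()
  .setAppName("continuous-fetch-sketch")
  .set("spark.shuffle.continuousFetch", "true") // proposed name, per the comment
```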
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r153057489 --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java --- @@ -165,13 +165,23 @@ public ManagedBuffer getBlockData( String execId, int shuffleId, int mapId, - int reduceId) { + int reduceId, + int length) { ExecutorShuffleInfo executor = executors.get(new AppExecId(appId, execId)); if (executor == null) { throw new RuntimeException( String.format("Executor is not registered (appId=%s, execId=%s)", appId, execId)); } -return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId); +return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId, length); + } + + public ManagedBuffer getBlockData( --- End diff -- Thanks, will update.
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19816 Ah, makes sense. Let me at least try other numbers and be back soon anyway.
[GitHub] spark issue #19783: [SPARK-21322][SQL] support histogram in filter cardinali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19783 **[Test build #84190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84190/testReport)** for PR 19783 at commit [`8e5d04e`](https://github.com/apache/spark/commit/8e5d04ef4687917b2487e5b9267bc40ba3ef33a1).
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19754 Merged build finished. Test PASSed.
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19754 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84189/ Test PASSed.
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19754 **[Test build #84189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84189/testReport)** for PR 19754 at commit [`71ec15b`](https://github.com/apache/spark/commit/71ec15b698af18093143ea5f3fa88bf86dc26b4e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19754 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84188/ Test PASSed.
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19754 Merged build finished. Test PASSed.
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19754 **[Test build #84188 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84188/testReport)** for PR 19754 at commit [`41f4493`](https://github.com/apache/spark/commit/41f449354a91e8e10e8c39f63a2542eb6193). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19820 Merged build finished. Test PASSed.
[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19820 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84187/ Test PASSed.
[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19820 **[Test build #84187 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84187/testReport)** for PR 19820 at commit [`a0b6658`](https://github.com/apache/spark/commit/a0b6658a65d98158a8bf610a57edb240b99c8839). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test PASSed.
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84186/ Test PASSed.
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84186 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84186/testReport)** for PR 19811 at commit [`d01fcb1`](https://github.com/apache/spark/commit/d01fcb10f877fdc5635acaedc2676e5e3d18c772). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19754: [BUILD] update release scripts
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/19754#discussion_r153052719 --- Diff: dev/create-release/release-build.sh --- @@ -392,6 +375,7 @@ if [[ "$1" == "publish-release" ]]; then find . -type f |grep -v \.jar |grep -v \.pom | xargs rm echo "Creating hash and signature files" + # this must have .asc, .md5 and .sha1 - it really doesn't like anything else there --- End diff -- it rejects it when there are "extra files" of the sha512 extension.
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19816 I think this is likely a general problem with the default shuffle size being too big on limited resources in tests (particularly on Windows/AppVeyor). The additional test failures are expected, I guess, since fewer partitions can alter the result. Let's do a more isolated change for now? But I agree new tests can cause problems in the future. Perhaps we could identify a size that is smaller than the default that could get "close enough" to the result we have and set that for tests by default (16? 32? 64?).
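For reference, a hedged Scala sketch of pinning a smaller shuffle partition count in a test session; the SparkR harness would set the equivalent config, and 16 is just one of the candidate values floated above:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a test-scoped session with far fewer shuffle partitions than
// the default 200, to cut shuffle overhead on constrained CI machines.
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.shuffle.partitions", "16") // candidate value from this thread
  .getOrCreate()
```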
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19754 **[Test build #84189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84189/testReport)** for PR 19754 at commit [`71ec15b`](https://github.com/apache/spark/commit/71ec15b698af18093143ea5f3fa88bf86dc26b4e).
[GitHub] spark issue #19754: [BUILD] update release scripts
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19754 **[Test build #84188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84188/testReport)** for PR 19754 at commit [`41f4493`](https://github.com/apache/spark/commit/41f449354a91e8e10e8c39f63a2542eb6193).
[GitHub] spark issue #19820: [SPARK-22607][BUILD] Set large stack size consistently f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19820 **[Test build #84187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84187/testReport)** for PR 19820 at commit [`a0b6658`](https://github.com/apache/spark/commit/a0b6658a65d98158a8bf610a57edb240b99c8839).
[GitHub] spark pull request #19820: [SPARK-22607][BUILD] Set large stack size consist...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/19820 [SPARK-22607][BUILD] Set large stack size consistently for tests to avoid StackOverflowError

## What changes were proposed in this pull request?
Set `-ea` and `-Xss4m` consistently for tests

## How was this patch tested?
Existing tests. Manually verified it resolves the StackOverflowError this intends to resolve.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-22607 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19820.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19820 commit a0b6658a65d98158a8bf610a57edb240b99c8839 Author: Sean Owen Date: 2017-11-25T18:25:29Z Set -ea and -Xss4m consistently for tests
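As a rough illustration only (the actual patch edits the Maven and SBT test configurations; this standalone build.sbt-style sketch is not the patch):

```scala
// build.sbt sketch, assuming forked test JVMs: enable assertions and a 4 MB
// thread stack so deeply recursive tests don't hit StackOverflowError.
fork in Test := true
javaOptions in Test ++= Seq("-ea", "-Xss4m")
```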
[GitHub] spark issue #19815: [SPARK-22602][SQL] remove ColumnVector#loadBytes
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19815 I will look this Sunday.
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Merged build finished. Test PASSed.
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19811 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84185/ Test PASSed.
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84185 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84185/testReport)** for PR 19811 at commit [`d8a9f9e`](https://github.com/apache/spark/commit/d8a9f9e91d54353b19eec0fec585b3ae5143eb34). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84186/testReport)** for PR 19811 at commit [`d01fcb1`](https://github.com/apache/spark/commit/d01fcb10f877fdc5635acaedc2676e5e3d18c772).
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r153049711 --- Diff: common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolverSuite.java --- @@ -110,6 +110,13 @@ public void testSortShuffleBlocks() throws IOException { new InputStreamReader(block1Stream, StandardCharsets.UTF_8)); block1Stream.close(); assertEquals(sortBlock1, block1); + +InputStream block01Stream = +resolver.getBlockData("app0", "exec0", 0, 0, 0, 2).createInputStream(); +String block01 = CharStreams.toString( +new InputStreamReader(block01Stream, StandardCharsets.UTF_8)); --- End diff -- Thanks, updated!
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r153049707 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockId.scala --- @@ -52,8 +52,9 @@ case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId { // Format of the shuffle block ids (including data and index) should be kept in sync with // org.apache.spark.network.shuffle.ExternalShuffleBlockResolver#getBlockData(). @DeveloperApi -case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId { - override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId +case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int, length: Int = 1) + extends BlockId { + override def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + "_" + length --- End diff -- `ContinuousShuffleBlockIds` looks like a good idea, let me try.
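A hedged sketch of the separate id yucai agrees to try, written as if it lived alongside the other ids in BlockId.scala; the class and field names are assumptions based on this discussion:

```scala
// Hypothetical: ShuffleBlockId stays untouched for old external shuffle
// services, while a new id encodes how many contiguous blocks to fetch.
case class ContinuousShuffleBlockId(
    shuffleId: Int, mapId: Int, reduceId: Int, numBlocks: Int) extends BlockId {
  override def name: String =
    "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId + "_" + numBlocks
}
```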
[GitHub] spark pull request #19788: [SPARK-9853][Core] Optimize shuffle fetch of cont...
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/19788#discussion_r153049650 --- Diff: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala --- @@ -196,12 +196,14 @@ private[spark] class IndexShuffleBlockResolver( override def getBlockData(blockId: ShuffleBlockId): ManagedBuffer = { // The block is actually going to be a range of a single map output file for this map, so // find out the consolidated file, then the offset within that from our index +logDebug(s"Fetch block data for $blockId") --- End diff -- Without this info, it looks hard to know whether a continuous shuffle block read really happened, and I found `getLocalBytes` has similar debug info too. ``` logDebug(s"Getting local block $blockId as bytes") ``` How about keeping it?
[GitHub] spark issue #19811: [WIP][SPARK-18016][SQL] Code Generation: Constant Pool L...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19811 **[Test build #84185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84185/testReport)** for PR 19811 at commit [`d8a9f9e`](https://github.com/apache/spark/commit/d8a9f9e91d54353b19eec0fec585b3ae5143eb34).
[GitHub] spark pull request #19798: [SPARK-22583] First delegation token renewal time...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19798
[GitHub] spark issue #19798: [SPARK-22583] First delegation token renewal time is not...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19798 Merged to master
[GitHub] spark issue #19518: [SPARK-18016][SQL][CATALYST] Code Generation: Constant P...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19518 > I think ldc is 2 bytes and ldc_w is 3 bytes? You are right, thanks, updated.
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19816 Merged build finished. Test PASSed.
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19816 **[Test build #84184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84184/testReport)** for PR 19816 at commit [`41ffb54`](https://github.com/apache/spark/commit/41ffb54d391009e9f34b92af1f7686ff544e8b46). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19816 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84184/ Test PASSed.
[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19819 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84183/ Test PASSed.
[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19819 Merged build finished. Test PASSed.
[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19819 **[Test build #84183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84183/testReport)** for PR 19819 at commit [`aa02d89`](https://github.com/apache/spark/commit/aa02d8904fcbaa91df47ac224d90345bd555a372). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/18906 Thanks for the background Bryan :) So it sounds like from an API perspective it makes sense to support this in the future, possibly on the Pandas UDFs (but the code isn't there on the JVM side). I'd say if @ptkool has the time, it might make sense to match the Scala API on the current UDFs; it's easier when we want to add this to the Pandas UDFs.
[GitHub] spark issue #19816: [SPARK-21693][R][FOLLOWUP] Reduce shuffle partitions run...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19816 **[Test build #84184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84184/testReport)** for PR 19816 at commit [`41ffb54`](https://github.com/apache/spark/commit/41ffb54d391009e9f34b92af1f7686ff544e8b46).
[GitHub] spark issue #19816: [SPARK-21693][FOLLOWUP][R] Reduce shuffle partitions run...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19816 Not sure. Let me know if you have a preference @felixcheung.
[GitHub] spark issue #19816: [SPARK-21693][FOLLOWUP][R] Reduce shuffle partitions run...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19816 @felixcheung, I just tried to lower this by default and ran the tests. It seems some tests fail. For example, if we lower `spark.sql.shuffle.partitions` to 5, these fail additionally:

```
Failed -
1. Failure: spark.als (@test_mllib_recommendation.R#36)
predictions$prediction not equal to c(-0.1380762, 2.6258414, -1.5018409).
3/3 mismatches (average diff: 2.75)
[1] 2.626 - -0.138 == 2.76
[2] -1.502 - 2.626 == -4.13
[3] -0.138 - -1.502 == 1.36

2. Failure: pivot GroupedData column (@test_sparkSQL.R#1921) ---
`sum1` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004961548
Component “Python”: Mean relative difference: 0.0952381
Component “R”: Mean relative difference: 0.5454545

3. Failure: pivot GroupedData column (@test_sparkSQL.R#1922) ---
`sum2` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004961548
Component “Python”: Mean relative difference: 0.0952381
Component “R”: Mean relative difference: 0.5454545

4. Failure: pivot GroupedData column (@test_sparkSQL.R#1923) ---
`sum3` not equal to `correct_answer`.
Component “year”: Mean relative difference: 0.0004961548
Component “Python”: Mean relative difference: 0.0952381
Component “R”: Mean relative difference: 0.5454545

5. Failure: pivot GroupedData column (@test_sparkSQL.R#1924) ---
`sum4` not equal to correct_answer[, c("year", "R")].
Component “year”: Mean relative difference: 0.0004961548
Component “R”: Mean relative difference: 0.5454545
```

Shuffle + R worker cases look not quite frequent (to be clear, shuffle alone without R will be fine IIUC). I don't have a strong opinion on lowering: if we don't lower it, some tests in the future could cause such a problem again; if we do lower it, the required change looks quite large and this case might not be frequent.
[GitHub] spark issue #19819: [SPARK-22606][Streaming]Add threadId to the CachedKafkaC...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19819 **[Test build #84183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84183/testReport)** for PR 19819 at commit [`aa02d89`](https://github.com/apache/spark/commit/aa02d8904fcbaa91df47ac224d90345bd555a372).
[GitHub] spark pull request #19819: [SPARK-22606][Streaming]Add threadId to the Cache...
GitHub user eatoncys opened a pull request: https://github.com/apache/spark/pull/19819 [SPARK-22606][Streaming]Add threadId to the CachedKafkaConsumer key

## What changes were proposed in this pull request?
If the value of param 'spark.streaming.concurrentJobs' is more than one, and the value of param 'spark.executor.cores' is more than one, there may be two or more tasks in one executor using the same Kafka consumer at the same time, which will throw an exception: "KafkaConsumer is not safe for multi-threaded access". For example, with spark.streaming.concurrentJobs=2, spark.executor.cores=2 and spark.cores.max=2, if there is only one topic with one partition ('topic1', 0) to consume, there will be two jobs running at the same time, and they will use the same cache key ('groupid', 'topic1', 0) to get the CachedKafkaConsumer from the cache list of `private var cache: ju.LinkedHashMap[CacheKey, CachedKafkaConsumer[_, _]]`, so they will get the same CachedKafkaConsumer. This PR adds threadId to the CachedKafkaConsumer key to prevent two threads from using a consumer at the same time.

## How was this patch tested?
Existing unit tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/eatoncys/spark kafka Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19819.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19819 commit aa02d8904fcbaa91df47ac224d90345bd555a372 Author: 10129659 Date: 2017-11-25T08:15:17Z Add threadId to CachedKafkaConsumer key
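A minimal sketch of the keying change this description proposes; the field names follow the CacheKey fields mentioned above and the helper is hypothetical:

```scala
// Hypothetical sketch: including the calling thread's id in the cache key
// means two jobs running concurrently in one executor can never share a
// cached KafkaConsumer, which is not safe for multi-threaded access.
case class CacheKey(groupId: String, topic: String, partition: Int, threadId: Long)

def cacheKeyFor(groupId: String, topic: String, partition: Int): CacheKey =
  CacheKey(groupId, topic, partition, Thread.currentThread().getId)
```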
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84182/testReport)** for PR 19813 at commit [`65d07d5`](https://github.com/apache/spark/commit/65d07d525344e1d00457d2f538b2ef0b1c38a8e8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84182/ Test FAILed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test FAILed.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813 > However for whole stage codegen path, it's way more complex: > 1. some of ctx.currentVars are just variables, their codes have already been generated before. But some are not. For those whose codes are not generated, they are not valid inputs. > 2. ctx.currentVars is not null but has null slots, and ctx.INPUT_ROW is not null. Then both ctx.currentVars and ctx.INPUT_ROW are valid inputs. Yes, this is correct. So, for 1, we don't include the variables that are not evaluated yet as parameters. For 2, null slots in ctx.currentVars won't be included as parameters either; ctx.INPUT_ROW will be included only if it is not null.
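A self-contained toy model of the rule viirya states here, with hypothetical types (in the real codegen an empty code string marks an already-evaluated variable):

```scala
// Toy model only: a currentVars slot is a valid split-function input once its
// code has been emitted (empty code); INPUT_ROW joins the inputs when non-null.
case class ExprVar(name: String, code: String)

def validSplitInputs(currentVars: Seq[ExprVar], inputRow: Option[String]): Seq[String] = {
  val evaluated = Option(currentVars).getOrElse(Nil)
    .filter(v => v != null && v.code.isEmpty) // skip not-yet-evaluated slots
    .map(_.name)
  evaluated ++ inputRow.toSeq // null slots fall back to INPUT_ROW when present
}
```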
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84182/testReport)** for PR 19813 at commit [`65d07d5`](https://github.com/apache/spark/commit/65d07d525344e1d00457d2f538b2ef0b1c38a8e8).
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r153040576 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala --- @@ -236,4 +237,22 @@ class WholeStageCodegenSuite extends QueryTest with SharedSQLContext { } } } + + test("SPARK-22551: Fix 64kb limit for deeply nested expressions under wholestage codegen") { +import testImplicits._ +withTempPath { dir => + val path = dir.getCanonicalPath + val df = Seq(("abc", 1)).toDF("key", "int") + df.write.parquet(path) + + var strExpr: Expression = col("key").expr + for (_ <- 1 to 150) { +strExpr = Decode(Encode(strExpr, Literal("utf-8")), Literal("utf-8")) + } + val expressions = Seq(If(EqualTo(strExpr, strExpr), strExpr, strExpr)) + + val df2 = spark.read.parquet(path).select(expressions.map(Column(_)): _*) --- End diff -- Ok. Added an assert to make sure of this.
[GitHub] spark issue #19817: [SPARK-22603][SQL] Fix 64KB JVM bytecode limit problem w...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19817 LGTM