[GitHub] spark issue #17242: [SPARK-19902][SQL] Add optimization rule to simplify exp...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17242 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17330: [SPARK-19993][SQL] Caching logical plans containi...
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/17330

[SPARK-19993][SQL] Caching logical plans containing subquery expressions does not work.

## What changes were proposed in this pull request?

The sameResult() method does not work when the logical plan contains subquery expressions.

Before the fix

```SQL
scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
ds: org.apache.spark.sql.DataFrame = [c1: int]

scala> ds.cache
res13: ds.type = [c1: int]

scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
== Analyzed Logical Plan ==
c1: int
Project [c1#86]
+- Filter c1#86 IN (list#78 [c1#86])
   :  +- Project [c1#87]
   :     +- Filter (outer(c1#86) = c1#87)
   :        +- SubqueryAlias s2
   :           +- Relation[c1#87] parquet
   +- SubqueryAlias s1
      +- Relation[c1#86] parquet

== Optimized Logical Plan ==
Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
:- Relation[c1#86] parquet
+- Relation[c1#87] parquet
```

Plan after fix

```SQL
== Analyzed Logical Plan ==
c1: int
Project [c1#22]
+- Filter c1#22 IN (list#14 [c1#22])
   :  +- Project [c1#23]
   :     +- Filter (outer(c1#22) = c1#23)
   :        +- SubqueryAlias s2
   :           +- Relation[c1#23] parquet
   +- SubqueryAlias s1
      +- Relation[c1#22] parquet

== Optimized Logical Plan ==
InMemoryRelation [c1#22], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *BroadcastHashJoin [c1#1, c1#1], [c1#2, c1#2], LeftSemi, BuildRight
      :- *FileScan parquet default.s1[c1#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
      +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft(cast(input[0, int, true] as bigint), 32) | (cast(input[0, int, true] as bigint) & 4294967295
         +- *FileScan parquet default.s2[c1#2] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct
```

## How was this patch tested?

New tests are added to CachedTableSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark subquery_cache_final

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17330.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17330

commit dfa76dafdfaabee7240adbaaacb101e484418013
Author: Dilip Biswal
Date: 2017-03-03T10:12:56Z

    [SPARK-19993] Caching logical plans containing subquery expressions does not work
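
The two analyzed plans above differ only in their expression IDs (c1#86/c1#87 versus c1#22/c1#23), so a cache lookup has to compare plans modulo those IDs, including the IDs inside the subquery. The following is a minimal toy sketch of that idea, not Spark's Catalyst code: `Attr`, `Plan`, `normalize`, `sameResult` and `query` are all hypothetical names invented for illustration, and the assumption that the failure comes from un-normalized IDs inside the subquery is this sketch's reading of the description, not something the pull request text states.

```scala
// Toy model only: none of these types are Spark classes.
object SameResultSketch {
  // An attribute such as c1#86: a name plus a per-query expression ID.
  final case class Attr(name: String, exprId: Int)

  sealed trait Plan
  final case class Relation(table: String, output: Attr) extends Plan
  final case class Project(attr: Attr, child: Plan) extends Plan
  final case class FilterEq(left: Attr, right: Attr, child: Plan) extends Plan
  // "attr IN (subquery)"; the subquery may reference attributes of the outer plan.
  final case class FilterIn(attr: Attr, subquery: Plan, child: Plan) extends Plan

  // Renumber expression IDs by order of first appearance, visiting children
  // first and descending into the subquery plan as well.
  def normalize(plan: Plan): Plan = {
    val ids = scala.collection.mutable.Map.empty[Int, Int]
    def norm(a: Attr): Attr = a.copy(exprId = ids.getOrElseUpdate(a.exprId, ids.size))
    def go(p: Plan): Plan = p match {
      case Relation(t, out) => Relation(t, norm(out))
      case Project(a, child) =>
        val c = go(child)
        Project(norm(a), c)
      case FilterEq(l, r, child) =>
        val c = go(child)
        FilterEq(norm(l), norm(r), c)
      case FilterIn(a, sub, child) =>
        val c = go(child)
        val s = go(sub)
        FilterIn(norm(a), s, c)
    }
    go(plan)
  }

  // Two plans produce the same result (in this toy model) if they are equal
  // after their expression IDs have been normalized.
  def sameResult(a: Plan, b: Plan): Boolean = normalize(a) == normalize(b)

  // Shape of the query in the email: select * from s1 where c1 in (correlated subquery on s2).
  def query(outerId: Int, innerId: Int): Plan = {
    val outer = Attr("c1", outerId)
    val inner = Attr("c1", innerId)
    val sub = Project(inner, FilterEq(outer, inner, Relation("s2", inner)))
    Project(outer, FilterIn(outer, sub, Relation("s1", outer)))
  }
}
```

With this model, `SameResultSketch.query(86, 87)` and `SameResultSketch.query(22, 23)` are unequal as raw trees but compare as the same result once normalized, which is the property a cache lookup needs.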
[GitHub] spark issue #17242: [SPARK-19902][SQL] Add optimization rule to simplify exp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17242 Merged build finished. Test FAILed.
[GitHub] spark issue #17242: [SPARK-19902][SQL] Add optimization rule to simplify exp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17242 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74725/ Test FAILed.
[GitHub] spark issue #17242: [SPARK-19902][SQL] Add optimization rule to simplify exp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17242 **[Test build #74725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74725/testReport)** for PR 17242 at commit [`93b83ef`](https://github.com/apache/spark/commit/93b83ef0b15c453adddc459f57cccb36269e4e08). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17315: [SPARK-19949][SQL] unify bad record handling in CSV and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17315 **[Test build #74728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74728/testReport)** for PR 17315 at commit [`10e70fe`](https://github.com/apache/spark/commit/10e70fefd13d744592807cbae4ba712f07f5debe).
[GitHub] spark issue #17308: [SPARK-19968][SS] Use a cached instance of `KafkaProduce...
Github user ScrapCodes commented on the issue: https://github.com/apache/spark/pull/17308 Please take a look, @tcondie @zsxwing !
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17329 **[Test build #74727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74727/testReport)** for PR 17329 at commit [`e0a3b62`](https://github.com/apache/spark/commit/e0a3b6286f8d1171d921fd1f90d88ca825b12b56).
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17320 **[Test build #74726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74726/testReport)** for PR 17320 at commit [`ce39a9d`](https://github.com/apache/spark/commit/ce39a9dae6d322d0b800b260b9a4822d9e0e1f1d).
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17320 Jenkins, retest this please.
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106588563 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -73,55 +86,219 @@ private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) e } def putBytes(blockId: BlockId, bytes: ChunkedByteBuffer): Unit = { -put(blockId) { fileOutputStream => - val channel = fileOutputStream.getChannel - Utils.tryWithSafeFinally { -bytes.writeFully(channel) - } { -channel.close() - } +put(blockId) { channel => + bytes.writeFully(channel) } } - def getBytes(blockId: BlockId): ChunkedByteBuffer = { + def getBytes(blockId: BlockId): BlockData = { val file = diskManager.getFile(blockId.name) -val channel = new RandomAccessFile(file, "r").getChannel -Utils.tryWithSafeFinally { - // For small files, directly read rather than memory map - if (file.length < minMemoryMapBytes) { -val buf = ByteBuffer.allocate(file.length.toInt) -channel.position(0) -while (buf.remaining() != 0) { - if (channel.read(buf) == -1) { -throw new IOException("Reached EOF before filling buffer\n" + - s"offset=0\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}") +val blockSize = getSize(blockId) + +securityManager.getIOEncryptionKey() match { + case Some(key) => +// Encrypted blocks cannot be memory mapped; return a special object that does decryption +// and provides InputStream / FileRegion implementations for reading the data. +new EncryptedBlockData(file, blockSize, conf, key) + + case _ => +val channel = new FileInputStream(file).getChannel() +if (blockSize < minMemoryMapBytes) { + // For small files, directly read rather than memory map. 
+ Utils.tryWithSafeFinally { +val buf = ByteBuffer.allocate(blockSize.toInt) +while (buf.remaining() > 0) { + channel.read(buf) +} +buf.flip() +new ByteBufferBlockData(new ChunkedByteBuffer(buf)) + } { +channel.close() + } +} else { + Utils.tryWithSafeFinally { +new ByteBufferBlockData( + new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length))) + } { +channel.close() } } -buf.flip() -new ChunkedByteBuffer(buf) - } else { -new ChunkedByteBuffer(channel.map(MapMode.READ_ONLY, 0, file.length)) - } -} { - channel.close() } } def remove(blockId: BlockId): Boolean = { val file = diskManager.getFile(blockId.name) -if (file.exists()) { - val ret = file.delete() - if (!ret) { -logWarning(s"Error deleting ${file.getPath()}") +val meta = diskManager.getMetadataFile(blockId) + +def delete(f: File): Boolean = { + if (f.exists()) { +val ret = f.delete() +if (!ret) { + logWarning(s"Error deleting ${file.getPath()}") +} + +ret + } else { +false } - ret -} else { - false } + +delete(file) & delete(meta) } def contains(blockId: BlockId): Boolean = { val file = diskManager.getFile(blockId.name) file.exists() } + + private def openForWrite(file: File): WritableByteChannel = { +val out = new FileOutputStream(file).getChannel() +try { + securityManager.getIOEncryptionKey().map { key => +CryptoStreamUtils.createWritableChannel(out, conf, key) + }.getOrElse(out) +} catch { + case e: Exception => +out.close() +throw e +} + } + +} + +private class EncryptedBlockData( +file: File, +blockSize: Long, +conf: SparkConf, +key: Array[Byte]) extends BlockData { + + override def toInputStream(): InputStream = Channels.newInputStream(open()) + + override def toManagedBuffer(): ManagedBuffer = new EncryptedManagedBuffer() + + override def toByteBuffer(allocator: Int => ByteBuffer): ChunkedByteBuffer = { +val source = open() +try { + var remaining = blockSize + val chunks = new ListBuffer[ByteBuffer]() + while (remaining > 0) { +val chunkSize = math.min(remaining, Int.MaxValue) +val chunk 
= allocator(chun
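
The unencrypted branch of `getBytes` in the diff above reads small blocks into a heap buffer and memory-maps larger ones. Below is a minimal standalone sketch of that pattern using only `java.nio`; `DiskReadSketch` and `readBlock` are hypothetical names, not Spark's `DiskStore` API, and the threshold is passed in rather than read from configuration. Unlike the read loop in the diff, this sketch throws if the channel reports EOF before the buffer is full; that check is a choice made here, not part of the quoted change.

```scala
import java.io.{File, FileInputStream, IOException}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

// Hypothetical helper for illustration; not part of Spark.
object DiskReadSketch {
  // Assumes minMemoryMapBytes <= Int.MaxValue so that small files fit in one heap buffer.
  def readBlock(file: File, minMemoryMapBytes: Long): ByteBuffer = {
    val channel = new FileInputStream(file).getChannel
    try {
      val size = channel.size()
      if (size < minMemoryMapBytes) {
        // Small file: copy it into a heap buffer instead of memory-mapping it.
        val buf = ByteBuffer.allocate(size.toInt)
        while (buf.remaining() > 0) {
          if (channel.read(buf) == -1) {
            throw new IOException(
              s"Reached EOF before filling buffer: file=${file.getAbsolutePath}")
          }
        }
        buf.flip()
        buf
      } else {
        // Large file: map it read-only; the mapping stays valid after the channel is closed.
        channel.map(MapMode.READ_ONLY, 0, size)
      }
    } finally {
      channel.close()
    }
  }
}
```

Either branch returns a buffer positioned at the start of the block, so callers can treat the two cases uniformly.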
[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16209 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74721/ Test FAILed.
[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16209 Merged build finished. Test FAILed.
[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16209 **[Test build #74721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74721/testReport)** for PR 16209 at commit [`e76b7e0`](https://github.com/apache/spark/commit/e76b7e0b6fab0adf30f2be7ea7be50298196ac72). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17295#discussion_r106587687

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -1235,7 +1251,7 @@ private[spark] class BlockManager(
 peer.port,
 peer.executorId,
 blockId,
- new NettyManagedBuffer(data.toNetty),
+ new BlockManagerManagedBuffer(blockInfoManager, blockId, data.toManagedBuffer()),
--- End diff --

why this change?
[GitHub] spark issue #17242: [SPARK-19902][SQL] Add optimization rule to simplify exp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17242 **[Test build #74725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74725/testReport)** for PR 17242 at commit [`93b83ef`](https://github.com/apache/spark/commit/93b83ef0b15c453adddc459f57cccb36269e4e08).
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17329 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74715/ Test FAILed.
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587332 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. 
+#' @param predictionCol Prediction column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_features", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_features, ' ') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted", +#' numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction", + numPartitions = -1) { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + 
stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, as.numeric(minSupport), as.numeric(minConfidence), +featuresCol, predictionCol, as.integer(numPartitions)) +new("FPGrowthModel", jobj = jobj) + }) +
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17329 Merged build finished. Test FAILed.
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587261 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. 
+#' @param predictionCol Prediction column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_features", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_features, ' ') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted", +#' numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction", + numPartitions = -1) { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + 
stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, as.numeric(minSupport), as.numeric(minConfidence), +featuresCol, predictionCol, as.integer(numPartitions)) +new("FPGrowthModel", jobj = jobj) + }) +
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587357 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. 
+#' @param predictionCol Prediction column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_features", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_features, ' ') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted", +#' numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction", + numPartitions = -1) { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + 
stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, as.numeric(minSupport), as.numeric(minConfidence), +featuresCol, predictionCol, as.integer(numPartitions)) +new("FPGrowthModel", jobj = jobj) + }) +
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587496 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/FPGrowthWrapper.scala --- @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FPGrowthWrapper private (val fpGrowthModel: FPGrowthModel) extends MLWritable { + def freqItemsets: DataFrame = fpGrowthModel.freqItemsets + def associationRules: DataFrame = fpGrowthModel.associationRules + + def transform(dataset: Dataset[_]): DataFrame = { +fpGrowthModel.transform(dataset) + } + + override def write: MLWriter = new FPGrowthWrapper.FPGrowthWrapperWriter(this) +} + +private[r] object FPGrowthWrapper extends MLReadable[FPGrowthWrapper] { + + def fit( + data: DataFrame, + minSupport: Double, + minConfidence: Double, + featuresCol: String, + predictionCol: String, + numPartitions: Integer): FPGrowthWrapper = { +val fpGrowth = new FPGrowth() + .setMinSupport(minSupport) + .setMinConfidence(minConfidence) + .setPredictionCol(predictionCol) + +if (numPartitions != null && numPartitions > 0) { + fpGrowth.setNumPartitions(numPartitions) +} + +val fpGrowthModel = fpGrowth.fit(data) + +new FPGrowthWrapper(fpGrowthModel) + } + + override def read: MLReader[FPGrowthWrapper] = new FPGrowthWrapperReader + + class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper] { +override def load(path: String): FPGrowthWrapper = { + val modelPath = new Path(path, "model").toString + val fPGrowthModel = FPGrowthModel.load(modelPath) + + new FPGrowthWrapper(fPGrowthModel) +} + } + +class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends MLWriter { --- End diff -- Indentation seems incorrect here and on the line above.
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587413 --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/FPGrowthWrapper.scala --- @@ -0,0 +1,87 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.fpm.{FPGrowth, FPGrowthModel} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FPGrowthWrapper private (val fpGrowthModel: FPGrowthModel) extends MLWritable { + def freqItemsets: DataFrame = fpGrowthModel.freqItemsets + def associationRules: DataFrame = fpGrowthModel.associationRules + + def transform(dataset: Dataset[_]): DataFrame = { +fpGrowthModel.transform(dataset) + } + + override def write: MLWriter = new FPGrowthWrapper.FPGrowthWrapperWriter(this) +} + +private[r] object FPGrowthWrapper extends MLReadable[FPGrowthWrapper] { + + def fit( + data: DataFrame, --- End diff -- alignment
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587130 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. 
+#' @param predictionCol Prediction column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' --- End diff -- Other APIs do not have a blank line here. I think we should be consistent.
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587054 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. PFP distributes computation in such a way that each worker executes an --- End diff -- This line seems to exceed the length limit.
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587315 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. 
+#' @param predictionCol Prediction column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_features", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_features, ' ') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted", +#' numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction", + numPartitions = -1) { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + 
stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, as.numeric(minSupport), as.numeric(minConfidence), +featuresCol, predictionCol, as.integer(numPartitions)) +new("FPGrowthModel", jobj = jobj) + }) +
[GitHub] spark pull request #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/17170#discussion_r106587292 --- Diff: R/pkg/R/mllib_fpm.R --- @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# mllib_fpm.R: Provides methods for MLlib frequent pattern mining algorithms integration + +#' S4 class that represents a FPGrowthModel +#' +#' @param jobj a Java object reference to the backing Scala FPGrowthModel +#' @export +#' @note FPGrowthModel since 2.2.0 +setClass("FPGrowthModel", slots = list(jobj = "jobj")) + +#' FPGrowth +#' +#' A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in +#' Li et al., PFP: Parallel FP-Growth for Query +#' Recommendation <\url{http://dx.doi.org/10.1145/1454008.1454027}>. PFP distributes computation in such a way that each worker executes an +#' independent group of mining tasks. The FP-Growth algorithm is described in +#' Han et al., Mining frequent patterns without +#' candidate generation <\url{http://dx.doi.org/10.1145/335191.335372}>. +#' +#' @param data A SparkDataFrame for training. +#' @param minSupport Minimal support level. +#' @param minConfidence Minimal confidence level. +#' @param featuresCol Features column name. 
+#' @param predictionCol Prediction column name. +#' @param numPartitions Number of partitions used for fitting. +#' @param ... additional argument(s) passed to the method. +#' @return \code{spark.fpGrowth} returns a fitted FPGrowth model. +#' +#' @rdname spark.fpGrowth +#' @name spark.fpGrowth +#' @aliases spark.fpGrowth,SparkDataFrame-method +#' @export +#' @examples +#' \dontrun{ +#' raw_data <- read.df( +#' "data/mllib/sample_fpgrowth.txt", +#' source = "csv", +#' schema = structType(structField("raw_features", "string"))) +#' +#' data <- selectExpr(raw_data, "split(raw_features, ' ') as features") +#' model <- spark.fpGrowth(data) +#' +#' # Show frequent itemsets +#' frequent_itemsets <- spark.freqItemsets(model) +#' showDF(frequent_itemsets) +#' +#' # Show association rules +#' association_rules <- spark.associationRules(model) +#' showDF(association_rules) +#' +#' # Predict on new data +#' new_itemsets <- data.frame(features = c("t", "t,s")) +#' new_data <- selectExpr(createDataFrame(new_itemsets), "split(features, ',') as features") +#' predict(model, new_data) +#' +#' # Save and load model +#' path <- "/path/to/model" +#' write.ml(model, path) +#' read.ml(path) +#' +#' # Optional arguments +#' baskets_data <- selectExpr(createDataFrame(itemsets), "split(features, ',') as baskets") +#' another_model <- spark.fpGrowth(data, minSupport = 0.1, minConfidence = 0.5 +#' featureCol = "baskets", predictionCol = "predicted", +#' numPartitions = 10) +#' } +#' @references \url{http://en.wikipedia.org/wiki/Association_rule_learning} +#' @note spark.fpGrowth since 2.2.0 +setMethod("spark.fpGrowth", signature(data = "SparkDataFrame"), + function(data, minSupport = 0.3, minConfidence = 0.8, + featuresCol = "features", predictionCol = "prediction", + numPartitions = -1) { +if (!is.numeric(minSupport) || minSupport < 0 || minSupport > 1) { + stop("minSupport should be a number [0, 1].") +} +if (!is.numeric(minConfidence) || minConfidence < 0 || minConfidence > 1) { + 
stop("minConfidence should be a number [0, 1].") +} + +jobj <- callJStatic("org.apache.spark.ml.r.FPGrowthWrapper", "fit", +data@sdf, as.numeric(minSupport), as.numeric(minConfidence), +featuresCol, predictionCol, as.integer(numPartitions)) +new("FPGrowthModel", jobj = jobj) + }) +
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17329 **[Test build #74715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74715/testReport)** for PR 17329 at commit [`dc5fd8d`](https://github.com/apache/spark/commit/dc5fd8dda4717b485d4c4b2dfcdc5d115abf811c). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106587428 --- Diff: core/src/main/scala/org/apache/spark/security/CryptoStreamUtils.scala --- @@ -102,4 +150,34 @@ private[spark] object CryptoStreamUtils extends Logging { } iv } + + /** + * This class is a workaround for CRYPTO-125, that forces all bytes to be written to the + * underlying channel. Since the callers of this API are using blocking I/O, there are no + * concerns with regards to CPU usage here. --- End diff -- Is this a separate bug fix?
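For readers following the thread, the quoted doc comment describes a channel that keeps writing until every byte has reached the underlying channel. The class itself is not shown in the diff, so the following is only a minimal sketch of that idea; the name `WriteFullyChannel` is hypothetical and is not taken from the pull request.

```scala
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// Hypothetical illustration (not the PR's class): a wrapper that retries
// write() until the source buffer is fully drained into the wrapped channel.
class WriteFullyChannel(sink: WritableByteChannel) extends WritableByteChannel {

  override def write(src: ByteBuffer): Int = {
    val total = src.remaining()
    // A single write() may consume only part of the buffer, so loop until
    // nothing is left. With blocking I/O each call waits rather than spins.
    while (src.hasRemaining) {
      sink.write(src)
    }
    total
  }

  override def isOpen: Boolean = sink.isOpen

  override def close(): Unit = sink.close()
}
```

The loop only makes sense when the wrapped channel is blocking, which is the condition the doc comment states; on a non-blocking channel the same loop could busy-wait.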
[GitHub] spark pull request #17295: [SPARK-19556][core] Do not encrypt block manager ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17295#discussion_r106587322 --- Diff: core/src/main/scala/org/apache/spark/security/CryptoStreamUtils.scala --- @@ -63,12 +83,40 @@ private[spark] object CryptoStreamUtils extends Logging { is: InputStream, sparkConf: SparkConf, key: Array[Byte]): InputStream = { -val properties = toCryptoConf(sparkConf) val iv = new Array[Byte](IV_LENGTH_IN_BYTES) -is.read(iv, 0, iv.length) -val transformationStr = sparkConf.get(IO_CRYPTO_CIPHER_TRANSFORMATION) -new CryptoInputStream(transformationStr, properties, is, - new SecretKeySpec(key, "AES"), new IvParameterSpec(iv)) +var read = 0 +while (read < iv.length) { --- End diff -- What does this while loop do?
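The quoted diff is cut off at the loop header, so the loop body is not visible here. One plausible reading, offered only as an assumption, is that it replaces the single `is.read(iv, 0, iv.length)` call with a read-until-full loop, because `InputStream.read` may return fewer bytes than requested. A minimal sketch of that pattern, with the hypothetical names `ReadFullySketch` and `OneByteAtATime`:

```scala
import java.io.{EOFException, InputStream}

// Hypothetical helper (not the PR's code): read exactly `len` bytes or fail.
object ReadFullySketch {
  def readFully(in: InputStream, len: Int): Array[Byte] = {
    val buf = new Array[Byte](len)
    var read = 0
    while (read < len) {
      // read() may deliver fewer bytes than requested; keep asking for the rest.
      val n = in.read(buf, read, len - read)
      if (n < 0) {
        throw new EOFException(s"Stream ended after $read of $len bytes")
      }
      read += n
    }
    buf
  }
}

// Test stream that hands out a single byte per bulk read, to exercise the loop.
class OneByteAtATime(data: Array[Byte]) extends InputStream {
  private var pos = 0

  override def read(): Int =
    if (pos < data.length) { val b = data(pos) & 0xff; pos += 1; b } else -1

  override def read(b: Array[Byte], off: Int, len: Int): Int =
    if (len == 0) 0
    else {
      val v = read()
      if (v < 0) -1 else { b(off) = v.toByte; 1 }
    }
}
```

With a stream like `OneByteAtATime`, a single `read(iv, 0, iv.length)` would fill only the first byte of the IV, which is the failure mode such a loop guards against.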
[GitHub] spark issue #17242: [SPARK-19902][SQL] Add optimization rule to simplify exp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17242 **[Test build #74722 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74722/testReport)** for PR 17242 at commit [`f4e771d`](https://github.com/apache/spark/commit/f4e771d85a33ff465d793e74ff4401453eaf0f3b).
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74724/testReport)** for PR 16971 at commit [`ed6dacd`](https://github.com/apache/spark/commit/ed6dacdb3e3bdfd4e9ccb5c57bf8b4118636b0c6).
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17192 **[Test build #74723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74723/testReport)** for PR 17192 at commit [`703a6cb`](https://github.com/apache/spark/commit/703a6cb36ea920e87a3536f16572020c11197345).
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17329 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74714/ Test FAILed.
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17329 Merged build finished. Test FAILed.
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17329 **[Test build #74714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74714/testReport)** for PR 17329 at commit [`abcfc79`](https://github.com/apache/spark/commit/abcfc79991ecd1d5cef2cd1e275b872695ba19d9). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17192 retest this please
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17192 Merged build finished. Test FAILed.
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17192 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74718/ Test FAILed.
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17192 **[Test build #74718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74718/testReport)** for PR 17192 at commit [`8c97406`](https://github.com/apache/spark/commit/8c97406b984ab68b74df2116547c1dbedb675785). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class JsonToStructs(` * `case class StructsToJson(`
[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Merged build finished. Test PASSed.
[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74710/ Test PASSed.
[GitHub] spark issue #17088: [SPARK-19753][CORE] Un-register all shuffle output on a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17088 **[Test build #74710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74710/testReport)** for PR 17088 at commit [`8787db1`](https://github.com/apache/spark/commit/8787db1679c5b468afa3d2ede64eee53908fa5de). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Merged build finished. Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74717/ Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74717/testReport)** for PR 16971 at commit [`00d67f7`](https://github.com/apache/spark/commit/00d67f71e8c3eb254eabb63a53efdf675689aeb3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17320 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74719/ Test FAILed.
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17320 Merged build finished. Test FAILed.
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17320 **[Test build #74719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74719/testReport)** for PR 17320 at commit [`ce39a9d`](https://github.com/apache/spark/commit/ce39a9dae6d322d0b800b260b9a4822d9e0e1f1d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17192 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74720/ Test FAILed.
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17192 Merged build finished. Test FAILed.
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17192 **[Test build #74720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74720/testReport)** for PR 17192 at commit [`703a6cb`](https://github.com/apache/spark/commit/703a6cb36ea920e87a3536f16572020c11197345). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17302 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74716/ Test FAILed.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17302 Merged build finished. Test FAILed.
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17302 **[Test build #74716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74716/testReport)** for PR 17302 at commit [`43678e7`](https://github.com/apache/spark/commit/43678e793148521b44713b4373d89d8db0bb2e66). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17095: [SPARK-19763][SQL]qualified external datasource table lo...
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/17095 Sounds like this was caused by a different PR (see the comment on the JIRA) and is now being fixed, so never mind here!
[GitHub] spark issue #17295: [SPARK-19556][core] Do not encrypt block manager data in...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17295 Makes sense. One more question: ideally, should we also transfer shuffle blocks after decryption?
[GitHub] spark issue #16209: [WIP][SPARK-10849][SQL] Adds option to the JDBC data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16209 **[Test build #74721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74721/testReport)** for PR 16209 at commit [`e76b7e0`](https://github.com/apache/spark/commit/e76b7e0b6fab0adf30f2be7ea7be50298196ac72).
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17191 okay, I'll recheck the code.
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17191 ok makes sense, let's support it
[GitHub] spark issue #17286: [SPARK-19915][SQL] Exclude cartesian product candidates ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17286 LGTM except for some minor comments
[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106583389
    --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala ---
    @@ -187,6 +220,8 @@ class JoinReorderSuite extends PlanTest with StatsEstimationTestBase {
           case (j1: Join, j2: Join) =>
             (sameJoinPlan(j1.left, j2.left) && sameJoinPlan(j1.right, j2.right)) ||
               (sameJoinPlan(j1.left, j2.right) && sameJoinPlan(j1.right, j2.left))
    +      case _ if plan1.children.nonEmpty && plan2.children.nonEmpty =>
    --- End diff --
when will we hit this branch?
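For readers following the diff above, here is a minimal sketch of the order-insensitive comparison that the `(j1: Join, j2: Join)` case performs. The `Plan`, `Leaf` and `Join` types are hypothetical stand-ins for illustration, not Spark's `LogicalPlan` classes, and the fallback case is an assumption rather than what the test suite does.

```scala
// Hypothetical toy plan tree; not Spark's LogicalPlan hierarchy.
sealed trait Plan
final case class Leaf(name: String) extends Plan
final case class Join(left: Plan, right: Plan) extends Plan

object SameJoinPlanDemo {
  // Two join trees are treated as the same if their children match either
  // in the same order or swapped, checked recursively.
  def sameJoinPlan(p1: Plan, p2: Plan): Boolean = (p1, p2) match {
    case (j1: Join, j2: Join) =>
      (sameJoinPlan(j1.left, j2.left) && sameJoinPlan(j1.right, j2.right)) ||
        (sameJoinPlan(j1.left, j2.right) && sameJoinPlan(j1.right, j2.left))
    // Assumed fallback for this sketch: plain structural equality.
    case _ => p1 == p2
  }

  def main(args: Array[String]): Unit = {
    val ab = Join(Leaf("a"), Leaf("b"))
    val ba = Join(Leaf("b"), Leaf("a"))
    println(sameJoinPlan(ab, ba)) // true: children match when swapped
  }
}
```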
[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106583140
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -710,6 +710,14 @@ object SQLConf {
           .intConf
           .createWithDefault(12)
    +  val JOIN_REORDER_CARD_WEIGHT =
    +    buildConf("spark.sql.cbo.joinReorder.card.weight")
    +      .doc("The weight of cardinality (number of rows) for plan cost comparison in join reorder: " +
    +        "rows * weight + size * (1 - weight).")
    +      .doubleConf
    +      .checkValue(weight => weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
    +      .createWithDefault(0.7)
    --- End diff --
Is it useful to expose this config? I think most users will just disable join reordering if they have problems.
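The formula in the config doc string, `rows * weight + size * (1 - weight)`, can be illustrated with a small sketch. The `Cost` class and its `combined` and `lessThan` methods below are hypothetical and written only from that doc string; they are not claimed to match how the PR or Spark actually compares plan costs.

```scala
// Hypothetical stand-in for the optimizer's cost record; not Spark's class.
final case class Cost(rows: BigInt, size: BigInt) {
  // Combined cost as described in the doc string:
  // rows * weight + size * (1 - weight).
  def combined(weight: Double): BigDecimal = {
    require(weight >= 0 && weight <= 1, "The weight value must be in [0, 1].")
    BigDecimal(rows) * BigDecimal(weight) + BigDecimal(size) * BigDecimal(1 - weight)
  }

  // A plan is cheaper if its combined cost is strictly lower.
  def lessThan(other: Cost, weight: Double): Boolean =
    combined(weight) < other.combined(weight)
}

object WeightedCostDemo {
  def main(args: Array[String]): Unit = {
    val a = Cost(rows = BigInt(100), size = BigInt(10000))
    val b = Cost(rows = BigInt(1000), size = BigInt(2000))
    println(a.lessThan(b, 1.0)) // true: only row counts matter, 100 < 1000
    println(a.lessThan(b, 0.0)) // false: only sizes matter, 10000 > 2000
  }
}
```

The two extremes show why the weight matters: the same pair of plans ranks differently depending on whether cardinality or size dominates.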
[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106583013
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -203,64 +205,46 @@ object JoinReorderDP extends PredicateHelper {
       private def buildJoin(
           oneJoinPlan: JoinPlan,
           otherJoinPlan: JoinPlan,
    -      conf: CatalystConf,
    +      conf: SQLConf,
           conditions: Set[Expression],
    -      topOutput: AttributeSet): JoinPlan = {
    +      topOutput: AttributeSet): Option[JoinPlan] = {
         val onePlan = oneJoinPlan.plan
         val otherPlan = otherJoinPlan.plan
    -    // Now both onePlan and otherPlan become intermediate joins, so the cost of the
    -    // new join should also include their own cardinalities and sizes.
    -    val newCost = if (isCartesianProduct(onePlan) || isCartesianProduct(otherPlan)) {
    -      // We consider cartesian product very expensive, thus set a very large cost for it.
    -      // This enables to plan all the cartesian products at the end, because having a cartesian
    -      // product as an intermediate join will significantly increase a plan's cost, making it
    -      // impossible to be selected as the best plan for the items, unless there's no other choice.
    -      Cost(
    -        rows = BigInt(Long.MaxValue) * BigInt(Long.MaxValue),
    -        size = BigInt(Long.MaxValue) * BigInt(Long.MaxValue))
    -    } else {
    -      val onePlanStats = onePlan.stats(conf)
    -      val otherPlanStats = otherPlan.stats(conf)
    -      Cost(
    -        rows = oneJoinPlan.cost.rows + onePlanStats.rowCount.get +
    -          otherJoinPlan.cost.rows + otherPlanStats.rowCount.get,
    -        size = oneJoinPlan.cost.size + onePlanStats.sizeInBytes +
    -          otherJoinPlan.cost.size + otherPlanStats.sizeInBytes)
    -    }
    -
    -    // Put the deeper side on the left, tend to build a left-deep tree.
    -    val (left, right) = if (oneJoinPlan.itemIds.size >= otherJoinPlan.itemIds.size) {
    -      (onePlan, otherPlan)
    -    } else {
    -      (otherPlan, onePlan)
    -    }
         val joinConds = conditions
           .filterNot(l => canEvaluate(l, onePlan))
           .filterNot(r => canEvaluate(r, otherPlan))
           .filter(e => e.references.subsetOf(onePlan.outputSet ++ otherPlan.outputSet))
    -    // We use inner join whether join condition is empty or not. Since cross join is
    -    // equivalent to inner join without condition.
    -    val newJoin = Join(left, right, Inner, joinConds.reduceOption(And))
    -    val collectedJoinConds = joinConds ++ oneJoinPlan.joinConds ++ otherJoinPlan.joinConds
    -    val remainingConds = conditions -- collectedJoinConds
    -    val neededAttr = AttributeSet(remainingConds.flatMap(_.references)) ++ topOutput
    -    val neededFromNewJoin = newJoin.outputSet.filter(neededAttr.contains)
    -    val newPlan =
    -      if ((newJoin.outputSet -- neededFromNewJoin).nonEmpty) {
    -        Project(neededFromNewJoin.toSeq, newJoin)
    +    if (joinConds.isEmpty) {
    +      // Cartesian product is very expensive, so we exclude them from candidate plans.
    +      // This also significantly reduces the search space.
    --- End diff --
great! now we can safely apply this optimization :)
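The change discussed above makes `buildJoin` return an `Option`, so that a pair of sub-plans with no connecting join condition (a cartesian product) yields no candidate at all. A minimal sketch of that idea follows, using a hypothetical toy model in which a plan is the set of table names it covers and a condition is a pair of table names; none of these types or signatures are Spark's.

```scala
// Toy model (assumption for illustration): a plan is the set of tables it
// covers, and a join condition is the pair of tables it references.
final case class JoinPlan(items: Set[String], conds: Set[(String, String)])

object CartesianFilterDemo {
  // Returns None when no condition connects the two sides, so the caller
  // never records a cartesian product as a candidate plan.
  def buildJoin(
      one: JoinPlan,
      other: JoinPlan,
      conditions: Set[(String, String)]): Option[JoinPlan] = {
    val joinConds = conditions.filter { case (l, r) =>
      (one.items(l) && other.items(r)) || (one.items(r) && other.items(l))
    }
    if (joinConds.isEmpty) None
    else Some(JoinPlan(one.items ++ other.items, one.conds ++ other.conds ++ joinConds))
  }

  def main(args: Array[String]): Unit = {
    val conditions = Set(("a", "b"), ("b", "c"))
    val a = JoinPlan(Set("a"), Set.empty)
    val b = JoinPlan(Set("b"), Set.empty)
    val c = JoinPlan(Set("c"), Set.empty)
    println(buildJoin(a, b, conditions).isDefined) // true: a and b share a condition
    println(buildJoin(a, c, conditions).isDefined) // false: nothing links a and c
  }
}
```

In this sketch the `None` result is the case in which a check such as `joinPlan.isDefined` in the calling search loop would be false.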
[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106582845
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -272,26 +256,39 @@ object JoinReorderDP extends PredicateHelper {
        * @param itemIds Set of item ids participating in this partial plan.
        * @param plan The plan tree with the lowest cost for these items found so far.
        * @param joinConds Join conditions included in the plan.
    -   * @param cost The cost of this plan is the sum of costs of all intermediate joins.
    +   * @param planCost The cost of this plan tree is the sum of costs of all intermediate joins.
    --- End diff --
I think `cost` is good enough, why rename it?
[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106582668
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -185,11 +184,14 @@ object JoinReorderDP extends PredicateHelper {
                 // Should not join two overlapping item sets.
                 if (oneSidePlan.itemIds.intersect(otherSidePlan.itemIds).isEmpty) {
                   val joinPlan = buildJoin(oneSidePlan, otherSidePlan, conf, conditions, topOutput)
    -              // Check if it's the first plan for the item set, or it's a better plan than
    -              // the existing one due to lower cost.
    -              val existingPlan = nextLevel.get(joinPlan.itemIds)
    -              if (existingPlan.isEmpty || joinPlan.cost.lessThan(existingPlan.get.cost)) {
    -                nextLevel.update(joinPlan.itemIds, joinPlan)
    +              if (joinPlan.isDefined) {
    --- End diff --
when will this condition be false?
[GitHub] spark pull request #17286: [SPARK-19915][SQL] Exclude cartesian product cand...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106582563
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala ---
    @@ -128,38 +131,34 @@ case class CostBasedJoinReorder(conf: CatalystConf) extends Rule[LogicalPlan] wi
     object JoinReorderDP extends PredicateHelper {
       def search(
    -      conf: CatalystConf,
    +      conf: SQLConf,
           items: Seq[LogicalPlan],
           conditions: Set[Expression],
    -      topOutput: AttributeSet): Option[LogicalPlan] = {
    +      topOutput: AttributeSet): LogicalPlan = {
         // Level i maintains all found plans for i + 1 items.
         // Create the initial plans: each plan is a single item with zero cost.
    -    val itemIndex = items.zipWithIndex
    +    val itemIndex = items.zipWithIndex.map(_.swap).toMap
    --- End diff --
looks like an unnecessary change now
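To make the `itemIndex` change in this diff concrete, the sketch below contrasts the two forms using only the Scala standard library. `items.zipWithIndex` yields `(item, index)` pairs, while `.map(_.swap).toMap` turns them into an index-to-item lookup table. The `indexById` helper name is hypothetical and strings stand in for logical plans.

```scala
object ItemIndexDemo {
  // Hypothetical helper: builds an id -> item lookup from a sequence.
  def indexById[A](items: Seq[A]): Map[Int, A] =
    items.zipWithIndex.map(_.swap).toMap

  def main(args: Array[String]): Unit = {
    val items = Seq("t1", "t2", "t3")
    // Before: a sequence of (item, index) pairs.
    val pairs: Seq[(String, Int)] = items.zipWithIndex
    println(pairs) // List((t1,0), (t2,1), (t3,2))
    // After: a map keyed by index.
    println(indexById(items)(1)) // t2
  }
}
```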
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17191 Merged build finished. Test FAILed.
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17191 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74713/ Test FAILed.
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17191 **[Test build #74713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74713/testReport)** for PR 17191 at commit [`5d8c853`](https://github.com/apache/spark/commit/5d8c8532433fc2ebebdf506d636238e6b644b4ae). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17179: [SPARK-19067][SS] Processing-time-based timeout in MapGr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17179 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74707/ Test PASSed.
[GitHub] spark issue #17179: [SPARK-19067][SS] Processing-time-based timeout in MapGr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17179 Merged build finished. Test PASSed.
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17192 **[Test build #74720 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74720/testReport)** for PR 17192 at commit [`703a6cb`](https://github.com/apache/spark/commit/703a6cb36ea920e87a3536f16572020c11197345).
[GitHub] spark issue #17179: [SPARK-19067][SS] Processing-time-based timeout in MapGr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17179 **[Test build #74707 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74707/testReport)** for PR 17179 at commit [`1d0008c`](https://github.com/apache/spark/commit/1d0008cedd3e37832b31b75451eaf7a67ab832f3). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class KeyedStateTimeout `
[GitHub] spark issue #17320: [SPARK-19967][SQL] Add from_json in FunctionRegistry
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17320 **[Test build #74719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74719/testReport)** for PR 17320 at commit [`ce39a9d`](https://github.com/apache/spark/commit/ce39a9dae6d322d0b800b260b9a4822d9e0e1f1d).
[GitHub] spark issue #17192: [SPARK-19849][SQL] Support ArrayType in to_json to produ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17192 **[Test build #74718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74718/testReport)** for PR 17192 at commit [`8c97406`](https://github.com/apache/spark/commit/8c97406b984ab68b74df2116547c1dbedb675785).
[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17216 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74708/ Test PASSed.
[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17216 Merged build finished. Test PASSed.
[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17216 **[Test build #74708 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74708/testReport)** for PR 17216 at commit [`4733b4e`](https://github.com/apache/spark/commit/4733b4e160bff010521319a1aa61e4f7981c65d6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17327: [SPARK-19721][SS][BRANCH-2.1] Good error message for ver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17327 Merged build finished. Test PASSed.
[GitHub] spark issue #17327: [SPARK-19721][SS][BRANCH-2.1] Good error message for ver...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74704/ Test PASSed.
[GitHub] spark issue #17327: [SPARK-19721][SS][BRANCH-2.1] Good error message for ver...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17327 **[Test build #74704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74704/testReport)** for PR 17327 at commit [`daabb27`](https://github.com/apache/spark/commit/daabb27aa32cb19c157e19081f6d08ff368bb42b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74717/testReport)** for PR 16971 at commit [`00d67f7`](https://github.com/apache/spark/commit/00d67f71e8c3eb254eabb63a53efdf675689aeb3).
[GitHub] spark issue #17191: [SPARK-14471][SQL] Aliases in SELECT could be used in GR...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17191

@cloud-fan In the mixed case (an alias and an ordinal together in `GROUP BY`), it seems both PostgreSQL and MySQL support the syntax:

```
// PostgreSQL v9.5
postgres=# \d t2
      Table "public.t2"
 Column |  Type   | Modifiers
--------+---------+-----------
 gkey1  | integer |
 gkey2  | integer |
 value  | integer |

postgres=# select gkey1 AS key1, gkey2, count(value) from t2 group by key1, 2;
 key1 | gkey2 | count
------+-------+-------
    1 |     1 |     1
(1 row)

// MySQL v5.7.13
mysql> SHOW COLUMNS FROM t2;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| gkey1 | int(11) | YES  |     | NULL    |       |
| gkey2 | int(11) | YES  |     | NULL    |       |
| value | int(11) | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> select gkey1 AS key1, gkey2, count(value) from t2 group by key1, 2;
+------+-------+--------------+
| key1 | gkey2 | count(value) |
+------+-------+--------------+
|    1 |     1 |            1 |
+------+-------+--------------+
1 row in set (0.00 sec)
```
[GitHub] spark issue #17302: [SPARK-19959][SQL] Fix to throw NullPointerException in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17302 **[Test build #74716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74716/testReport)** for PR 17302 at commit [`43678e7`](https://github.com/apache/spark/commit/43678e793148521b44713b4373d89d8db0bb2e66).
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17329 **[Test build #74715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74715/testReport)** for PR 17329 at commit [`dc5fd8d`](https://github.com/apache/spark/commit/dc5fd8dda4717b485d4c4b2dfcdc5d115abf811c).
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Merged build finished. Test FAILed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17166 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74702/ Test FAILed.
[GitHub] spark issue #17307: [SPARK-13369] Make number of consecutive fetch failures ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17307 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74699/ Test PASSed.
[GitHub] spark issue #17166: [SPARK-19820] [core] Allow reason to be specified for ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17166 **[Test build #74702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74702/testReport)** for PR 17166 at commit [`8f7ffb3`](https://github.com/apache/spark/commit/8f7ffb395cae9ae7aa24a14dcdb908aaee30b710). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17307: [SPARK-13369] Make number of consecutive fetch failures ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17307 Merged build finished. Test PASSed.
[GitHub] spark issue #17307: [SPARK-13369] Make number of consecutive fetch failures ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17307 **[Test build #74699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74699/testReport)** for PR 17307 at commit [`0f95c8b`](https://github.com/apache/spark/commit/0f95c8b1ad260abb1a64d9cbd25d09a1bafeb1d8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #17216: [SPARK-19873][SS] Record num shuffle partitions in offse...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17216 Does this PR mix in some test file?
[GitHub] spark issue #17329: [SPARK-19991]FileSegmentManagedBuffer performance improv...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17329 **[Test build #74714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74714/testReport)** for PR 17329 at commit [`abcfc79`](https://github.com/apache/spark/commit/abcfc79991ecd1d5cef2cd1e275b872695ba19d9).
[GitHub] spark pull request #17329: [SPARK-19991]FileSegmentManagedBuffer performance...
GitHub user witgo opened a pull request: https://github.com/apache/spark/pull/17329

[SPARK-19991]FileSegmentManagedBuffer performance improvement

FileSegmentManagedBuffer performance improvement.

## What changes were proposed in this pull request?

When the configuration items `spark.storage.memoryMapThreshold` and `spark.shuffle.io.lazyFD` are not set, each call to `FileSegmentManagedBuffer.nioByteBuffer` or `FileSegmentManagedBuffer.createInputStream` creates a `NoSuchElementException` instance, which is a relatively expensive operation.

In the test case below, this PR improves performance by about 3.5%.

The test code:

```scala
(1 to 10).foreach { i =>
  val numPartition = 1
  val rdd = sc.parallelize(0 until numPartition).repartition(numPartition).flatMap { t =>
    (0 until numPartition).map(r => r * numPartition + t)
  }.repartition(numPartition)
  val serializeStart = System.currentTimeMillis()
  rdd.sum()
  val serializeFinish = System.currentTimeMillis()
  println(f"Test $i: ${(serializeFinish - serializeStart) / 1000D}%1.2f")
}
```

and the `spark-defaults.conf` file:

```
spark.master yarn-client
spark.executor.instances 20
spark.driver.memory 64g
spark.executor.memory 30g
spark.executor.cores 5
spark.default.parallelism 100
spark.sql.shuffle.partitions 100
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize 0
spark.ui.enabled false
spark.driver.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=512M
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.blocking.shuffle true
```

The test results are as follows:

| [SPARK-19991](https://github.com/witgo/spark/tree/SPARK-19991) | https://github.com/apache/spark/commit/68ea290b3aa89b2a539d13ea2c18bdb5a651b2bf |
|---|---|
| 226.09 s | 235.21 s |

## How was this patch tested?

Existing tests.
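To illustrate why an exception-driven default lookup can be costly, here is a minimal Java sketch. It is not Spark's code: `ConfigLookupSketch`, `getViaException`, and `getViaCheck` are hypothetical names made up for this example, and the sketch does not claim to show how this PR actually fixes the issue. It only contrasts a lookup that throws and catches `NoSuchElementException` on every miss with one that checks for the missing key directly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a config provider; not Spark's actual class. */
final class ConfigLookupSketch {
    private final Map<String, String> settings = new HashMap<>();

    void set(String key, String value) {
        settings.put(key, value);
    }

    /** Strict getter: throws when the key has not been set. */
    String get(String key) {
        String value = settings.get(key);
        if (value == null) {
            throw new NoSuchElementException(key);
        }
        return value;
    }

    /** Exception-driven default: every miss allocates an exception and fills in its stack trace. */
    String getViaException(String key, String defaultValue) {
        try {
            return get(key);
        } catch (NoSuchElementException e) {
            return defaultValue;
        }
    }

    /** Check-driven default: a miss is an ordinary null check, no exception is created. */
    String getViaCheck(String key, String defaultValue) {
        String value = settings.get(key);
        return value == null ? defaultValue : value;
    }

    public static void main(String[] args) {
        ConfigLookupSketch conf = new ConfigLookupSketch();
        int iterations = 1_000_000;

        // Rough timing only; a proper benchmark would use a harness such as JMH.
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            conf.getViaException("spark.shuffle.io.lazyFD", "true");
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            conf.getViaCheck("spark.shuffle.io.lazyFD", "true");
        }
        long t2 = System.nanoTime();

        System.out.printf("via exception: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("via check:     %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

Both variants return the same values; they differ only in what a miss costs. Reading the setting once and caching it in a field would be another way to avoid the repeated work.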
You can merge this pull request into a Git repository by running: $ git pull https://github.com/witgo/spark SPARK-19991 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17329.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17329 commit abcfc79991ecd1d5cef2cd1e275b872695ba19d9 Author: Guoqiang Li Date: 2017-03-17T03:19:37Z FileSegmentManagedBuffer performance improvement
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74705/ Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16971 Merged build finished. Test FAILed.
[GitHub] spark issue #16971: [SPARK-19573][SQL] Make NaN/null handling consistent in ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16971 **[Test build #74705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74705/testReport)** for PR 16971 at commit [`7bf7db3`](https://github.com/apache/spark/commit/7bf7db3114234dc900a9fd7a9b36615fa0bc1a3f). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16028: [SPARK-18518][ML] HasSolver supports override
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16028 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74709/ Test PASSed.
[GitHub] spark issue #16028: [SPARK-18518][ML] HasSolver supports override
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16028 Merged build finished. Test PASSed.