[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21625

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197709175

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

Anyway, I updated the results by applying #21631.

---
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197679212

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

@HyukjinKwon I'm fixing this now, but this bug seems similar to SPARK-24645. Would it be better to merge this fix with SPARK-24645?

---
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197679077

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

Yeah, I thought I would do so first, but I couldn't because I hit another bug when column pruning is disabled:
```
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
    ...
```

---
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197678386

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

@maropu, if the JIRA blocks this PR, please feel free to set the configuration to false and proceed. Technically, it looks like that's what the benchmark originally covered at the time it was merged. Setting it to true can be done separately in the JIRA you opened.

---
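One way the suggestion above could be applied is to turn the flag off when the session is built. This is a minimal sketch, assuming spark-sql is on the classpath; the object name, the app name, and the `buildSession` helper are hypothetical, and only the configuration key comes from this thread. Another message in this thread reports a separate `ClassCastException` with the flag disabled, so this shows only how the flag is set, not a confirmed workaround.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; only the config key is taken from the thread.
object ColumnPruningOff {
  // Key quoted from the spark-shell invocation in this thread.
  val PruningKey = "spark.sql.csv.parser.columnPruning.enabled"

  // Builds (or reuses) a single-core local session with CSV column pruning off.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .master("local[1]")
      .appName("ColumnPruningOff") // hypothetical app name
      .config(PruningKey, "false")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    // Print the effective value so the setting can be checked before benchmarking.
    println(s"$PruningKey = ${spark.conf.get(PruningKey)}")
    spark.stop()
  }
}
```

The same key can also be passed on the command line with `--conf`, as the quoted spark-shell session does.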
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197676313

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

I filed a JIRA: https://issues.apache.org/jira/browse/SPARK-24645

---
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197676056

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

Oh, I hit a bug in CSV parsing when updating this benchmark:
```
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
    at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
    ...
```

---
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197652431

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -39,9 +39,11 @@ import org.apache.spark.util.{Benchmark, Utils}
 object DataSourceReadBenchmark {
   val conf = new SparkConf()
     .setAppName("DataSourceReadBenchmark")
-    .setIfMissing("spark.master", "local[1]")
+    // Since `spark.master` always exists, overrides this value
+    .set("spark.master", "local[1]")
--- End diff --

Thank you for fixing this and updating the result, @maropu.

---
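The change from `setIfMissing` to `set` matters because `setIfMissing` leaves a key untouched when it is already present. The sketch below illustrates that difference, assuming spark-core is on the classpath. The object name and the preset value `local[4]` are hypothetical stand-ins for whatever master the launcher supplied, not something taken from the PR.

```scala
import org.apache.spark.SparkConf

// Hypothetical demo object; not part of DataSourceReadBenchmark.
object SetVsSetIfMissing {
  // loadDefaults = false keeps JVM system properties out of the illustration.
  // "local[4]" is an assumed, pre-existing master value.
  val viaSetIfMissing: SparkConf = new SparkConf(false)
    .set("spark.master", "local[4]")
    // No effect here: the key already exists, so the earlier value survives.
    .setIfMissing("spark.master", "local[1]")

  val viaSet: SparkConf = new SparkConf(false)
    .set("spark.master", "local[4]")
    // Unconditionally overrides, pinning the benchmark to a single core.
    .set("spark.master", "local[1]")

  def main(args: Array[String]): Unit = {
    println(viaSetIfMissing.get("spark.master")) // prints local[4]
    println(viaSet.get("spark.master"))          // prints local[1]
  }
}
```

If `spark.master` is always populated before the benchmark's `SparkConf` is built, as the code comment in the diff states, only the `set` variant guarantees `local[1]`.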
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197628635

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

Oh, thanks. I'll update soon.

---
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197627610

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }

-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
-
--- End diff --

It seems this part was missed in the update.

---
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21625

[SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBenchmark benchmark results

## What changes were proposed in this pull request?
This PR corrects the default configuration (`spark.master=local[1]`) for benchmarks. It also updates the performance results, measured on an AWS `r3.xlarge` instance.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark FixDataSourceReadBenchmark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21625.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21625

commit 23528200f833f236a83d6b891388b6ec698bcac7
Author: Takeshi Yamamuro
Date: 2018-06-16T01:48:15Z

Fix

---