[spark] branch master updated (b4a6eb6 -> 6d42230)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from b4a6eb6  [SPARK-37164][SQL] Add ExpressionBuilder for functions with complex overloads
   add 6d42230  [SPARK-37168][SQL] Improve error messages for SQL functions and operators under ANSI mode

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   |   6 +-
 .../catalyst/expressions/datetimeExpressions.scala |  90 ++
 .../catalyst/expressions/intervalExpressions.scala |  13 +++-
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |   6 +-
 .../spark/sql/errors/QueryExecutionErrors.scala    |  48 ++--
 .../expressions/StringExpressionsSuite.scala       |   2 +-
 .../resources/sql-tests/results/ansi/array.sql.out |  14 ++--
 .../resources/sql-tests/results/ansi/date.sql.out  |   8 +-
 .../resources/sql-tests/results/ansi/map.sql.out   |   4 +-
 .../sql-tests/results/ansi/timestamp.sql.out       |  20 ++---
 .../sql-tests/results/postgreSQL/date.sql.out      |   6 +-
 .../results/timestampNTZ/timestamp-ansi.sql.out    |  20 ++---
 12 files changed, 155 insertions(+), 82 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (320fa07 -> b4a6eb6)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 320fa07  [SPARK-37159][SQL][TESTS] Change HiveExternalCatalogVersionsSuite to be able to test with Java 17
   add b4a6eb6  [SPARK-37164][SQL] Add ExpressionBuilder for functions with complex overloads

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  20 ++-
 .../catalyst/expressions/stringExpressions.scala   | 163 +++--
 .../spark/sql/errors/QueryCompilationErrors.scala  |   4 +-
 .../expressions/StringExpressionsSuite.scala       |   4 +-
 .../scala/org/apache/spark/sql/functions.scala     |   4 +-
 .../sql-functions/sql-expression-schema.md         |   4 +-
 .../sql-tests/inputs/string-functions.sql          |   8 +-
 .../results/ansi/string-functions.sql.out          |  16 +-
 .../sql-tests/results/string-functions.sql.out     |  28 ++--
 9 files changed, 137 insertions(+), 114 deletions(-)
[spark] branch branch-3.0 updated: [MINOR][DOCS] Corrected spacing in structured streaming programming
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 615e525  [MINOR][DOCS] Corrected spacing in structured streaming programming

615e525 is described below

commit 615e5257887e8e7a0879ccca43bfbe0ebf161f28
Author: mans2singh
AuthorDate: Tue Nov 2 11:01:57 2021 +0900

    [MINOR][DOCS] Corrected spacing in structured streaming programming

    ### What changes were proposed in this pull request?
    There is no space between `with` and `` as shown below:
    `... configured withspark.sql.streaming.fileSource.cleaner.numThreads ...`

    ### Why are the changes needed?
    Added the missing space.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Only documentation was changed; no code was changed.

    Closes #34458 from mans2singh/structured_streaming_programming_guide_space.

    Authored-by: mans2singh
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 675071a38e47dc2c55cf4f71de7ad0bebc1b4f2b)
    Signed-off-by: Hyukjin Kwon
---
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 31b1ca9..84296e0 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -553,7 +553,7 @@ Here are the details of all the sources in Spark.
     For example, suppose you provide '/hello?/spark/*' as source pattern, '/hello1/spark/archive/dir' cannot be used as the value of "sourceArchiveDir", as '/hello?/spark/*' and '/hello1/spark/archive' will be matched. '/hello1/spark' cannot be also used as the value of "sourceArchiveDir", as '/hello?/spark' and '/hello1/spark' will be matched. '/archived/here' would be OK as it doesn't match. Spark will move source files respecting their own path. For example, if the path of source file is /a/b/dataset.txt and the path of archive directory is /archived/here, file will be moved to /archived/here/a/b/dataset.txt.
     NOTE: Both archiving (via moving) or deleting completed files will introduce overhead (slow down, even if it's happening in separate thread) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enabling this option will reduce the cost to list source files which can be an expensive operation.
-    Number of threads used in completed file cleaner can be configured withspark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
+    Number of threads used in completed file cleaner can be configured with spark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
     NOTE 2: The source path should not be used from multiple sources or queries when enabling this option. Similarly, you must ensure the source path doesn't match to any files in output directory of file stream sink.
     NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query. Spark may not clean up some source files in some circumstances - e.g. the application doesn't shut down gracefully, too many files are queued to clean up.
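For context, the `spark.sql.streaming.fileSource.cleaner.numThreads` config named in this doc fix works together with the file source's `cleanSource` option described in the patched guide text. A minimal sketch of that setup (the session name and paths here are hypothetical, for illustration only):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and paths, for illustration only.
val spark = SparkSession.builder().appName("cleaner-example").getOrCreate()

// The cleaner thread pool size is the config whose doc spacing this commit fixes.
spark.conf.set("spark.sql.streaming.fileSource.cleaner.numThreads", "2")

val df = spark.readStream
  .format("text")
  // Archive fully processed files instead of leaving them in place.
  .option("cleanSource", "archive")
  // Must not match the source path's glob (see the caveats quoted in the diff above).
  .option("sourceArchiveDir", "/archived/here")
  .load("/data/in")
```

As the guide notes, archiving happens per micro-batch and is best effort, so the thread count only tunes cleanup throughput; it does not change query semantics.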
[spark] branch branch-3.1 updated: [MINOR][DOCS] Corrected spacing in structured streaming programming
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new e9bead7  [MINOR][DOCS] Corrected spacing in structured streaming programming

e9bead7 is described below

commit e9bead79f8555faa8ba6a3b2ca9925a28022bee9
Author: mans2singh
AuthorDate: Tue Nov 2 11:01:57 2021 +0900

    [MINOR][DOCS] Corrected spacing in structured streaming programming

    ### What changes were proposed in this pull request?
    There is no space between `with` and `` as shown below:
    `... configured withspark.sql.streaming.fileSource.cleaner.numThreads ...`

    ### Why are the changes needed?
    Added the missing space.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Only documentation was changed; no code was changed.

    Closes #34458 from mans2singh/structured_streaming_programming_guide_space.

    Authored-by: mans2singh
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 675071a38e47dc2c55cf4f71de7ad0bebc1b4f2b)
    Signed-off-by: Hyukjin Kwon
---
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index d88cf91b..28d312e 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -553,7 +553,7 @@ Here are the details of all the sources in Spark.
     For example, suppose you provide '/hello?/spark/*' as source pattern, '/hello1/spark/archive/dir' cannot be used as the value of "sourceArchiveDir", as '/hello?/spark/*' and '/hello1/spark/archive' will be matched. '/hello1/spark' cannot be also used as the value of "sourceArchiveDir", as '/hello?/spark' and '/hello1/spark' will be matched. '/archived/here' would be OK as it doesn't match. Spark will move source files respecting their own path. For example, if the path of source file is /a/b/dataset.txt and the path of archive directory is /archived/here, file will be moved to /archived/here/a/b/dataset.txt.
     NOTE: Both archiving (via moving) or deleting completed files will introduce overhead (slow down, even if it's happening in separate thread) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enabling this option will reduce the cost to list source files which can be an expensive operation.
-    Number of threads used in completed file cleaner can be configured withspark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
+    Number of threads used in completed file cleaner can be configured with spark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
     NOTE 2: The source path should not be used from multiple sources or queries when enabling this option. Similarly, you must ensure the source path doesn't match to any files in output directory of file stream sink.
     NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query. Spark may not clean up some source files in some circumstances - e.g. the application doesn't shut down gracefully, too many files are queued to clean up.
[spark] branch branch-3.2 updated: [MINOR][DOCS] Corrected spacing in structured streaming programming
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new a63d2d2  [MINOR][DOCS] Corrected spacing in structured streaming programming

a63d2d2 is described below

commit a63d2d2c31af6180a15c13098a1345523a0712c6
Author: mans2singh
AuthorDate: Tue Nov 2 11:01:57 2021 +0900

    [MINOR][DOCS] Corrected spacing in structured streaming programming

    ### What changes were proposed in this pull request?
    There is no space between `with` and `` as shown below:
    `... configured withspark.sql.streaming.fileSource.cleaner.numThreads ...`

    ### Why are the changes needed?
    Added the missing space.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Only documentation was changed; no code was changed.

    Closes #34458 from mans2singh/structured_streaming_programming_guide_space.

    Authored-by: mans2singh
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 675071a38e47dc2c55cf4f71de7ad0bebc1b4f2b)
    Signed-off-by: Hyukjin Kwon
---
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 4642d44..3aa0d4c 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -553,7 +553,7 @@ Here are the details of all the sources in Spark.
     For example, suppose you provide '/hello?/spark/*' as source pattern, '/hello1/spark/archive/dir' cannot be used as the value of "sourceArchiveDir", as '/hello?/spark/*' and '/hello1/spark/archive' will be matched. '/hello1/spark' cannot be also used as the value of "sourceArchiveDir", as '/hello?/spark' and '/hello1/spark' will be matched. '/archived/here' would be OK as it doesn't match. Spark will move source files respecting their own path. For example, if the path of source file is /a/b/dataset.txt and the path of archive directory is /archived/here, file will be moved to /archived/here/a/b/dataset.txt.
     NOTE: Both archiving (via moving) or deleting completed files will introduce overhead (slow down, even if it's happening in separate thread) in each micro-batch, so you need to understand the cost for each operation in your file system before enabling this option. On the other hand, enabling this option will reduce the cost to list source files which can be an expensive operation.
-    Number of threads used in completed file cleaner can be configured withspark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
+    Number of threads used in completed file cleaner can be configured with spark.sql.streaming.fileSource.cleaner.numThreads (default: 1).
     NOTE 2: The source path should not be used from multiple sources or queries when enabling this option. Similarly, you must ensure the source path doesn't match to any files in output directory of file stream sink.
     NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query. Spark may not clean up some source files in some circumstances - e.g. the application doesn't shut down gracefully, too many files are queued to clean up.
[spark] branch master updated (675071a -> 320fa07)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 675071a  [MINOR][DOCS] Corrected spacing in structured streaming programming
   add 320fa07  [SPARK-37159][SQL][TESTS] Change HiveExternalCatalogVersionsSuite to be able to test with Java 17

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
[spark] branch master updated (cf7fbc1 -> 675071a)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from cf7fbc1  [SPARK-36554][SQL][PYTHON] Expose make_date expression in functions.scala
   add 675071a  [MINOR][DOCS] Corrected spacing in structured streaming programming

No new revisions were added by this update.

Summary of changes:
 docs/structured-streaming-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch master updated (11de0fd -> cf7fbc1)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 11de0fd  [MINOR][DOCS] Add import for MultivariateGaussian to Docs
   add cf7fbc1  [SPARK-36554][SQL][PYTHON] Expose make_date expression in functions.scala

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql.rst   |  1 +
 python/pyspark/sql/functions.py                | 29 ++
 python/pyspark/sql/tests/test_functions.py     | 10 +++-
 .../scala/org/apache/spark/sql/functions.scala |  9 +++
 4 files changed, 48 insertions(+), 1 deletion(-)
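The change above exposes the existing `make_date` SQL expression as a typed function in `functions.scala` (and in PySpark). A minimal sketch of how it can be used from Scala (the session setup and column names here are illustrative, not taken from the commit):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.make_date

// Hypothetical session, for illustration only.
val spark = SparkSession.builder().appName("make-date-example").getOrCreate()
import spark.implicits._

// make_date(year, month, day) assembles a DateType column
// from three integer columns, e.g. (2021, 11, 1) -> 2021-11-01.
val df = Seq((2021, 11, 1)).toDF("y", "m", "d")
df.select(make_date($"y", $"m", $"d").as("date")).show()
```

Before this commit the expression was reachable only via SQL (`SELECT make_date(2021, 11, 1)`) or `expr(...)`; the DataFrame API wrapper avoids string-building for this common case.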
[spark] branch master updated (70fde44 -> 11de0fd)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

  from 70fde44  [SPARK-37062][SS] Introduce a new data source for providing consistent set of rows per microbatch
   add 11de0fd  [MINOR][DOCS] Add import for MultivariateGaussian to Docs

No new revisions were added by this update.

Summary of changes:
 python/pyspark/ml/stat.py | 1 +
 1 file changed, 1 insertion(+)
[spark] branch master updated: [SPARK-37062][SS] Introduce a new data source for providing consistent set of rows per microbatch
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 70fde44  [SPARK-37062][SS] Introduce a new data source for providing consistent set of rows per microbatch

70fde44 is described below

commit 70fde44e930926cbcd1fc95fa7cfb915c25cff9c
Author: Jungtaek Lim
AuthorDate: Mon Nov 1 20:04:10 2021 +0900

    [SPARK-37062][SS] Introduce a new data source for providing consistent set of rows per microbatch

    ### What changes were proposed in this pull request?

    This PR proposes to introduce a new data source having the short name "rate-micro-batch", which produces input rows similar to "rate" (incrementing long values with timestamps), but ensures that each micro-batch has a "predictable" set of input rows.

    The "rate-micro-batch" data source receives a config to specify the number of rows per micro-batch, which defines the set of input rows for further micro-batches. For example, if the number of rows per micro-batch is set to 1000, the first batch would have 1000 rows with the value range `0~999`, the second batch would have 1000 rows with the value range `1000~1999`, and so on. This characteristic brings different use cases compared to the rate data source, as we can't predict the input rows [...]

    For generated time (the timestamp column), the data source applies the same mechanism to make the value of the column predictable. The `startTimestamp` option defines the starting value of generated time, and the `advanceMillisPerBatch` option defines how much the generated time should advance per micro-batch. All input rows in the same micro-batch have the same timestamp.

    This source supports the following options:

    * `rowsPerBatch` (e.g. 100): How many rows should be generated per micro-batch.
    * `numPartitions` (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows.
    * `startTimestamp` (e.g. 1000, default: 0): starting value of generated time.
    * `advanceMillisPerBatch` (e.g. 1000, default: 1000): the amount of time being advanced in generated time on each micro-batch.

    ### Why are the changes needed?

    The "rate" data source has been known to be used as a benchmark for streaming queries. While this helps to push a query to its limit (how many rows the query can process per second), the rate data source doesn't provide consistent rows per batch, which makes two environments hard to compare.

    For example, in many cases, you may want to compare the metrics in the batches between test environments (like running the same streaming query with different options). These metrics are strongly affected if the distribution of input rows across batches changes, especially when a micro-batch has lagged (for any reason) and the rate data source produces more input rows for the next batch.

    Also, when you test against streaming aggregation, you may want the data source to produce the same set of input rows per batch (deterministic), so that you can plan how these input rows will be aggregated and how state rows will be evicted, and craft the test query based on the plan.

    ### Does this PR introduce _any_ user-facing change?

    Yes, end users can leverage a new data source in micro-batch mode of a streaming query to test/benchmark.

    ### How was this patch tested?

    New UTs, and manually tested via the below query in spark-shell:

    ```
    spark.readStream.format("rate-micro-batch").option("rowsPerBatch", 20).option("numPartitions", 3).load().writeStream.format("console").start()
    ```

    Closes #34333 from HeartSaVioR/SPARK-37062.

    Authored-by: Jungtaek Lim
    Signed-off-by: Jungtaek Lim
---
 docs/structured-streaming-programming-guide.md     |  13 ++
 ...org.apache.spark.sql.sources.DataSourceRegister |   1 +
 .../sources/RatePerMicroBatchProvider.scala        | 127 +
 .../sources/RatePerMicroBatchStream.scala          | 175 ++
 .../sources/RatePerMicroBatchProviderSuite.scala   | 204 +
 5 files changed, 520 insertions(+)

diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index b36cdc7..6237d47 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -517,6 +517,8 @@ There are a few built-in sources.
 - **Rate source (for testing)** - Generates data at the specified number of rows per second, each output row contains a `timestamp` and `value`. Where `timestamp` is a `Timestamp` type containing the time of message dispatch, and `value` is of `Long` type containing the message count, starting from 0 as the first row. This source is intended for testing and bench
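Based on the options documented in the commit message above, a fuller sketch of the manual test can spell out every knob the new source accepts (the session setup is illustrative; the option names and defaults are those stated in the PR description):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session, for illustration only.
val spark = SparkSession.builder().appName("rate-micro-batch-example").getOrCreate()

// Batch 0 yields values 0-99, batch 1 yields 100-199, and so on;
// all rows in one batch share a single timestamp, which starts at
// startTimestamp and advances advanceMillisPerBatch per micro-batch.
val stream = spark.readStream
  .format("rate-micro-batch")
  .option("rowsPerBatch", 100)
  .option("numPartitions", 2)           // default: Spark's default parallelism
  .option("startTimestamp", 0L)         // default: 0
  .option("advanceMillisPerBatch", 1000) // default: 1000
  .load()

val query = stream.writeStream.format("console").start()
```

Because both `value` and `timestamp` are fully determined by the batch index, two runs of the same query over this source see identical per-batch input, which is exactly the comparability property the PR description motivates.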