This is an automated email from the ASF dual-hosted git repository.
iemejia pushed a change to branch spark-runner_structured-streaming
in repository https://gitbox.apache.org/repos/asf/beam.git.
discard d77d229 Use "sparkMaster" in local mode to obtain number of shuffle
partitions + spotless apply
discard 66123bd Improve Pardo translation performance: avoid calling a filter
transform when there is only one output tag
discard 4b36264 Add a TODO on perf improvement of Pardo translation
discard 9cc8f17 After testing performance and correctness, launch pipeline
with dataset.foreach(). Make both test mode and production mode use foreach for
uniformity. Move dataset print as a utility method
discard aed7062 Remove no more needed AggregatorCombinerPerKey (there is only
AggregatorCombiner)
discard 9d02370 fixup! Add PipelineResults to Spark structured streaming.
discard dfe4f93 Print number of leaf datasets
discard c98460a Add spark execution plans extended debug messages.
discard 55539f5 Update log4j configuration
discard 7386c77 Add PipelineResults to Spark structured streaming.
discard 9e81526 Make spotless happy
discard 735e312 Added metrics sinks and tests
discard dc5bcae Persist all output Dataset if there are multiple outputs in
pipeline Enabled Use*Metrics tests
discard e4d2e85 Add a test to check that CombineGlobally preserves windowing
discard 4f27def Fix accumulators initialization in Combine that prevented
CombineGlobally to work.
discard efd3070 Fix javadoc
discard 0252c70 Add setEnableSparkMetricSinks() method
discard b27090b Add missing dependencies to run Spark Structured Streaming
Runner on Nexmark
discard 3f72747 Add metrics support in DoFn
discard 928d5ef Ignore for now not working test testCombineGlobally
discard 8760699 Add a test that combine per key preserves windowing
discard b8592f6 Clean groupByKeyTest
discard 9a5d813 add comment in combine globally test
discard 1b17242 Fixed immutable list bug
discard 3c4f55c Fix javadoc of AggregatorCombiner
discard 5840fa1 Clean not more needed WindowingHelpers
discard a80e8c4 Clean not more needed RowHelpers
discard c4b0610 Clean no more needed KVHelpers
discard ef47a2b Now that there is only Combine.PerKey translation, make only
one Aggregator
discard f060076 Remove CombineGlobally translation because it is less
performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
discard 007c937 Remove the mapPartition that adds a key per partition because
otherwise spark will reduce values per key instead of globally
discard 537dfcc Fix bug in the window merging logic
discard ef54b3d Fix wrong encoder in combineGlobally GBK
discard 9f83191 Fix case when a window does not merge into any other window
discard c48421d Apply a groupByKey avoids for some reason that the spark
structured streaming fmwk casts data to Row which makes it impossible to
deserialize without the coder shipped into the data. For performance reasons
(avoid memory consumption and having to deserialize), we do not ship coder +
data. Also add a mapparitions before GBK to avoid shuffling
discard d47f721 Revert extractKey while combinePerKey is not done (so that it
compiles)
discard c4c4d8d Fix encoder in combine call
discard 80ecd5b Implement merge accumulators part of CombineGlobally
translation with windowing
discard 2726c5a Output data after combine
discard fcd021b Implement reduce part of CombineGlobally translation with
windowing
discard 859b4ce Fix comment about schemas
discard e7c9bed Update KVHelpers.extractKey() to deal with WindowedValue and
update GBK and CPK
discard 06f88a6 Add TODO in Combine translations
discard 02f690f Add a test that GBK preserves windowing
discard edecedf Improve visibility of debug messages
discard 7d8ab5f re-enable reduceFnRunner timers for output
discard 53fb97b Re-code GroupByKeyTranslatorBatch to conserve windowing
instead of unwindowing/windowing(GlobalWindow): simplify code, use
ReduceFnRunner to merge the windows
discard b9c48ac Add comment about checkpoint mark
discard 81d687c Update windowAssignTest
discard f460906 Put back batch/simpleSourceTest.testBoundedSource
discard 74316dc Consider null object case on RowHelpers, fixes empty side
inputs tests.
discard 246e8c3 Pass transform based doFnSchemaInformation in ParDo
translation
discard 01dc170 Fixes ParDo not calling setup and not tearing down if
exception on startBundle
discard 7aba40d Limit the number of partitions to make tests go 300% faster
discard e2248ae Enable batch Validates Runner tests for Structured Streaming
Runner
discard a69cc67 Apply Spotless
discard aef79ae Update javadoc
discard ee67ef3 implement source.stop
discard 035d261 Ignore spark offsets (cf javadoc)
discard 0d2dc3e Use PAssert in Spark Structured Streaming transform tests
discard 16b3eb0 Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid an eventual
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
discard be96e01 Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
discard 6b83da4 Fix logging levels in Spark Structured Streaming translation
discard 0687f3c Fix spotless issues after rebase
discard 6e0341b Pass doFnSchemaInformation to ParDo batch translation
discard 696d17d Fix non-vendored imports from Spark Streaming Runner classes
discard dac1ce9 Merge Spark Structured Streaming runner into main Spark
module. Remove spark-structured-streaming module.Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOotions and Registrar
for Structured Streaming Runner
discard 470085a Fix access level issues, typos and modernize code to Java 8
style
discard 2b37240 Disable never ending test SimpleSourceTest.testUnboundedSource
discard 4e87333 Apply spotless and fix spotbugs warnings
discard dddbccf Deal with checkpoint and offset based read
discard 5499990 Continue impl of offsets for streaming source
discard 041d8e3 Clean streaming source
discard e25571b Clean unneeded 0 arg constructor in batch source
discard fde0e46 Specify checkpointLocation at the pipeline start
discard f446d14 Add source streaming test
discard bac7b46 Add transformators registry in PipelineTranslatorStreaming
discard 3c9e3ad Add a TODO on spark output modes
discard 66c6090 Implement first streaming source
discard a476cb2 Add streaming source initialisation
discard 62cbf48 And unchecked warning suppression
discard 2551488 Added TODO comment for ReshuffleTranslatorBatch
discard d7b8ea2 Added using CachedSideInputReader
discard 4d540fe Don't use Reshuffle translation
discard dd3b0ef Fix CheckStyle violations
discard 5dd6884 Added SideInput support
discard 077a563 Fix javadoc
discard 7e39e2a Implement WindowAssignTest
discard f5a3a6a Implement WindowAssignTranslatorBatch
discard 1f772db Cleaning
discard bb210c4 Fix encoder bug in combinePerkey
discard 4f63965 Add explanation about receiving a Row as input in the combiner
discard b8cf669 Use more generic Row instead of GenericRowWithSchema
discard ab06b8d Fix combine. For unknown reason GenericRowWithSchema is used
as input of combine so extract its content to be able to proceed
discard 143f09f Update test with Long
discard e11a72d Fix various type checking issues in Combine.Globally
discard f4a193c Get back to classes in translators resolution because URNs
cannot translate Combine.Globally
discard 4a1fd5b Cleaning
discard e870813 Add CombineGlobally translation to avoid translating
Combine.perKey as a composite transform based on Combine.PerKey (which uses low
perf GBK)
discard 6d1301c Introduce RowHelpers
discard e0de5d4 Add combinePerKey and CombineGlobally tests
discard ad8b109 Fix combiner using KV as input, use binary encoders in place
of accumulatorEncoder and outputEncoder, use helpers, spotless
discard f145076 Introduce WindowingHelpers (and helpers package) and use it
in Pardo, GBK and CombinePerKey
discard 788aefb Improve type checking of Tuple2 encoder
discard 8429fb6 First version of combinePerKey
discard 4f1676d Extract binary schema creation in a helper class
discard 4a91a3c Fix getSideInputs
discard 3d2a05e Generalize the use of SerializablePipelineOptions in place of
(not serializable) PipelineOptions
discard c17ce6e Rename SparkDoFnFilterFunction to DoFnFilterFunction for
consistency
discard 181e4c5 Add a test for the most simple possible Combine
discard 873bc95 Added "testTwoPardoInRow"
discard 17b520b Fix for test elements container in GroupByKeyTest
discard 898b299 Rename SparkOutputManager for consistency
discard 424d39c Fix kryo issue in GBK translator with a workaround
discard 2bbd8c3 Simplify logic of ParDo translator
discard 10b07ad Don't use deprecated sideInput.getWindowingStrategyInternal()
discard dfac8e3 Rename SparkSideInputReader class and rename pruneOutput() to
pruneOutputFilteredByTag()
discard fc9eed2 Fixed Javadoc error
discard 12a86ec Apply spotless
discard c805783 Fix Encoders: create an Encoder for every manipulated type to
avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders
with generic types such as WindowedValue<T> and get the type checking back
discard 8f8fff6 Fail in case of having SideInouts or State/Timers
discard 951efec Add ComplexSourceTest
discard 58e9108 Remove no more needed putDatasetRaw
discard a2ed2db Port latest changes of ReadSourceTranslatorBatch to
ReadSourceTranslatorStreaming
discard 54a88b6 Fix type checking with Encoder of WindowedValue<T>
discard 409f3cb Add comments and TODO to GroupByKeyTranslatorBatch
discard c7116de Add GroupByKeyTest
discard baf6885 Clean
discard b77a225 Address minor review notes
discard 6005962 Add ParDoTest
discard a98b158 Clean
discard c3e448c Fix split bug
discard 4f0b82d Remove bundleSize parameter and always use spark default
parallelism
discard d8b1531 Cleaning
discard f530a62 Fix testMode output to comply with new binary schema
discard ca5a2f8 Fix errorprone
discard a0df1f9 Comment schema choices
discard 6e00851 Serialize windowedValue to byte[] in source to be able to
specify a binary dataset schema and deserialize windowedValue from Row to get a
dataset<WindowedValue>
discard 739ad71 First attempt for ParDo primitive implementation
discard 188e10d Add flatten test
discard 1ef2566 Enable gradle build scan
discard 6a683a3 Enable test mode
discard ca02796 Put all transform translators Serializable
discard ab8b22e Simplify beam reader creation as it created once the source
as already been partitioned
discard df8d0ef Fix SourceTest
discard 70b145a Move SourceTest to same package as tested class
discard f369ddb Add serialization test
discard e28e3fd Add SerializationDebugger
discard d99dcc3 Fix serialization issues
discard 0e9d7fa Clean unneeded fields in DatasetReader
discard 591aaf2 improve readability of options passing to the source
discard f9bdc19 Fix pipeline triggering: use a spark action instead of
writing the dataset
discard ab9acc2 Refactor SourceTest to a UTest instaed of a main
discard 38e918a Checkstyle and Findbugs
discard f53becb Clean
discard 6338e78 Add empty 0-arg constructor for mock source
discard 2424be8 Add a dummy schema for reader
discard 59b56bc Apply spotless and fix checkstyle
discard 5829530 Use new PipelineOptionsSerializationUtils
discard e944d40 Add missing 0-arg public constructor
discard 86f50a4 Wire real SourceTransform and not mock and update the test
discard 8288575 Refactor DatasetSource fields
discard 73eb96e Pass Beam Source and PipelineOptions to the spark DataSource
as serialized strings
discard 3668323 Move Source and translator mocks to a mock package.
discard ddbc0fc Add ReadSourceTranslatorStreaming
discard deee71f Clean
discard 40d9059 Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
discard af423ad Split batch and streaming sources and translators
discard 46b6b75 Run pipeline in batch mode or in streaming mode
discard c3d886f Move DatasetSourceMock to proper batch mode
discard 99d06c5 clean deps
discard 32a0f31 Use raw WindowedValue so that spark Encoders could work
(temporary)
discard eaf06e8 fix mock, wire mock in translators and create a main test.
discard 3fff9c7 Add source mocks
discard ecd15ff Experiment over using spark Catalog to pass in Beam Source
through spark Table
discard fa8b8cc Improve type enforcement in ReadSourceTranslator
discard 6916dcc Improve exception flow
discard fd72ddd start source instanciation
discard 0ee3697 Apply spotless
discard dd1bb28 update TODO
discard 37357d5 Implement read transform
discard b88d12c Use Iterators.transform() to return Iterable
discard 58d9f0d Add primitive GroupByKeyTranslatorBatch implementation
discard e4afbb4 Add Flatten transformation translator
discard 3be3536 Create Datasets manipulation methods
discard 2f6ace4 Create PCollections manipulation methods
discard 663aeec Add basic pipeline execution. Refactor translatePipeline() to
return the translationContext on which we can run startPipeline()
discard f80db00 Added SparkRunnerRegistrar
discard 63e4b97 Add precise TODO for multiple TransformTranslator per
transform URN
discard 4a22ca4 Post-pone batch qualifier in all classes names for readability
discard 6252c89 Add TODOs
discard c223c7e Make codestyle and firebug happy
discard 22c99d2 apply spotless
discard ee7d8ec Move common translation context components to superclass
discard 9e035a0 Move SparkTransformOverrides to correct package
discard 30b0117 Improve javadocs
discard 21cc0b5 Make transform translation clearer: renaming, comments
discard c5f43ab Refactoring: -move batch/streaming common translation visitor
and utility methods to PipelineTranslator -rename batch dedicated classes to
Batch* to differentiate with their streaming counterparts -Introduce
TranslationContext for common batch/streaming components
discard 2f0309b Initialise BatchTranslationContext
discard 070a6d7 Organise methods in PipelineTranslator
discard c52ea32 Renames: better differenciate pipeline translator for
transform translator
discard 684b735 Wire node translators with pipeline translator
discard 3bb80f7 Add nodes translators structure
discard 4247c23 Add global pipeline translation structure
discard 23c42d3 Start pipeline translation
discard 44d3fab Add SparkPipelineOptions
discard 7f33be3 Fix missing dep
discard b703f96 Add an empty spark-structured-streaming runner project
targeting spark 2.4.0
add 1a4a4f0 Don't map transforms to pubsub subscriptions unless neccessary
add 0226799 Merge pull request #9146 from
ostrokach/bugfix/pubsub_reader_global_scope
add 7ea4547 Make SDK worker resilient to bad logging services. (#9214)
add 11f9ca5 [BEAM-7844] Custom MetadataHandler for NodeStats
(RowRateWindow)
add e5366eb Merge pull request #9185 from riazela/RowRateWindow
add b8cd3ee Update website for v2.14.0
add 13c4bb6 Merge pull request #9157 from akedin/website-update-v214
add c6eb09c [BEAM-7813] FileIO.write() doesn't fail when writing to
invalid bucket
add 850e846 Merge pull request #9175: [BEAM-7813] FileIO.write() doesn't
fail when writing to invalid bucket
add 7c51e13 [BEAM-6611] Move limitations comment to TriggerCopyJobs
add 2129c31 Merge pull request #9216 from ttanay/bq-copy-comment
add a40a278 Added 2.14.0 blog post draft
add bdf3ea7 fixup
add a3ee812 fixup
add 3a7d1db Add known issues
add cab46a0 fixup
add 22ad0c2 Updated the download page anchor
add 1d2de66 Add ApproximateUnique link
add a866d6c Merge pull request #9201 from aaltay/bl214
add dd79f79 [SQL] Use reflection to instantiate planner.
add 20bb131 Merge pull request #9221 from apilloud/sql-reflection
add 48326ba Fix date on 2.14.0 download links
add b14bbd4 Merge pull request #9226 from akedin/fix-214-date
add 851a4a5 Merge pull request #8929: [BEAM-7620] Make
run_rc_validation.sh non-interactive
add c1aec62 [BEAM-7876] create_svg from graphviz returns bytes and need
to be converted to string in order for the interactive runner to work on Python3
add fa45914 Merge pull request #9220 from davidyan74/BEAM-7876
add 96d947a [BEAM-7828] Fixes Key type conversion from/to client entity
in Python Datastore IO (#9174)
add 98b2bd9 [BEAM-7664] Add all GBK Flink Python test cases to job.
add d89517f [BEAM-7664] Changed worker amount, added scaling cluster and
renamed jobs
add dace04f [BEAM-7664] Fix typo in Combine Flink test
add 544682f Merge pull request #9106 from
kkucharc/BEAM-7664-add-more-gbk-flink-test-cases
add c41cffd [BEAM-7878] Refine Spark runner dependencies
add 6ab9ae2 Merge pull request #9225: [BEAM-7878] Refine Spark runner
dependencies
new 10ae821 Add an empty spark-structured-streaming runner project
targeting spark 2.4.0
new 3519f0f Fix missing dep
new b766961 Add SparkPipelineOptions
new ed1013d Start pipeline translation
new 806d1b8 Add global pipeline translation structure
new c555920 Add nodes translators structure
new ccf25c2 Wire node translators with pipeline translator
new 71738cb Renames: better differenciate pipeline translator for
transform translator
new 92a63de Organise methods in PipelineTranslator
new eacf593 Initialise BatchTranslationContext
new 992883b Refactoring: -move batch/streaming common translation visitor
and utility methods to PipelineTranslator -rename batch dedicated classes to
Batch* to differentiate with their streaming counterparts -Introduce
TranslationContext for common batch/streaming components
new 9930dc4 Make transform translation clearer: renaming, comments
new 03fe418 Improve javadocs
new 1cb7f3b Move SparkTransformOverrides to correct package
new 4ccc2cb Move common translation context components to superclass
new d0137b9 apply spotless
new 47b05fd Make codestyle and firebug happy
new 63d3ce0 Add TODOs
new 07090b2 Post-pone batch qualifier in all classes names for readability
new 97eb0de Add precise TODO for multiple TransformTranslator per
transform URN
new 11d8b23 Added SparkRunnerRegistrar
new 8a2bad8 Add basic pipeline execution. Refactor translatePipeline() to
return the translationContext on which we can run startPipeline()
new 4179dbd Create PCollections manipulation methods
new a041977 Create Datasets manipulation methods
new 9402def Add Flatten transformation translator
new fd56501 Add primitive GroupByKeyTranslatorBatch implementation
new 06a9891 Use Iterators.transform() to return Iterable
new 6ca4265 Implement read transform
new ba25b20 update TODO
new e1c37ac Apply spotless
new af4cd26 start source instanciation
new 39a8b7d Improve exception flow
new a07ce40 Improve type enforcement in ReadSourceTranslator
new e084b11 Experiment over using spark Catalog to pass in Beam Source
through spark Table
new 6682665 Add source mocks
new 4da27b0 fix mock, wire mock in translators and create a main test.
new a2a0665 Use raw WindowedValue so that spark Encoders could work
(temporary)
new 5bd0039 clean deps
new 2c0fd81 Move DatasetSourceMock to proper batch mode
new 85b33a5 Run pipeline in batch mode or in streaming mode
new 3088c3c Split batch and streaming sources and translators
new 312333f Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
new 5e00117 Clean
new 35568ab Add ReadSourceTranslatorStreaming
new 2cb1b75 Move Source and translator mocks to a mock package.
new 5b9e4b9 Pass Beam Source and PipelineOptions to the spark DataSource
as serialized strings
new 5cab09d Refactor DatasetSource fields
new 3bf20b5 Wire real SourceTransform and not mock and update the test
new c029ebe Add missing 0-arg public constructor
new 2b4ffdf Use new PipelineOptionsSerializationUtils
new 9cab04c Apply spotless and fix checkstyle
new e9e6692 Add a dummy schema for reader
new 182babd Add empty 0-arg constructor for mock source
new d550ecf Clean
new d45bfbb Checkstyle and Findbugs
new 7a9c2c6 Refactor SourceTest to a UTest instaed of a main
new 874565f Fix pipeline triggering: use a spark action instead of
writing the dataset
new af8202e improve readability of options passing to the source
new 5043370 Clean unneeded fields in DatasetReader
new fb04533 Fix serialization issues
new 6e48777 Add SerializationDebugger
new 253c9bd Add serialization test
new 465040b Move SourceTest to same package as tested class
new b005fd0 Fix SourceTest
new 1f199db Simplify beam reader creation as it created once the source
as already been partitioned
new b9ecb10 Put all transform translators Serializable
new ab3b891 Enable test mode
new b86e2c9 Enable gradle build scan
new 893231a Add flatten test
new 1933059 First attempt for ParDo primitive implementation
new fe966cf Serialize windowedValue to byte[] in source to be able to
specify a binary dataset schema and deserialize windowedValue from Row to get a
dataset<WindowedValue>
new 0434229 Comment schema choices
new 54190bf Fix errorprone
new b11be3e Fix testMode output to comply with new binary schema
new 0586a34 Cleaning
new b1e0801 Remove bundleSize parameter and always use spark default
parallelism
new dba34e1 Fix split bug
new e7a00db Clean
new 3563aba Add ParDoTest
new 70c0b37 Address minor review notes
new 584563c Clean
new f5415fb Add GroupByKeyTest
new 8f7610b Add comments and TODO to GroupByKeyTranslatorBatch
new 8bad879 Fix type checking with Encoder of WindowedValue<T>
new ffa8d02 Port latest changes of ReadSourceTranslatorBatch to
ReadSourceTranslatorStreaming
new 94ad34e Remove no more needed putDatasetRaw
new 093cd1e Add ComplexSourceTest
new 97a7a3e Fail in case of having SideInouts or State/Timers
new 21d66a7 Fix Encoders: create an Encoder for every manipulated type to
avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders
with generic types such as WindowedValue<T> and get the type checking back
new 3cabda9 Apply spotless
new 0a52721 Fixed Javadoc error
new f4b21a8 Rename SparkSideInputReader class and rename pruneOutput() to
pruneOutputFilteredByTag()
new 1c3575f Don't use deprecated sideInput.getWindowingStrategyInternal()
new eb57488 Simplify logic of ParDo translator
new cea7019 Fix kryo issue in GBK translator with a workaround
new 91db78e Rename SparkOutputManager for consistency
new bc27442 Fix for test elements container in GroupByKeyTest
new 059a45d Added "testTwoPardoInRow"
new 7a608ce Add a test for the most simple possible Combine
new 2de632f Rename SparkDoFnFilterFunction to DoFnFilterFunction for
consistency
new 2231e1a Generalize the use of SerializablePipelineOptions in place of
(not serializable) PipelineOptions
new 0deca18 Fix getSideInputs
new 3476c9e Extract binary schema creation in a helper class
new ecaf6c8 First version of combinePerKey
new 31d80c4 Improve type checking of Tuple2 encoder
new 5b39a09 Introduce WindowingHelpers (and helpers package) and use it
in Pardo, GBK and CombinePerKey
new e19f0c4 Fix combiner using KV as input, use binary encoders in place
of accumulatorEncoder and outputEncoder, use helpers, spotless
new 7447090 Add combinePerKey and CombineGlobally tests
new 6d19f9f Introduce RowHelpers
new 0b96f4d Add CombineGlobally translation to avoid translating
Combine.perKey as a composite transform based on Combine.PerKey (which uses low
perf GBK)
new 91f910f Cleaning
new c3565ee Get back to classes in translators resolution because URNs
cannot translate Combine.Globally
new 2608e96 Fix various type checking issues in Combine.Globally
new 2b4bfd3 Update test with Long
new c72aa4a Fix combine. For unknown reason GenericRowWithSchema is used
as input of combine so extract its content to be able to proceed
new 5a6ae19 Use more generic Row instead of GenericRowWithSchema
new ba1b628 Add explanation about receiving a Row as input in the combiner
new 2ece1d7 Fix encoder bug in combinePerkey
new 0d2c5ca Cleaning
new cb1c0dd Implement WindowAssignTranslatorBatch
new 21bb0a6 Implement WindowAssignTest
new 9dd08b7 Fix javadoc
new c00e1c4 Added SideInput support
new 753913e Fix CheckStyle violations
new 64ef766 Don't use Reshuffle translation
new ef58a99 Added using CachedSideInputReader
new 53c6409 Added TODO comment for ReshuffleTranslatorBatch
new b912e8d And unchecked warning suppression
new 9394fb6 Add streaming source initialisation
new c9f5de8 Implement first streaming source
new 6552a53 Add a TODO on spark output modes
new 56260a1 Add transformators registry in PipelineTranslatorStreaming
new 3c5181f Add source streaming test
new 96898cc Specify checkpointLocation at the pipeline start
new 646b416 Clean unneeded 0 arg constructor in batch source
new 5f13f11 Clean streaming source
new b366b7f Continue impl of offsets for streaming source
new 6d717d9 Deal with checkpoint and offset based read
new abaf2e2 Apply spotless and fix spotbugs warnings
new 9143b88 Disable never ending test SimpleSourceTest.testUnboundedSource
new 96b9e8f Fix access level issues, typos and modernize code to Java 8
style
new ae3526c Merge Spark Structured Streaming runner into main Spark
module. Remove spark-structured-streaming module.Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOotions and Registrar
for Structured Streaming Runner
new 63364db Fix non-vendored imports from Spark Streaming Runner classes
new 26974f0 Pass doFnSchemaInformation to ParDo batch translation
new b3cda3c Fix spotless issues after rebase
new e9da167 Fix logging levels in Spark Structured Streaming translation
new 57ca802 Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
new 66f3928 Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid an eventual
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
new e889206 Use PAssert in Spark Structured Streaming transform tests
new 9a460d0 Ignore spark offsets (cf javadoc)
new 07a123b implement source.stop
new 1f54355 Update javadoc
new 91a7fc6 Apply Spotless
new 0193941 Enable batch Validates Runner tests for Structured Streaming
Runner
new 55b98f8 Limit the number of partitions to make tests go 300% faster
new 86d03a4 Fixes ParDo not calling setup and not tearing down if
exception on startBundle
new a990264 Pass transform based doFnSchemaInformation in ParDo
translation
new 988a6cf Consider null object case on RowHelpers, fixes empty side
inputs tests.
new 6468c43 Put back batch/simpleSourceTest.testBoundedSource
new 018d781 Update windowAssignTest
new 258d565 Add comment about checkpoint mark
new 60afc9c Re-code GroupByKeyTranslatorBatch to conserve windowing
instead of unwindowing/windowing(GlobalWindow): simplify code, use
ReduceFnRunner to merge the windows
new b4c70d9 re-enable reduceFnRunner timers for output
new 9721a02 Improve visibility of debug messages
new 8b5424a Add a test that GBK preserves windowing
new 653351f Add TODO in Combine translations
new 5e4e034 Update KVHelpers.extractKey() to deal with WindowedValue and
update GBK and CPK
new 55c6f20 Fix comment about schemas
new 3c5eb1f Implement reduce part of CombineGlobally translation with
windowing
new 7b4ba5b Output data after combine
new 200e6ce Implement merge accumulators part of CombineGlobally
translation with windowing
new ec7ac95 Fix encoder in combine call
new 8ad7d4f Revert extractKey while combinePerKey is not done (so that it
compiles)
new 6353f9c Apply a groupByKey avoids for some reason that the spark
structured streaming fmwk casts data to Row which makes it impossible to
deserialize without the coder shipped into the data. For performance reasons
(avoid memory consumption and having to deserialize), we do not ship coder +
data. Also add a mapparitions before GBK to avoid shuffling
new 29a8e83 Fix case when a window does not merge into any other window
new be71f05 Fix wrong encoder in combineGlobally GBK
new 47f3aa5 Fix bug in the window merging logic
new 4e9c9e1 Remove the mapPartition that adds a key per partition because
otherwise spark will reduce values per key instead of globally
new 5adeca5 Remove CombineGlobally translation because it is less
performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
new 3ff116a Now that there is only Combine.PerKey translation, make only
one Aggregator
new 11d504b Clean no more needed KVHelpers
new 33f1a39 Clean not more needed RowHelpers
new 1729bb7 Clean not more needed WindowingHelpers
new 33c1554 Fix javadoc of AggregatorCombiner
new d263ff3 Fixed immutable list bug
new 94a7be6 add comment in combine globally test
new 9a7e5a3 Clean groupByKeyTest
new 7dde4fb Add a test that combine per key preserves windowing
new 8943636 Ignore for now not working test testCombineGlobally
new 6b66509 Add metrics support in DoFn
new 9472863 Add missing dependencies to run Spark Structured Streaming
Runner on Nexmark
new 186e8ac Add setEnableSparkMetricSinks() method
new b11bd8b Fix javadoc
new 6d4a75d Fix accumulators initialization in Combine that prevented
CombineGlobally to work.
new f1f58a9 Add a test to check that CombineGlobally preserves windowing
new 9879bc9 Persist all output Dataset if there are multiple outputs in
pipeline Enabled Use*Metrics tests
new 3fd985f Added metrics sinks and tests
new 8a9d97c Make spotless happy
new ba68f6c Add PipelineResults to Spark structured streaming.
new eb3c7d7 Update log4j configuration
new 2be3d94 Add spark execution plans extended debug messages.
new 8600bad Print number of leaf datasets
new d6830fb fixup! Add PipelineResults to Spark structured streaming.
new 9b8fd33 Remove no more needed AggregatorCombinerPerKey (there is only
AggregatorCombiner)
new 53374e1 After testing performance and correctness, launch pipeline
with dataset.foreach(). Make both test mode and production mode use foreach for
uniformity. Move dataset print as a utility method
new 9472a41 Add a TODO on perf improvement of Pardo translation
new 9de62ab Improve Pardo translation performance: avoid calling a filter
transform when there is only one output tag
new 6bf5a93 Use "sparkMaster" in local mode to obtain number of shuffle
partitions + spotless apply
This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version. This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:
 * -- * -- B -- O -- O -- O   (d77d229)
            \
             N -- N -- N   refs/heads/spark-runner_structured-streaming (6bf5a93)
You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.
Any revisions marked "omit" are not gone; other references still
refer to them. Any revisions marked "discard" are gone forever.
The 208 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
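The B/O/N rewrite described above can be reproduced in a throwaway repository. This is a minimal sketch, not anything from the Beam repository: every path, branch name, and commit message below is hypothetical.

```shell
#!/bin/sh
# Sketch: how a --force push replaces revisions O with rewritten revisions N.
# All names here (origin.git, work, main, f) are made-up for illustration.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q --bare origin.git
git clone -q origin.git work 2>/dev/null; cd work
git config user.email dev@example.com; git config user.name Dev

echo base > f; git add f; git commit -qm "B: common base"
echo old  > f; git commit -qam "O: revision to be discarded"
git push -q origin HEAD:main            # remote history is now B -- O

git reset -q --hard HEAD~1              # drop O locally
echo new > f; git commit -qam "N: rewritten revision"
git push -q --force origin HEAD:main    # remote history is now B -- N

git log --oneline origin/main           # shows N and B; O is unreachable
```

For this particular email, a reader who had fetched the branch before the force push could compare the discarded and replacement histories with `git range-diff d77d229...6bf5a93` (the three-dot form diffs both ranges from their common base B).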
Summary of changes:
.test-infra/jenkins/Infrastructure.groovy | 1 +
.../job_LoadTests_Combine_Flink_Python.groovy | 2 +-
.../jenkins/job_LoadTests_GBK_Flink_Python.groovy | 155 +++++-
.../org/apache/beam/gradle/BeamModulePlugin.groovy | 2 +-
release/src/main/scripts/run_rc_validation.sh | 540 +++++++++------------
release/src/main/scripts/script.config | 135 ++++++
runners/spark/build.gradle | 11 +-
.../main/java/org/apache/beam/sdk/io/TextIO.java | 3 +-
.../beam/sdk/extensions/sql/impl/BeamSqlEnv.java | 9 +-
.../extensions/sql/impl/CalciteQueryPlanner.java | 7 +-
.../sdk/extensions/sql/impl/planner/NodeStats.java | 86 ++++
.../sql/impl/planner/NodeStatsMetadata.java | 55 +++
.../sql/impl/planner/RelMdNodeStats.java | 84 ++++
.../sql/impl/rel/BeamAggregationRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamCalcRel.java | 6 +
.../sdk/extensions/sql/impl/rel/BeamIOSinkRel.java | 7 +
.../extensions/sql/impl/rel/BeamIOSourceRel.java | 6 +
.../extensions/sql/impl/rel/BeamIntersectRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamJoinRel.java | 6 +
.../sdk/extensions/sql/impl/rel/BeamMinusRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamRelNode.java | 4 +
.../sdk/extensions/sql/impl/rel/BeamSortRel.java | 7 +
.../extensions/sql/impl/rel/BeamUncollectRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamUnionRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamUnnestRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamValuesRel.java | 7 +
.../extensions/sql/impl/planner/NodeStatsTest.java | 79 +++
.../org/apache/beam/sdk/io/parquet/ParquetIO.java | 1 +
.../java/org/apache/beam/sdk/io/xml/XmlIO.java | 1 +
sdks/python/apache_beam/io/gcp/bigquery.py | 5 -
.../apache_beam/io/gcp/bigquery_file_loads.py | 4 +
.../io/gcp/datastore/v1new/datastoreio.py | 11 +-
.../io/gcp/datastore/v1new/datastoreio_test.py | 16 +
.../apache_beam/io/gcp/datastore/v1new/types.py | 25 +-
.../io/gcp/datastore/v1new/types_test.py | 13 +-
.../runners/direct/transform_evaluator.py | 62 +--
.../interactive/display/pipeline_graph_renderer.py | 2 +-
.../apache_beam/runners/worker/log_handler.py | 26 +-
.../apache_beam/runners/worker/sdk_worker_main.py | 25 +-
website/_config.yml | 2 +-
website/src/.htaccess | 2 +-
website/src/_data/authors.yml | 3 +
website/src/_posts/2019-07-31-beam-2.14.0.md | 106 ++++
website/src/get-started/downloads.md | 7 +
44 files changed, 1152 insertions(+), 413 deletions(-)
create mode 100755 release/src/main/scripts/script.config
create mode 100644
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/NodeStats.java
create mode 100644
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/NodeStatsMetadata.java
create mode 100644
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/RelMdNodeStats.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/planner/NodeStatsTest.java
create mode 100644 website/src/_posts/2019-07-31-beam-2.14.0.md