This is an automated email from the ASF dual-hosted git repository.
iemejia pushed a change to branch spark-runner_structured-streaming
in repository https://gitbox.apache.org/repos/asf/beam.git.
discard d77d229 Use "sparkMaster" in local mode to obtain number of shuffle
partitions + spotless apply
discard 66123bd Improve Pardo translation performance: avoid calling a filter
transform when there is only one output tag
discard 4b36264 Add a TODO on perf improvement of Pardo translation
discard 9cc8f17 After testing performance and correctness, launch pipeline
with dataset.foreach(). Make both test mode and production mode use foreach for
uniformity. Move dataset print as a utility method
discard aed7062 Remove no more needed AggregatorCombinerPerKey (there is only
AggregatorCombiner)
discard 9d02370 fixup! Add PipelineResults to Spark structured streaming.
discard dfe4f93 Print number of leaf datasets
discard c98460a Add spark execution plans extended debug messages.
discard 55539f5 Update log4j configuration
discard 7386c77 Add PipelineResults to Spark structured streaming.
discard 9e81526 Make spotless happy
discard 735e312 Added metrics sinks and tests
discard dc5bcae Persist all output Dataset if there are multiple outputs in
pipeline Enabled Use*Metrics tests
discard e4d2e85 Add a test to check that CombineGlobally preserves windowing
discard 4f27def Fix accumulators initialization in Combine that prevented
CombineGlobally to work.
discard efd3070 Fix javadoc
discard 0252c70 Add setEnableSparkMetricSinks() method
discard b27090b Add missing dependencies to run Spark Structured Streaming
Runner on Nexmark
discard 3f72747 Add metrics support in DoFn
discard 928d5ef Ignore for now not working test testCombineGlobally
discard 8760699 Add a test that combine per key preserves windowing
discard b8592f6 Clean groupByKeyTest
discard 9a5d813 add comment in combine globally test
discard 1b17242 Fixed immutable list bug
discard 3c4f55c Fix javadoc of AggregatorCombiner
discard 5840fa1 Clean not more needed WindowingHelpers
discard a80e8c4 Clean not more needed RowHelpers
discard c4b0610 Clean no more needed KVHelpers
discard ef47a2b Now that there is only Combine.PerKey translation, make only
one Aggregator
discard f060076 Remove CombineGlobally translation because it is less
performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
discard 007c937 Remove the mapPartition that adds a key per partition because
otherwise spark will reduce values per key instead of globally
discard 537dfcc Fix bug in the window merging logic
discard ef54b3d Fix wrong encoder in combineGlobally GBK
discard 9f83191 Fix case when a window does not merge into any other window
discard c48421d Apply a groupByKey avoids for some reason that the spark
structured streaming fmwk casts data to Row which makes it impossible to
deserialize without the coder shipped into the data. For performance reasons
(avoid memory consumption and having to deserialize), we do not ship coder +
data. Also add a mapparitions before GBK to avoid shuffling
discard d47f721 Revert extractKey while combinePerKey is not done (so that it
compiles)
discard c4c4d8d Fix encoder in combine call
discard 80ecd5b Implement merge accumulators part of CombineGlobally
translation with windowing
discard 2726c5a Output data after combine
discard fcd021b Implement reduce part of CombineGlobally translation with
windowing
discard 859b4ce Fix comment about schemas
discard e7c9bed Update KVHelpers.extractKey() to deal with WindowedValue and
update GBK and CPK
discard 06f88a6 Add TODO in Combine translations
discard 02f690f Add a test that GBK preserves windowing
discard edecedf Improve visibility of debug messages
discard 7d8ab5f re-enable reduceFnRunner timers for output
discard 53fb97b Re-code GroupByKeyTranslatorBatch to conserve windowing
instead of unwindowing/windowing(GlobalWindow): simplify code, use
ReduceFnRunner to merge the windows
discard b9c48ac Add comment about checkpoint mark
discard 81d687c Update windowAssignTest
discard f460906 Put back batch/simpleSourceTest.testBoundedSource
discard 74316dc Consider null object case on RowHelpers, fixes empty side
inputs tests.
discard 246e8c3 Pass transform based doFnSchemaInformation in ParDo
translation
discard 01dc170 Fixes ParDo not calling setup and not tearing down if
exception on startBundle
discard 7aba40d Limit the number of partitions to make tests go 300% faster
discard e2248ae Enable batch Validates Runner tests for Structured Streaming
Runner
discard a69cc67 Apply Spotless
discard aef79ae Update javadoc
discard ee67ef3 implement source.stop
discard 035d261 Ignore spark offsets (cf javadoc)
discard 0d2dc3e Use PAssert in Spark Structured Streaming transform tests
discard 16b3eb0 Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid an eventual
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
discard be96e01 Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
discard 6b83da4 Fix logging levels in Spark Structured Streaming translation
discard 0687f3c Fix spotless issues after rebase
discard 6e0341b Pass doFnSchemaInformation to ParDo batch translation
discard 696d17d Fix non-vendored imports from Spark Streaming Runner classes
discard dac1ce9 Merge Spark Structured Streaming runner into main Spark
module. Remove spark-structured-streaming module.Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOotions and Registrar
for Structured Streaming Runner
discard 470085a Fix access level issues, typos and modernize code to Java 8
style
discard 2b37240 Disable never ending test SimpleSourceTest.testUnboundedSource
discard 4e87333 Apply spotless and fix spotbugs warnings
discard dddbccf Deal with checkpoint and offset based read
discard 5499990 Continue impl of offsets for streaming source
discard 041d8e3 Clean streaming source
discard e25571b Clean unneeded 0 arg constructor in batch source
discard fde0e46 Specify checkpointLocation at the pipeline start
discard f446d14 Add source streaming test
discard bac7b46 Add transformators registry in PipelineTranslatorStreaming
discard 3c9e3ad Add a TODO on spark output modes
discard 66c6090 Implement first streaming source
discard a476cb2 Add streaming source initialisation
discard 62cbf48 And unchecked warning suppression
discard 2551488 Added TODO comment for ReshuffleTranslatorBatch
discard d7b8ea2 Added using CachedSideInputReader
discard 4d540fe Don't use Reshuffle translation
discard dd3b0ef Fix CheckStyle violations
discard 5dd6884 Added SideInput support
discard 077a563 Fix javadoc
discard 7e39e2a Implement WindowAssignTest
discard f5a3a6a Implement WindowAssignTranslatorBatch
discard 1f772db Cleaning
discard bb210c4 Fix encoder bug in combinePerkey
discard 4f63965 Add explanation about receiving a Row as input in the combiner
discard b8cf669 Use more generic Row instead of GenericRowWithSchema
discard ab06b8d Fix combine. For unknown reason GenericRowWithSchema is used
as input of combine so extract its content to be able to proceed
discard 143f09f Update test with Long
discard e11a72d Fix various type checking issues in Combine.Globally
discard f4a193c Get back to classes in translators resolution because URNs
cannot translate Combine.Globally
discard 4a1fd5b Cleaning
discard e870813 Add CombineGlobally translation to avoid translating
Combine.perKey as a composite transform based on Combine.PerKey (which uses low
perf GBK)
discard 6d1301c Introduce RowHelpers
discard e0de5d4 Add combinePerKey and CombineGlobally tests
discard ad8b109 Fix combiner using KV as input, use binary encoders in place
of accumulatorEncoder and outputEncoder, use helpers, spotless
discard f145076 Introduce WindowingHelpers (and helpers package) and use it
in Pardo, GBK and CombinePerKey
discard 788aefb Improve type checking of Tuple2 encoder
discard 8429fb6 First version of combinePerKey
discard 4f1676d Extract binary schema creation in a helper class
discard 4a91a3c Fix getSideInputs
discard 3d2a05e Generalize the use of SerializablePipelineOptions in place of
(not serializable) PipelineOptions
discard c17ce6e Rename SparkDoFnFilterFunction to DoFnFilterFunction for
consistency
discard 181e4c5 Add a test for the most simple possible Combine
discard 873bc95 Added "testTwoPardoInRow"
discard 17b520b Fix for test elements container in GroupByKeyTest
discard 898b299 Rename SparkOutputManager for consistency
discard 424d39c Fix kryo issue in GBK translator with a workaround
discard 2bbd8c3 Simplify logic of ParDo translator
discard 10b07ad Don't use deprecated sideInput.getWindowingStrategyInternal()
discard dfac8e3 Rename SparkSideInputReader class and rename pruneOutput() to
pruneOutputFilteredByTag()
discard fc9eed2 Fixed Javadoc error
discard 12a86ec Apply spotless
discard c805783 Fix Encoders: create an Encoder for every manipulated type to
avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders
with generic types such as WindowedValue<T> and get the type checking back
discard 8f8fff6 Fail in case of having SideInouts or State/Timers
discard 951efec Add ComplexSourceTest
discard 58e9108 Remove no more needed putDatasetRaw
discard a2ed2db Port latest changes of ReadSourceTranslatorBatch to
ReadSourceTranslatorStreaming
discard 54a88b6 Fix type checking with Encoder of WindowedValue<T>
discard 409f3cb Add comments and TODO to GroupByKeyTranslatorBatch
discard c7116de Add GroupByKeyTest
discard baf6885 Clean
discard b77a225 Address minor review notes
discard 6005962 Add ParDoTest
discard a98b158 Clean
discard c3e448c Fix split bug
discard 4f0b82d Remove bundleSize parameter and always use spark default
parallelism
discard d8b1531 Cleaning
discard f530a62 Fix testMode output to comply with new binary schema
discard ca5a2f8 Fix errorprone
discard a0df1f9 Comment schema choices
discard 6e00851 Serialize windowedValue to byte[] in source to be able to
specify a binary dataset schema and deserialize windowedValue from Row to get a
dataset<WindowedValue>
discard 739ad71 First attempt for ParDo primitive implementation
discard 188e10d Add flatten test
discard 1ef2566 Enable gradle build scan
discard 6a683a3 Enable test mode
discard ca02796 Put all transform translators Serializable
discard ab8b22e Simplify beam reader creation as it created once the source
as already been partitioned
discard df8d0ef Fix SourceTest
discard 70b145a Move SourceTest to same package as tested class
discard f369ddb Add serialization test
discard e28e3fd Add SerializationDebugger
discard d99dcc3 Fix serialization issues
discard 0e9d7fa Clean unneeded fields in DatasetReader
discard 591aaf2 improve readability of options passing to the source
discard f9bdc19 Fix pipeline triggering: use a spark action instead of
writing the dataset
discard ab9acc2 Refactor SourceTest to a UTest instaed of a main
discard 38e918a Checkstyle and Findbugs
discard f53becb Clean
discard 6338e78 Add empty 0-arg constructor for mock source
discard 2424be8 Add a dummy schema for reader
discard 59b56bc Apply spotless and fix checkstyle
discard 5829530 Use new PipelineOptionsSerializationUtils
discard e944d40 Add missing 0-arg public constructor
discard 86f50a4 Wire real SourceTransform and not mock and update the test
discard 8288575 Refactor DatasetSource fields
discard 73eb96e Pass Beam Source and PipelineOptions to the spark DataSource
as serialized strings
discard 3668323 Move Source and translator mocks to a mock package.
discard ddbc0fc Add ReadSourceTranslatorStreaming
discard deee71f Clean
discard 40d9059 Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
discard af423ad Split batch and streaming sources and translators
discard 46b6b75 Run pipeline in batch mode or in streaming mode
discard c3d886f Move DatasetSourceMock to proper batch mode
discard 99d06c5 clean deps
discard 32a0f31 Use raw WindowedValue so that spark Encoders could work
(temporary)
discard eaf06e8 fix mock, wire mock in translators and create a main test.
discard 3fff9c7 Add source mocks
discard ecd15ff Experiment over using spark Catalog to pass in Beam Source
through spark Table
discard fa8b8cc Improve type enforcement in ReadSourceTranslator
discard 6916dcc Improve exception flow
discard fd72ddd start source instanciation
discard 0ee3697 Apply spotless
discard dd1bb28 update TODO
discard 37357d5 Implement read transform
discard b88d12c Use Iterators.transform() to return Iterable
discard 58d9f0d Add primitive GroupByKeyTranslatorBatch implementation
discard e4afbb4 Add Flatten transformation translator
discard 3be3536 Create Datasets manipulation methods
discard 2f6ace4 Create PCollections manipulation methods
discard 663aeec Add basic pipeline execution. Refactor translatePipeline() to
return the translationContext on which we can run startPipeline()
discard f80db00 Added SparkRunnerRegistrar
discard 63e4b97 Add precise TODO for multiple TransformTranslator per
transform URN
discard 4a22ca4 Post-pone batch qualifier in all classes names for readability
discard 6252c89 Add TODOs
discard c223c7e Make codestyle and firebug happy
discard 22c99d2 apply spotless
discard ee7d8ec Move common translation context components to superclass
discard 9e035a0 Move SparkTransformOverrides to correct package
discard 30b0117 Improve javadocs
discard 21cc0b5 Make transform translation clearer: renaming, comments
discard c5f43ab Refactoring: -move batch/streaming common translation visitor
and utility methods to PipelineTranslator -rename batch dedicated classes to
Batch* to differentiate with their streaming counterparts -Introduce
TranslationContext for common batch/streaming components
discard 2f0309b Initialise BatchTranslationContext
discard 070a6d7 Organise methods in PipelineTranslator
discard c52ea32 Renames: better differenciate pipeline translator for
transform translator
discard 684b735 Wire node translators with pipeline translator
discard 3bb80f7 Add nodes translators structure
discard 4247c23 Add global pipeline translation structure
discard 23c42d3 Start pipeline translation
discard 44d3fab Add SparkPipelineOptions
discard 7f33be3 Fix missing dep
discard b703f96 Add an empty spark-structured-streaming runner project
targeting spark 2.4.0
add 1a4a4f0 Don't map transforms to pubsub subscriptions unless neccessary
add 0226799 Merge pull request #9146 from
ostrokach/bugfix/pubsub_reader_global_scope
add 7ea4547 Make SDK worker resilient to bad logging services. (#9214)
add 11f9ca5 [BEAM-7844] Custom MetadataHandler for NodeStats
(RowRateWindow)
add e5366eb Merge pull request #9185 from riazela/RowRateWindow
add b8cd3ee Update website for v2.14.0
add 13c4bb6 Merge pull request #9157 from akedin/website-update-v214
add c6eb09c [BEAM-7813] FileIO.write() doesn't fail when writing to
invalid bucket
add 850e846 Merge pull request #9175: [BEAM-7813] FileIO.write() doesn't
fail when writing to invalid bucket
add 7c51e13 [BEAM-6611] Move limitations comment to TriggerCopyJobs
add 2129c31 Merge pull request #9216 from ttanay/bq-copy-comment
add a40a278 Added 2.14.0 blog post draft
add bdf3ea7 fixup
add a3ee812 fixup
add 3a7d1db Add known issues
add cab46a0 fixup
add 22ad0c2 Updated the download page anchor
add 1d2de66 Add ApproximateUnique link
add a866d6c Merge pull request #9201 from aaltay/bl214
add dd79f79 [SQL] Use reflection to instantiate planner.
add 20bb131 Merge pull request #9221 from apilloud/sql-reflection
add 48326ba Fix date on 2.14.0 download links
add b14bbd4 Merge pull request #9226 from akedin/fix-214-date
add 851a4a5 Merge pull request #8929: [BEAM-7620] Make
run_rc_validation.sh non-interactive
add c1aec62 [BEAM-7876] create_svg from graphviz returns bytes and need
to be converted to string in order for the interactive runner to work on Python3
add fa45914 Merge pull request #9220 from davidyan74/BEAM-7876
add 96d947a [BEAM-7828] Fixes Key type conversion from/to client entity
in Python Datastore IO (#9174)
add 98b2bd9 [BEAM-7664] Add all GBK Flink Python test cases to job.
add d89517f [BEAM-7664] Changed worker amount, added scaling cluster and
renamed jobs
add dace04f [BEAM-7664] Fix typo in Combine Flink test
add 544682f Merge pull request #9106 from
kkucharc/BEAM-7664-add-more-gbk-flink-test-cases
add c41cffd [BEAM-7878] Refine Spark runner dependencies
add 6ab9ae2 Merge pull request #9225: [BEAM-7878] Refine Spark runner
dependencies
new 10ae821 Add an empty spark-structured-streaming runner project
targeting spark 2.4.0
new 3519f0f Fix missing dep
new b766961 Add SparkPipelineOptions
new ed1013d Start pipeline translation
new 806d1b8 Add global pipeline translation structure
new c555920 Add nodes translators structure
new ccf25c2 Wire node translators with pipeline translator
new 71738cb Renames: better differenciate pipeline translator for
transform translator
new 92a63de Organise methods in PipelineTranslator
new eacf593 Initialise BatchTranslationContext
new 992883b Refactoring: -move batch/streaming common translation visitor
and utility methods to PipelineTranslator -rename batch dedicated classes to
Batch* to differentiate with their streaming counterparts -Introduce
TranslationContext for common batch/streaming components
new 9930dc4 Make transform translation clearer: renaming, comments
new 03fe418 Improve javadocs
new 1cb7f3b Move SparkTransformOverrides to correct package
new 4ccc2cb Move common translation context components to superclass
new d0137b9 apply spotless
new 47b05fd Make codestyle and firebug happy
new 63d3ce0 Add TODOs
new 07090b2 Post-pone batch qualifier in all classes names for readability
new 97eb0de Add precise TODO for multiple TransformTranslator per
transform URN
new 11d8b23 Added SparkRunnerRegistrar
new 8a2bad8 Add basic pipeline execution. Refactor translatePipeline() to
return the translationContext on which we can run startPipeline()
new 4179dbd Create PCollections manipulation methods
new a041977 Create Datasets manipulation methods
new 9402def Add Flatten transformation translator
new fd56501 Add primitive GroupByKeyTranslatorBatch implementation
new 06a9891 Use Iterators.transform() to return Iterable
new 6ca4265 Implement read transform
new ba25b20 update TODO
new e1c37ac Apply spotless
new af4cd26 start source instanciation
new 39a8b7d Improve exception flow
new a07ce40 Improve type enforcement in ReadSourceTranslator
new e084b11 Experiment over using spark Catalog to pass in Beam Source
through spark Table
new 6682665 Add source mocks
new 4da27b0 fix mock, wire mock in translators and create a main test.
new a2a0665 Use raw WindowedValue so that spark Encoders could work
(temporary)
new 5bd0039 clean deps
new 2c0fd81 Move DatasetSourceMock to proper batch mode
new 85b33a5 Run pipeline in batch mode or in streaming mode
new 3088c3c Split batch and streaming sources and translators
new 312333f Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
new 5e00117 Clean
new 35568ab Add ReadSourceTranslatorStreaming
new 2cb1b75 Move Source and translator mocks to a mock package.
new 5b9e4b9 Pass Beam Source and PipelineOptions to the spark DataSource
as serialized strings
new 5cab09d Refactor DatasetSource fields
new 3bf20b5 Wire real SourceTransform and not mock and update the test
new c029ebe Add missing 0-arg public constructor
new 2b4ffdf Use new PipelineOptionsSerializationUtils
new 9cab04c Apply spotless and fix checkstyle
new e9e6692 Add a dummy schema for reader
new 182babd Add empty 0-arg constructor for mock source
new d550ecf Clean
new d45bfbb Checkstyle and Findbugs
new 7a9c2c6 Refactor SourceTest to a UTest instaed of a main
new 874565f Fix pipeline triggering: use a spark action instead of
writing the dataset
new af8202e improve readability of options passing to the source
new 5043370 Clean unneeded fields in DatasetReader
new fb04533 Fix serialization issues
new 6e48777 Add SerializationDebugger
new 253c9bd Add serialization test
new 465040b Move SourceTest to same package as tested class
new b005fd0 Fix SourceTest
new 1f199db Simplify beam reader creation as it created once the source
as already been partitioned
new b9ecb10 Put all transform translators Serializable
new ab3b891 Enable test mode
new b86e2c9 Enable gradle build scan
new 893231a Add flatten test
new 1933059 First attempt for ParDo primitive implementation
new fe966cf Serialize windowedValue to byte[] in source to be able to
specify a binary dataset schema and deserialize windowedValue from Row to get a
dataset<WindowedValue>
new 0434229 Comment schema choices
new 54190bf Fix errorprone
new b11be3e Fix testMode output to comply with new binary schema
new 0586a34 Cleaning
new b1e0801 Remove bundleSize parameter and always use spark default
parallelism
new dba34e1 Fix split bug
new e7a00db Clean
new 3563aba Add ParDoTest
new 70c0b37 Address minor review notes
new 584563c Clean
new f5415fb Add GroupByKeyTest
new 8f7610b Add comments and TODO to GroupByKeyTranslatorBatch
new 8bad879 Fix type checking with Encoder of WindowedValue<T>
new ffa8d02 Port latest changes of ReadSourceTranslatorBatch to
ReadSourceTranslatorStreaming
new 94ad34e Remove no more needed putDatasetRaw
new 093cd1e Add ComplexSourceTest
new 97a7a3e Fail in case of having SideInouts or State/Timers
new 21d66a7 Fix Encoders: create an Encoder for every manipulated type to
avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders
with generic types such as WindowedValue<T> and get the type checking back
new 3cabda9 Apply spotless
new 0a52721 Fixed Javadoc error
new f4b21a8 Rename SparkSideInputReader class and rename pruneOutput() to
pruneOutputFilteredByTag()
new 1c3575f Don't use deprecated sideInput.getWindowingStrategyInternal()
new eb57488 Simplify logic of ParDo translator
new cea7019 Fix kryo issue in GBK translator with a workaround
new 91db78e Rename SparkOutputManager for consistency
new bc27442 Fix for test elements container in GroupByKeyTest
new 059a45d Added "testTwoPardoInRow"
new 7a608ce Add a test for the most simple possible Combine
new 2de632f Rename SparkDoFnFilterFunction to DoFnFilterFunction for
consistency
new 2231e1a Generalize the use of SerializablePipelineOptions in place of
(not serializable) PipelineOptions
new 0deca18 Fix getSideInputs
new 3476c9e Extract binary schema creation in a helper class
new ecaf6c8 First version of combinePerKey
new 31d80c4 Improve type checking of Tuple2 encoder
new 5b39a09 Introduce WindowingHelpers (and helpers package) and use it
in Pardo, GBK and CombinePerKey
new e19f0c4 Fix combiner using KV as input, use binary encoders in place
of accumulatorEncoder and outputEncoder, use helpers, spotless
new 7447090 Add combinePerKey and CombineGlobally tests
new 6d19f9f Introduce RowHelpers
new 0b96f4d Add CombineGlobally translation to avoid translating
Combine.perKey as a composite transform based on Combine.PerKey (which uses low
perf GBK)
new 91f910f Cleaning
new c3565ee Get back to classes in translators resolution because URNs
cannot translate Combine.Globally
new 2608e96 Fix various type checking issues in Combine.Globally
new 2b4bfd3 Update test with Long
new c72aa4a Fix combine. For unknown reason GenericRowWithSchema is used
as input of combine so extract its content to be able to proceed
new 5a6ae19 Use more generic Row instead of GenericRowWithSchema
new ba1b628 Add explanation about receiving a Row as input in the combiner
new 2ece1d7 Fix encoder bug in combinePerkey
new 0d2c5ca Cleaning
new cb1c0dd Implement WindowAssignTranslatorBatch
new 21bb0a6 Implement WindowAssignTest
new 9dd08b7 Fix javadoc
new c00e1c4 Added SideInput support
new 753913e Fix CheckStyle violations
new 64ef766 Don't use Reshuffle translation
new ef58a99 Added using CachedSideInputReader
new 53c6409 Added TODO comment for ReshuffleTranslatorBatch
new b912e8d And unchecked warning suppression
new 9394fb6 Add streaming source initialisation
new c9f5de8 Implement first streaming source
new 6552a53 Add a TODO on spark output modes
new 56260a1 Add transformators registry in PipelineTranslatorStreaming
new 3c5181f Add source streaming test
new 96898cc Specify checkpointLocation at the pipeline start
new 646b416 Clean unneeded 0 arg constructor in batch source
new 5f13f11 Clean streaming source
new b366b7f Continue impl of offsets for streaming source
new 6d717d9 Deal with checkpoint and offset based read
new abaf2e2 Apply spotless and fix spotbugs warnings
new 9143b88 Disable never ending test SimpleSourceTest.testUnboundedSource
new 96b9e8f Fix access level issues, typos and modernize code to Java 8
style
new ae3526c Merge Spark Structured Streaming runner into main Spark
module. Remove spark-structured-streaming module.Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOotions and Registrar
for Structured Streaming Runner
new 63364db Fix non-vendored imports from Spark Streaming Runner classes
new 26974f0 Pass doFnSchemaInformation to ParDo batch translation
new b3cda3c Fix spotless issues after rebase
new e9da167 Fix logging levels in Spark Structured Streaming translation
new 57ca802 Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
new 66f3928 Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid an eventual
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
new e889206 Use PAssert in Spark Structured Streaming transform tests
new 9a460d0 Ignore spark offsets (cf javadoc)
new 07a123b implement source.stop
new 1f54355 Update javadoc
new 91a7fc6 Apply Spotless
new 0193941 Enable batch Validates Runner tests for Structured Streaming
Runner
new 55b98f8 Limit the number of partitions to make tests go 300% faster
new 86d03a4 Fixes ParDo not calling setup and not tearing down if
exception on startBundle
new a990264 Pass transform based doFnSchemaInformation in ParDo
translation
new 988a6cf Consider null object case on RowHelpers, fixes empty side
inputs tests.
new 6468c43 Put back batch/simpleSourceTest.testBoundedSource
new 018d781 Update windowAssignTest
new 258d565 Add comment about checkpoint mark
new 60afc9c Re-code GroupByKeyTranslatorBatch to conserve windowing
instead of unwindowing/windowing(GlobalWindow): simplify code, use
ReduceFnRunner to merge the windows
new b4c70d9 re-enable reduceFnRunner timers for output
new 9721a02 Improve visibility of debug messages
new 8b5424a Add a test that GBK preserves windowing
new 653351f Add TODO in Combine translations
new 5e4e034 Update KVHelpers.extractKey() to deal with WindowedValue and
update GBK and CPK
new 55c6f20 Fix comment about schemas
new 3c5eb1f Implement reduce part of CombineGlobally translation with
windowing
new 7b4ba5b Output data after combine
new 200e6ce Implement merge accumulators part of CombineGlobally
translation with windowing
new ec7ac95 Fix encoder in combine call
new 8ad7d4f Revert extractKey while combinePerKey is not done (so that it
compiles)
new 6353f9c Apply a groupByKey avoids for some reason that the spark
structured streaming fmwk casts data to Row which makes it impossible to
deserialize without the coder shipped into the data. For performance reasons
(avoid memory consumption and having to deserialize), we do not ship coder +
data. Also add a mapparitions before GBK to avoid shuffling
new 29a8e83 Fix case when a window does not merge into any other window
new be71f05 Fix wrong encoder in combineGlobally GBK
new 47f3aa5 Fix bug in the window merging logic
new 4e9c9e1 Remove the mapPartition that adds a key per partition because
otherwise spark will reduce values per key instead of globally
new 5adeca5 Remove CombineGlobally translation because it is less
performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
new 3ff116a Now that there is only Combine.PerKey translation, make only
one Aggregator
new 11d504b Clean no more needed KVHelpers
new 33f1a39 Clean not more needed RowHelpers
new 1729bb7 Clean not more needed WindowingHelpers
new 33c1554 Fix javadoc of AggregatorCombiner
new d263ff3 Fixed immutable list bug
new 94a7be6 add comment in combine globally test
new 9a7e5a3 Clean groupByKeyTest
new 7dde4fb Add a test that combine per key preserves windowing
new 8943636 Ignore for now not working test testCombineGlobally
new 6b66509 Add metrics support in DoFn
new 9472863 Add missing dependencies to run Spark Structured Streaming
Runner on Nexmark
new 186e8ac Add setEnableSparkMetricSinks() method
new b11bd8b Fix javadoc
new 6d4a75d Fix accumulators initialization in Combine that prevented
CombineGlobally to work.
new f1f58a9 Add a test to check that CombineGlobally preserves windowing
new 9879bc9 Persist all output Dataset if there are multiple outputs in
pipeline Enabled Use*Metrics tests
new 3fd985f Added metrics sinks and tests
new 8a9d97c Make spotless happy
new ba68f6c Add PipelineResults to Spark structured streaming.
new eb3c7d7 Update log4j configuration
new 2be3d94 Add spark execution plans extended debug messages.
new 8600bad Print number of leaf datasets
new d6830fb fixup! Add PipelineResults to Spark structured streaming.
new 9b8fd33 Remove no more needed AggregatorCombinerPerKey (there is only
AggregatorCombiner)
new 53374e1 After testing performance and correctness, launch pipeline
with dataset.foreach(). Make both test mode and production mode use foreach for
uniformity. Move dataset print as a utility method
new 9472a41 Add a TODO on perf improvement of Pardo translation
new 9de62ab Improve Pardo translation performance: avoid calling a filter
transform when there is only one output tag
new 6bf5a93 Use "sparkMaster" in local mode to obtain number of shuffle
partitions + spotless apply
This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version. This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:
 * -- * -- B -- O -- O -- O   (d77d229)
            \
             N -- N -- N   refs/heads/spark-runner_structured-streaming (6bf5a93)
You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.
Any revisions marked "omit" are not gone; other references still
refer to them. Any revisions marked "discard" are gone forever.
The 208 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
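The B/O/N rewrite described above can be reproduced in a throwaway repository. This is a minimal sketch, not anything from the Beam repository: every path, branch name, and commit message below is hypothetical.

```shell
#!/bin/sh
# Sketch: how a --force push replaces revisions O with rewritten revisions N.
# All names here (origin.git, work, main, f) are made-up for illustration.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q --bare origin.git
git clone -q origin.git work 2>/dev/null; cd work
git config user.email dev@example.com; git config user.name Dev

echo base > f; git add f; git commit -qm "B: common base"
echo old  > f; git commit -qam "O: revision to be discarded"
git push -q origin HEAD:main            # remote history is now B -- O

git reset -q --hard HEAD~1              # drop O locally
echo new > f; git commit -qam "N: rewritten revision"
git push -q --force origin HEAD:main    # remote history is now B -- N

git log --oneline origin/main           # shows N and B; O is unreachable
```

For this particular email, a reader who had fetched the branch before the force push could compare the discarded and replacement histories with `git range-diff d77d229...6bf5a93` (the three-dot form diffs both ranges from their common base B).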
Summary of changes:
.test-infra/jenkins/Infrastructure.groovy | 1 +
.../job_LoadTests_Combine_Flink_Python.groovy | 2 +-
.../jenkins/job_LoadTests_GBK_Flink_Python.groovy | 155 +++++-
.../org/apache/beam/gradle/BeamModulePlugin.groovy | 2 +-
release/src/main/scripts/run_rc_validation.sh | 540 +++++++++------------
release/src/main/scripts/script.config | 135 ++++++
runners/spark/build.gradle | 11 +-
.../main/java/org/apache/beam/sdk/io/TextIO.java | 3 +-
.../beam/sdk/extensions/sql/impl/BeamSqlEnv.java | 9 +-
.../extensions/sql/impl/CalciteQueryPlanner.java | 7 +-
.../sdk/extensions/sql/impl/planner/NodeStats.java | 86 ++++
.../sql/impl/planner/NodeStatsMetadata.java | 55 +++
.../sql/impl/planner/RelMdNodeStats.java | 84 ++++
.../sql/impl/rel/BeamAggregationRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamCalcRel.java | 6 +
.../sdk/extensions/sql/impl/rel/BeamIOSinkRel.java | 7 +
.../extensions/sql/impl/rel/BeamIOSourceRel.java | 6 +
.../extensions/sql/impl/rel/BeamIntersectRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamJoinRel.java | 6 +
.../sdk/extensions/sql/impl/rel/BeamMinusRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamRelNode.java | 4 +
.../sdk/extensions/sql/impl/rel/BeamSortRel.java | 7 +
.../extensions/sql/impl/rel/BeamUncollectRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamUnionRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamUnnestRel.java | 7 +
.../sdk/extensions/sql/impl/rel/BeamValuesRel.java | 7 +
.../extensions/sql/impl/planner/NodeStatsTest.java | 79 +++
.../org/apache/beam/sdk/io/parquet/ParquetIO.java | 1 +
.../java/org/apache/beam/sdk/io/xml/XmlIO.java | 1 +
sdks/python/apache_beam/io/gcp/bigquery.py | 5 -
.../apache_beam/io/gcp/bigquery_file_loads.py | 4 +
.../io/gcp/datastore/v1new/datastoreio.py | 11 +-
.../io/gcp/datastore/v1new/datastoreio_test.py | 16 +
.../apache_beam/io/gcp/datastore/v1new/types.py | 25 +-
.../io/gcp/datastore/v1new/types_test.py | 13 +-
.../runners/direct/transform_evaluator.py | 62 +--
.../interactive/display/pipeline_graph_renderer.py | 2 +-
.../apache_beam/runners/worker/log_handler.py | 26 +-
.../apache_beam/runners/worker/sdk_worker_main.py | 25 +-
website/_config.yml | 2 +-
website/src/.htaccess | 2 +-
website/src/_data/authors.yml | 3 +
website/src/_posts/2019-07-31-beam-2.14.0.md | 106 ++++
website/src/get-started/downloads.md | 7 +
44 files changed, 1152 insertions(+), 413 deletions(-)
create mode 100755 release/src/main/scripts/script.config
create mode 100644
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/NodeStats.java
create mode 100644
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/NodeStatsMetadata.java
create mode 100644
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/RelMdNodeStats.java
create mode 100644
sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/impl/planner/NodeStatsTest.java
create mode 100644 website/src/_posts/2019-07-31-beam-2.14.0.md