This is an automated email from the ASF dual-hosted git repository.
echauchot pushed a change to branch spark-runner_structured-streaming
in repository https://gitbox.apache.org/repos/asf/beam.git.
discard 620a27a Remove Encoders based on kryo now that we call Beam coders in
the runner
discard 824b344 Fix: Remove generic hack of using object. Use actual Coder
encodedType in Encoders
discard 27ef6de Remove unneeded cast
discard 6a27839 Use beam encoders also in the output of the source translation
discard 62a87b6 Apply spotless, fix typo and javadoc
discard c5e78a0 Apply new Encoders to Pardo. Replace Tuple2Coder with
MultiOutputCoder to deal with multiple output to use in Spark Encoder for
DoFnRunner
discard 039f58a Apply new Encoders to GroupByKey
discard 21accab Create a Tuple2Coder to encode scala tuple2
discard 29f7e93 Apply new Encoders to AggregatorCombiner
discard 7f1060a Apply new Encoders to Window assign translation
discard c48d032 Ignore long time failing test: SparkMetricsSinkTest
discard 68d3d67 Improve performance of source: the mapper already calls
windowedValueCoder.decode, no need to call it also in the Spark encoder
discard 3cc256e Apply new Encoders to Read source
discard 7d456b4 Apply new Encoders to CombinePerKey
discard c33fdda Catch Exception instead of IOException because some coders do
not throw Exceptions at all (e.g. VoidCoder)
discard c8bfcf3 Make Encoders expressions serializable
discard 72c267c Wrap exceptions in UserCoderExceptions
discard c6f2ac9 Apply spotless and checkstyle and add javadocs
discard 78b2d22 Add an assert of equality in the encoders test
discard 34e8aa8 Fix generated code: uniform exceptions catching, fix
parenthesis and variable declarations
discard f48067b Fix equal and hashcode
discard ca01777 Remove example code
discard 50060a8 Remove lazy init of beam coder because there is no generic
way of instantiating a beam coder
discard 0cf2c87 Fix beam coder lazy init using reflection: use .class + try
catch + cast
discard d7c9a4a Fix getting the output value in code generation
discard 8b07ec8 Fix ExpressionEncoder generated code: typos, try catch, fqcn
discard fdba22d Fix warning in coder construction by reflection
discard e6b68a8 Lazy init coder because coder instance cannot be interpolated
by catalyst
discard d5645ff Fix code generation in Beam coder wrapper
discard e4478ff Add a simple spark native test to test Beam coders wrapping
into Spark Encoders
discard 2aaf07a Conform to spark ExpressionEncoders: pass classTags,
implement scala Product, pass children from within the ExpressionEncoder, fix
visibilities
discard fff5092 Fix scala Product in Encoders to avoid StackOverflow
discard c9e3534 type erasure: spark encoders require a Class<T>, pass Object
and cast to Class<T>
discard 5fa6331 Wrap Beam Coders into Spark Encoders using ExpressionEncoder:
deserialization part
discard a5c7da3 Wrap Beam Coders into Spark Encoders using ExpressionEncoder:
serialization part
discard 20d5bbd Use "sparkMaster" in local mode to obtain number of shuffle
partitions + spotless apply
discard 22d6466 Improve Pardo translation performance: avoid calling a filter
transform when there is only one output tag
discard 93d425a After testing performance and correctness, launch pipeline
with dataset.foreach(). Make both test mode and production mode use foreach for
uniformity. Move dataset print as a utility method
discard 61f487f Remove no more needed AggregatorCombinerPerKey (there is only
AggregatorCombiner)
discard f8a5046 fixup! Add PipelineResults to Spark structured streaming.
discard 8aafd50 Print number of leaf datasets
discard ec43374 Add spark execution plans extended debug messages.
discard 3b15128 Update log4j configuration
discard cde225a Add PipelineResults to Spark structured streaming.
discard 0e36b19 Make spotless happy
discard 4aaf456 Added metrics sinks and tests
discard dc939c8 Persist all output Dataset if there are multiple outputs in
pipeline. Enabled Use*Metrics tests
discard dab3c2e Add a test to check that CombineGlobally preserves windowing
discard 6e9ccdd Fix accumulators initialization in Combine that prevented
CombineGlobally to work.
discard a797884 Fix javadoc
discard 476bc20 Add setEnableSparkMetricSinks() method
discard 51ca79a Add missing dependencies to run Spark Structured Streaming
Runner on Nexmark
discard 5ed3e03 Add metrics support in DoFn
discard d29c64e Ignore for now not working test testCombineGlobally
discard ff50ccb Add a test that combine per key preserves windowing
discard 6638522 Clean groupByKeyTest
discard 8c499a5 add comment in combine globally test
discard f68ed7a Fixed immutable list bug
discard 9601649 Fix javadoc of AggregatorCombiner
discard 7784f30 Clean no more needed WindowingHelpers
discard 3bd95df Clean no more needed RowHelpers
discard 4b5af9c Clean no more needed KVHelpers
discard 569a9cb Now that there is only Combine.PerKey translation, make only
one Aggregator
discard ad0c179 Remove CombineGlobally translation because it is less
performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
discard 7b6c914 Remove the mapPartition that adds a key per partition because
otherwise spark will reduce values per key instead of globally
discard 1589877 Fix bug in the window merging logic
discard 4e34632 Fix wrong encoder in combineGlobally GBK
discard f00e7d5 Fix case when a window does not merge into any other window
discard f36e5c3 Applying a groupByKey avoids, for some reason, the spark
structured streaming framework casting data to Row, which makes it impossible
to deserialize without the coder shipped into the data. For performance reasons
(avoid memory consumption and having to deserialize), we do not ship coder +
data. Also add a mapPartitions before GBK to avoid shuffling
discard 4f4744a Revert extractKey while combinePerKey is not done (so that it
compiles)
discard 6de9acf Fix encoder in combine call
discard 70e3d66 Implement merge accumulators part of CombineGlobally
translation with windowing
discard 28ba572 Output data after combine
discard 960d245 Implement reduce part of CombineGlobally translation with
windowing
discard 595d9eb Fix comment about schemas
discard edaa37f Update KVHelpers.extractKey() to deal with WindowedValue and
update GBK and CPK
discard 5dc8c24 Add TODO in Combine translations
discard 14c703f Add a test that GBK preserves windowing
discard 28ee71c Improve visibility of debug messages
discard 47b7132 re-enable reduceFnRunner timers for output
discard fed93da Re-code GroupByKeyTranslatorBatch to conserve windowing
instead of unwindowing/windowing(GlobalWindow): simplify code, use
ReduceFnRunner to merge the windows
discard c23c07e Add comment about checkpoint mark
discard a3e29b4 Update windowAssignTest
discard 68e3ae2 Put back batch/simpleSourceTest.testBoundedSource
discard cc7a52d Consider null object case on RowHelpers, fixes empty side
inputs tests.
discard a94045c Pass transform based doFnSchemaInformation in ParDo
translation
discard 1be0c8a Fixes ParDo not calling setup and not tearing down if
exception on startBundle
discard 2bfd3d5 Limit the number of partitions to make tests go 300% faster
discard c2896b6 Enable batch Validates Runner tests for Structured Streaming
Runner
discard e4ef07b Apply Spotless
discard ab6a879 Update javadoc
discard fdcd346 implement source.stop
discard d2e65df Ignore spark offsets (cf javadoc)
discard d285dd1 Use PAssert in Spark Structured Streaming transform tests
discard 9354339 Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid an eventual
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
discard fc877cd Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
discard 47376e3 Fix logging levels in Spark Structured Streaming translation
discard 03eb450 Fix spotless issues after rebase
discard 58f97b8 Pass doFnSchemaInformation to ParDo batch translation
discard 2628393 Fix non-vendored imports from Spark Streaming Runner classes
discard cb72394 Merge Spark Structured Streaming runner into main Spark
module. Remove spark-structured-streaming module. Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOptions and Registrar
for Structured Streaming Runner
discard c9a8c8c Fix access level issues, typos and modernize code to Java 8
style
discard 19e5fdf Disable never ending test SimpleSourceTest.testUnboundedSource
discard c68e875 Apply spotless and fix spotbugs warnings
discard 79e85ec Deal with checkpoint and offset based read
discard b7c68bd Continue impl of offsets for streaming source
discard afa6a48 Clean streaming source
discard 8caa982 Clean unneeded 0 arg constructor in batch source
discard 6e94948 Specify checkpointLocation at the pipeline start
discard 4527615 Add source streaming test
discard ce46b9b Add translators registry in PipelineTranslatorStreaming
discard cb5dffa Add a TODO on spark output modes
discard 4030fb0 Implement first streaming source
discard 81c0bbe Add streaming source initialisation
discard 2ad1f15 Add unchecked warning suppression
discard cb1a99c Added TODO comment for ReshuffleTranslatorBatch
discard 530dfb0 Added using CachedSideInputReader
discard d8ee03e Don't use Reshuffle translation
discard c879337 Fix CheckStyle violations
discard d759a19 Added SideInput support
discard 1355ece Fix javadoc
discard 41e6a19 Implement WindowAssignTest
discard bf2af77 Implement WindowAssignTranslatorBatch
discard 383d58d Cleaning
discard 1244549 Fix encoder bug in combinePerKey
discard ac67ada Add explanation about receiving a Row as input in the combiner
discard a2d1975 Use more generic Row instead of GenericRowWithSchema
discard a72afd8 Fix combine. For unknown reason GenericRowWithSchema is used
as input of combine so extract its content to be able to proceed
discard 00acd7d Update test with Long
discard b13839d Fix various type checking issues in Combine.Globally
discard 3c25348 Get back to classes in translators resolution because URNs
cannot translate Combine.Globally
discard 8d0a8b5 Cleaning
discard 0a88819 Add CombineGlobally translation to avoid translating
Combine.perKey as a composite transform based on Combine.PerKey (which uses low
perf GBK)
discard f48c109 Introduce RowHelpers
discard 2a1d74e Add combinePerKey and CombineGlobally tests
discard 684fc4a Fix combiner using KV as input, use binary encoders in place
of accumulatorEncoder and outputEncoder, use helpers, spotless
discard 72d338b Introduce WindowingHelpers (and helpers package) and use it
in Pardo, GBK and CombinePerKey
discard 4d3be61 Improve type checking of Tuple2 encoder
discard 2c465f8 First version of combinePerKey
discard d33508e Extract binary schema creation in a helper class
discard bfc10a2 Fix getSideInputs
discard 4cbdbb8 Generalize the use of SerializablePipelineOptions in place of
(not serializable) PipelineOptions
discard 08b580b Rename SparkDoFnFilterFunction to DoFnFilterFunction for
consistency
discard 674a048 Add a test for the most simple possible Combine
discard 3716673 Added "testTwoPardoInRow"
discard 2482172 Fix for test elements container in GroupByKeyTest
discard bc7fab4 Rename SparkOutputManager for consistency
discard 314d935 Fix kryo issue in GBK translator with a workaround
discard 3ba4bc6 Simplify logic of ParDo translator
discard b7a76d9 Don't use deprecated sideInput.getWindowingStrategyInternal()
discard 6d723b1 Rename SparkSideInputReader class and rename pruneOutput() to
pruneOutputFilteredByTag()
discard 9bee53f Fixed Javadoc error
discard 2f37994 Apply spotless
discard 7c5b778 Fix Encoders: create an Encoder for every manipulated type to
avoid Spark fallback to genericRowWithSchema and cast to allow having Encoders
with generic types such as WindowedValue<T> and get the type checking back
discard 1b244e9 Fail in case of having SideInputs or State/Timers
discard a74d149 Add ComplexSourceTest
discard 21902b6 Remove no more needed putDatasetRaw
discard a8e50ad Port latest changes of ReadSourceTranslatorBatch to
ReadSourceTranslatorStreaming
discard 14368ff Fix type checking with Encoder of WindowedValue<T>
discard fe7fb4e Add comments and TODO to GroupByKeyTranslatorBatch
discard 779e621 Add GroupByKeyTest
discard fe19d6c Clean
discard ae44706 Address minor review notes
discard d4acf25 Add ParDoTest
discard f47bb0a Clean
discard 1d31831 Fix split bug
discard e12226a Remove bundleSize parameter and always use spark default
parallelism
discard defd5e6 Cleaning
discard b4230dc Fix testMode output to comply with new binary schema
discard 84bdf70 Fix errorprone
discard e4d4b4f Comment schema choices
discard 675bf94 Serialize windowedValue to byte[] in source to be able to
specify a binary dataset schema and deserialize windowedValue from Row to get a
dataset<WindowedValue>
discard e0c8fbd First attempt for ParDo primitive implementation
discard fc54404 Add flatten test
discard 3aea53b Enable gradle build scan
discard 3dc4dc3 Enable test mode
discard 047af3e Put all transform translators Serializable
discard 23ca155 Simplify beam reader creation as it is created once the
source has already been partitioned
discard e54cbc6 Fix SourceTest
discard 74693b2 Move SourceTest to same package as tested class
discard 00f2e11 Add serialization test
discard 1720e6b Add SerializationDebugger
discard 625056e Fix serialization issues
discard 0e85242 Clean unneeded fields in DatasetReader
discard 4936687 improve readability of options passing to the source
discard 0ffd98d Fix pipeline triggering: use a spark action instead of
writing the dataset
discard 02933bd Refactor SourceTest to a UTest instead of a main
discard bb830be Checkstyle and Findbugs
discard 037db6e Clean
discard d12cc14 Add empty 0-arg constructor for mock source
discard 9251dcb Add a dummy schema for reader
discard 02458a7 Apply spotless and fix checkstyle
discard 101f6f2 Use new PipelineOptionsSerializationUtils
discard ca5a120 Add missing 0-arg public constructor
discard 2b64bd2 Wire real SourceTransform and not mock and update the test
discard ca5f70c Refactor DatasetSource fields
discard 9d8dd90 Pass Beam Source and PipelineOptions to the spark DataSource
as serialized strings
discard 5b0f9a2 Move Source and translator mocks to a mock package.
discard 08c05d6 Add ReadSourceTranslatorStreaming
discard b617ba4 Clean
discard d6e905b Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
discard 64c8202 Split batch and streaming sources and translators
discard f9ed0dd Run pipeline in batch mode or in streaming mode
discard 4aa321c Move DatasetSourceMock to proper batch mode
discard 44644d3 clean deps
discard cb03179 Use raw WindowedValue so that spark Encoders could work
(temporary)
discard bea28b1 fix mock, wire mock in translators and create a main test.
discard eb2fa49 Add source mocks
discard b57367f Experiment over using spark Catalog to pass in Beam Source
through spark Table
discard 6795e4c Improve type enforcement in ReadSourceTranslator
discard b8cb742 Improve exception flow
discard 3cb2a76 start source instantiation
discard 5701c75 Apply spotless
discard 238e57a update TODO
discard 3ae60a4 Implement read transform
discard b0920e3 Use Iterators.transform() to return Iterable
discard 1298b77 Add primitive GroupByKeyTranslatorBatch implementation
discard 9d721a7 Add Flatten transformation translator
discard 8c1a012 Create Datasets manipulation methods
discard b458c64 Create PCollections manipulation methods
discard d75ac4a Add basic pipeline execution. Refactor translatePipeline() to
return the translationContext on which we can run startPipeline()
discard cc07798 Added SparkRunnerRegistrar
discard cfd3abf Add precise TODO for multiple TransformTranslator per
transform URN
discard 7612d17 Postpone batch qualifier in all class names for readability
discard 1c7d381 Add TODOs
discard bf5619d Make codestyle and firebug happy
discard ff06081 apply spotless
discard 6ede65c Move common translation context components to superclass
discard 8875d48d Move SparkTransformOverrides to correct package
discard 83f44e0 Improve javadocs
discard ae2392e Make transform translation clearer: renaming, comments
discard 31af0f1 Refactoring: -move batch/streaming common translation visitor
and utility methods to PipelineTranslator -rename batch dedicated classes to
Batch* to differentiate with their streaming counterparts -Introduce
TranslationContext for common batch/streaming components
discard 8abe116 Initialise BatchTranslationContext
discard 6f1fda8 Organise methods in PipelineTranslator
discard 5990104 Renames: better differentiate pipeline translator from
transform translator
discard a5b9894 Wire node translators with pipeline translator
discard 9fea139 Add nodes translators structure
discard d7df588 Add global pipeline translation structure
discard ee8875a Start pipeline translation
discard 253123c Add SparkPipelineOptions
discard 2558c22 Fix missing dep
discard 7794f02 Add an empty spark-structured-streaming runner project
targeting spark 2.4.0
add 14a7a24 [BEAM-8470] Add an empty spark-structured-streaming runner
project targeting spark 2.4.0
add 0d314d2 [BEAM-8470] Fix missing dep
add 44aa39d [BEAM-8470] Add SparkPipelineOptions
add 8b1f07e [BEAM-8470] Start pipeline translation
add 5b7efb1 [BEAM-8470] Add global pipeline translation structure
add c3828ea [BEAM-8470] Add nodes translators structure
add c4c38b3 [BEAM-8470] Wire node translators with pipeline translator
add 2222e24 [BEAM-8470] Renames: better differentiate pipeline translator
from transform translator
add 1ffbf4e [BEAM-8470] Organise methods in PipelineTranslator
add 1c1bbab [BEAM-8470] Initialise BatchTranslationContext
add b8d4a96 [BEAM-8470] Refactoring: -move batch/streaming common
translation visitor and utility methods to PipelineTranslator -rename batch
dedicated classes to Batch* to differentiate with their streaming counterparts
-Introduce TranslationContext for common batch/streaming components
add fef01b3 [BEAM-8470] Make transform translation clearer: renaming,
comments
add 666c011 [BEAM-8470] Improve javadocs
add 5eeca80 [BEAM-8470] Move SparkTransformOverrides to correct package
add fd888deb [BEAM-8470] Move common translation context components to
superclass
add 4e94975 [BEAM-8470] apply spotless
add 12ed3f5 [BEAM-8470] Make codestyle and firebug happy
add fbc7fbc [BEAM-8470] Add TODOs
add 3f29bff [BEAM-8470] Postpone batch qualifier in all class names
for readability
add 79b4541 [BEAM-8470] Add precise TODO for multiple TransformTranslator
per transform URN
add 7c8cd47 [BEAM-8470] Added SparkRunnerRegistrar
add d757185 [BEAM-8470] Add basic pipeline execution. Refactor
translatePipeline() to return the translationContext on which we can run
startPipeline()
add a36bae0 [BEAM-8470] Create PCollections manipulation methods
add e1b7644 [BEAM-8470] Create Datasets manipulation methods
add ad26cec [BEAM-8470] Add Flatten transformation translator
add 129e95f [BEAM-8470] Add primitive GroupByKeyTranslatorBatch
implementation
add 6965842 [BEAM-8470] Use Iterators.transform() to return Iterable
add cce30c9 [BEAM-8470] Implement read transform
add 7901d73 [BEAM-8470] update TODO
add eb7c77e [BEAM-8470] Apply spotless
add cb9bb99 [BEAM-8470] start source instantiation
add 6bb32a5 [BEAM-8470] Improve exception flow
add 1970760 [BEAM-8470] Improve type enforcement in ReadSourceTranslator
add c863dca [BEAM-8470] Experiment over using spark Catalog to pass in
Beam Source through spark Table
add 2a960dd [BEAM-8470] Add source mocks
add be8344e [BEAM-8470] fix mock, wire mock in translators and create a
main test.
add 756eec3 [BEAM-8470] Use raw WindowedValue so that spark Encoders
could work (temporary)
add 4480578 [BEAM-8470] clean deps
add 26b2a91 [BEAM-8470] Move DatasetSourceMock to proper batch mode
add 5385a8f [BEAM-8470] Run pipeline in batch mode or in streaming mode
add 8051fac [BEAM-8470] Split batch and streaming sources and translators
add ee0ea0e [BEAM-8470] Use raw Encoder<WindowedValue> also in regular
ReadSourceTranslatorBatch
add ae75196 [BEAM-8470] Clean
add 4a59b09 [BEAM-8470] Add ReadSourceTranslatorStreaming
add 68f7230 [BEAM-8470] Move Source and translator mocks to a mock
package.
add 3368a9e [BEAM-8470] Pass Beam Source and PipelineOptions to the spark
DataSource as serialized strings
add c569f1b [BEAM-8470] Refactor DatasetSource fields
add db036ca [BEAM-8470] Wire real SourceTransform and not mock and update
the test
add 6cac8b2 [BEAM-8470] Add missing 0-arg public constructor
add dc615f0 [BEAM-8470] Use new PipelineOptionsSerializationUtils
add adfd237 [BEAM-8470] Apply spotless and fix checkstyle
add aeee309 [BEAM-8470] Add a dummy schema for reader
add e9b0488 [BEAM-8470] Add empty 0-arg constructor for mock source
add 9537add [BEAM-8470] Clean
add 8b232df [BEAM-8470] Checkstyle and Findbugs
add 344f69e [BEAM-8470] Refactor SourceTest to a UTest instead of a main
add 285aab4 [BEAM-8470] Fix pipeline triggering: use a spark action
instead of writing the dataset
add 244459f [BEAM-8470] improve readability of options passing to the
source
add 1102280 [BEAM-8470] Clean unneeded fields in DatasetReader
add b3796a2 [BEAM-8470] Fix serialization issues
add 1dc0352 [BEAM-8470] Add SerializationDebugger
add f5a0ba7 [BEAM-8470] Add serialization test
add 63aa493 [BEAM-8470] Move SourceTest to same package as tested class
add baea877 [BEAM-8470] Fix SourceTest
add bb7db53 [BEAM-8470] Simplify beam reader creation as it is created
once the source has already been partitioned
add 833df0e [BEAM-8470] Put all transform translators Serializable
add 3004db7 [BEAM-8470] Enable test mode
add 3dc7d95 [BEAM-8470] Enable gradle build scan
add 068f63d [BEAM-8470] Add flatten test
add 5dc598a [BEAM-8470] First attempt for ParDo primitive implementation
add 30cbbf4 [BEAM-8470] Serialize windowedValue to byte[] in source to be
able to specify a binary dataset schema and deserialize windowedValue from Row
to get a dataset<WindowedValue>
add 176342f [BEAM-8470] Comment schema choices
add 730eed3 [BEAM-8470] Fix errorprone
add 067e756 [BEAM-8470] Fix testMode output to comply with new binary
schema
add 1960559 [BEAM-8470] Cleaning
add 7e5399e [BEAM-8470] Remove bundleSize parameter and always use spark
default parallelism
add e63d794 [BEAM-8470] Fix split bug
add fc2239d [BEAM-8470] Clean
add f152deb [BEAM-8470] Add ParDoTest
add 7fd2de1 [BEAM-8470] Address minor review notes
add bd385ff [BEAM-8470] Clean
add bdeb934 [BEAM-8470] Add GroupByKeyTest
add 990e220 [BEAM-8470] Add comments and TODO to GroupByKeyTranslatorBatch
add 997043d [BEAM-8470] Fix type checking with Encoder of WindowedValue<T>
add 61bf40c [BEAM-8470] Port latest changes of ReadSourceTranslatorBatch
to ReadSourceTranslatorStreaming
add c8f2078 [BEAM-8470] Remove no more needed putDatasetRaw
add f7a008c [BEAM-8470] Add ComplexSourceTest
add 728dc73 [BEAM-8470] Fail in case of having SideInputs or State/Timers
add 0cd30d1 [BEAM-8470] Fix Encoders: create an Encoder for every
manipulated type to avoid Spark fallback to genericRowWithSchema and cast to
allow having Encoders with generic types such as WindowedValue<T> and get the
type checking back
add 2bea926 [BEAM-8470] Apply spotless
add a048341 [BEAM-8470] Fixed Javadoc error
add 8286afc [BEAM-8470] Rename SparkSideInputReader class and rename
pruneOutput() to pruneOutputFilteredByTag()
add 313bc66 [BEAM-8470] Don't use deprecated
sideInput.getWindowingStrategyInternal()
add 6657d7d [BEAM-8470] Simplify logic of ParDo translator
add 359a6b0 [BEAM-8470] Fix kryo issue in GBK translator with a workaround
add b7ec102 [BEAM-8470] Rename SparkOutputManager for consistency
add 9275c82 [BEAM-8470] Fix for test elements container in GroupByKeyTest
add b136699 [BEAM-8470] Added "testTwoPardoInRow"
add a9eed5b [BEAM-8470] Add a test for the most simple possible Combine
add 222ce45 [BEAM-8470] Rename SparkDoFnFilterFunction to
DoFnFilterFunction for consistency
add 65dcd0b [BEAM-8470] Generalize the use of SerializablePipelineOptions
in place of (not serializable) PipelineOptions
add d24bc86 [BEAM-8470] Fix getSideInputs
add 126ee79 [BEAM-8470] Extract binary schema creation in a helper class
add cb666d3 [BEAM-8470] First version of combinePerKey
add e282801 [BEAM-8470] Improve type checking of Tuple2 encoder
add f277be2 [BEAM-8470] Introduce WindowingHelpers (and helpers package)
and use it in Pardo, GBK and CombinePerKey
add 35f7128 [BEAM-8470] Fix combiner using KV as input, use binary
encoders in place of accumulatorEncoder and outputEncoder, use helpers, spotless
add ca25210 [BEAM-8470] Add combinePerKey and CombineGlobally tests
add f416e20 [BEAM-8470] Introduce RowHelpers
add 4f670a5 [BEAM-8470] Add CombineGlobally translation to avoid
translating Combine.perKey as a composite transform based on Combine.PerKey
(which uses low perf GBK)
add 071aab5 [BEAM-8470] Cleaning
add 2aff552 [BEAM-8470] Get back to classes in translators resolution
because URNs cannot translate Combine.Globally
add cd19577 [BEAM-8470] Fix various type checking issues in
Combine.Globally
add 3499784 [BEAM-8470] Update test with Long
add f94f5ca [BEAM-8470] Fix combine. For unknown reason
GenericRowWithSchema is used as input of combine so extract its content to be
able to proceed
add 4bb19ce [BEAM-8470] Use more generic Row instead of
GenericRowWithSchema
add c3b4686 [BEAM-8470] Add explanation about receiving a Row as input in
the combiner
add 82131f5 [BEAM-8470] Fix encoder bug in combinePerKey
add a781ec1 [BEAM-8470] Cleaning
add 0c53291 [BEAM-8470] Implement WindowAssignTranslatorBatch
add 75dc161 [BEAM-8470] Implement WindowAssignTest
add cb0e79a [BEAM-8470] Fix javadoc
add 95564cd [BEAM-8470] Added SideInput support
add d51e934 [BEAM-8470] Fix CheckStyle violations
add 0f1c9ff [BEAM-8470] Don't use Reshuffle translation
add c4b8e86 [BEAM-8470] Added using CachedSideInputReader
add 4c10a48 [BEAM-8470] Added TODO comment for ReshuffleTranslatorBatch
add ff2ed77 [BEAM-8470] Add unchecked warning suppression
add f5cbbda [BEAM-8470] Add streaming source initialisation
add f20a878 [BEAM-8470] Implement first streaming source
add 0863bf5 [BEAM-8470] Add a TODO on spark output modes
add 2470f3e [BEAM-8470] Add translators registry in
PipelineTranslatorStreaming
add 4ab7b03 [BEAM-8470] Add source streaming test
add 9f2caa8 [BEAM-8470] Specify checkpointLocation at the pipeline start
add 79f22a8 [BEAM-8470] Clean unneeded 0 arg constructor in batch source
add 850f56a [BEAM-8470] Clean streaming source
add e260811 [BEAM-8470] Continue impl of offsets for streaming source
add 45cd090 [BEAM-8470] Deal with checkpoint and offset based read
add 2070278 [BEAM-8470] Apply spotless and fix spotbugs warnings
add 52e8689 [BEAM-8470] Disable never ending test
SimpleSourceTest.testUnboundedSource
add 673731f [BEAM-8470] Fix access level issues, typos and modernize code
to Java 8 style
add f106a04 [BEAM-8470] Merge Spark Structured Streaming runner into main
Spark module. Remove spark-structured-streaming module. Rename Runner to
SparkStructuredStreamingRunner. Remove specific PipelineOptions and Registrar
for Structured Streaming Runner
add 1c82b5e [BEAM-8470] Fix non-vendored imports from Spark Streaming
Runner classes
add 53acbda [BEAM-8470] Pass doFnSchemaInformation to ParDo batch
translation
add 102f6fc [BEAM-8470] Fix spotless issues after rebase
add fecc40b [BEAM-8470] Fix logging levels in Spark Structured Streaming
translation
add f2dd748 [BEAM-8470] Add SparkStructuredStreamingPipelineOptions and
SparkCommonPipelineOptions - SparkStructuredStreamingPipelineOptions was added
to have the new runner rely only on its specific options.
add 5e4d878 [BEAM-8470] Rename SparkPipelineResult to
SparkStructuredStreamingPipelineResult This is done to avoid an eventual
collision with the one in SparkRunner. However this cannot happen at this
moment because it is package private, so it is also done for consistency.
add c9b3e51 [BEAM-8470] Use PAssert in Spark Structured Streaming
transform tests
add cbe80d4 [BEAM-8470] Ignore spark offsets (cf javadoc)
add 2427002 [BEAM-8470] implement source.stop
add 0031ef9 [BEAM-8470] Update javadoc
add 0c082b89 [BEAM-8470] Apply Spotless
add 9b7986a [BEAM-8470] Enable batch Validates Runner tests for
Structured Streaming Runner
add 7891116 [BEAM-8470] Limit the number of partitions to make tests go
300% faster
add 48eaf9d [BEAM-8470] Fixes ParDo not calling setup and not tearing
down if exception on startBundle
add 386080c [BEAM-8470] Pass transform based doFnSchemaInformation in
ParDo translation
add f780659 [BEAM-8470] Consider null object case on RowHelpers, fixes
empty side inputs tests.
add 355e95d [BEAM-8470] Put back batch/simpleSourceTest.testBoundedSource
add 2c49efc [BEAM-8470] Update windowAssignTest
add 0705e50 [BEAM-8470] Add comment about checkpoint mark
add 211cb98 [BEAM-8470] Re-code GroupByKeyTranslatorBatch to conserve
windowing instead of unwindowing/windowing(GlobalWindow): simplify code, use
ReduceFnRunner to merge the windows
add d52ef97 [BEAM-8470] re-enable reduceFnRunner timers for output
add 26fcb28 [BEAM-8470] Improve visibility of debug messages
add c5970f2 [BEAM-8470] Add a test that GBK preserves windowing
add 22abb05 [BEAM-8470] Add TODO in Combine translations
add dc4dda7 [BEAM-8470] Update KVHelpers.extractKey() to deal with
WindowedValue and update GBK and CPK
add 09bc7a1 [BEAM-8470] Fix comment about schemas
add 4c2585b [BEAM-8470] Implement reduce part of CombineGlobally
translation with windowing
add 7222d49 [BEAM-8470] Output data after combine
add a8ca39f [BEAM-8470] Implement merge accumulators part of
CombineGlobally translation with windowing
add 0bdc83c [BEAM-8470] Fix encoder in combine call
add 7237dca [BEAM-8470] Revert extractKey while combinePerKey is not done
(so that it compiles)
add a460052 [BEAM-8470] Applying a groupByKey avoids, for some reason,
the spark structured streaming framework casting data to Row, which makes it
impossible to deserialize without the coder shipped into the data. For
performance reasons (avoid memory consumption and having to deserialize), we do
not ship coder + data. Also add a mapPartitions before GBK to avoid shuffling
add 26f778e [BEAM-8470] Fix case when a window does not merge into any
other window
add 6f47514 [BEAM-8470] Fix wrong encoder in combineGlobally GBK
add d093cad [BEAM-8470] Fix bug in the window merging logic
add 5427856 [BEAM-8470] Remove the mapPartition that adds a key per
partition because otherwise spark will reduce values per key instead of globally
add c0e52c2 [BEAM-8470] Remove CombineGlobally translation because it is
less performant than the beam sdk one (key + combinePerKey.withHotkeyFanout)
add e7d3283 [BEAM-8470] Now that there is only Combine.PerKey
translation, make only one Aggregator
add 5a7bbf5 [BEAM-8470] Clean no more needed KVHelpers
add 96b147e [BEAM-8470] Clean no more needed RowHelpers
add d6683ad [BEAM-8470] Clean no more needed WindowingHelpers
add b19b792 [BEAM-8470] Fix javadoc of AggregatorCombiner
add c178172 [BEAM-8470] Fixed immutable list bug
add 98bc7a4 [BEAM-8470] add comment in combine globally test
add 85fe2d0 [BEAM-8470] Clean groupByKeyTest
add e205425 [BEAM-8470] Add a test that combine per key preserves
windowing
add b12e878 [BEAM-8470] Ignore for now not working test
testCombineGlobally
add 8215482 [BEAM-8470] Add metrics support in DoFn
add 35fef0f [BEAM-8470] Add missing dependencies to run Spark Structured
Streaming Runner on Nexmark
add dc1abf5 [BEAM-8470] Add setEnableSparkMetricSinks() method
add a7f04ed [BEAM-8470] Fix javadoc
add a6d0e58 [BEAM-8470] Fix accumulators initialization in Combine that
prevented CombineGlobally to work.
add 37dcb9a [BEAM-8470] Add a test to check that CombineGlobally
preserves windowing
add b59d8c5 [BEAM-8470] Persist all output Dataset if there are multiple
outputs in pipeline. Enabled Use*Metrics tests
add a004a56 [BEAM-8470] Added metrics sinks and tests
add 745ab6e [BEAM-8470] Make spotless happy
add 3391f8d [BEAM-8470] Add PipelineResults to Spark structured streaming.
add 7d7503a [BEAM-8470] Update log4j configuration
add 8c91c11 [BEAM-8470] Add spark execution plans extended debug messages.
add 7f71572 [BEAM-8470] Print number of leaf datasets
add 65c8457 [BEAM-8470] fixup! Add PipelineResults to Spark structured
streaming.
add f2388dc [BEAM-8470] Remove no more needed AggregatorCombinerPerKey
(there is only AggregatorCombiner)
add 134ee35 [BEAM-8470] After testing performance and correctness, launch
pipeline with dataset.foreach(). Make both test mode and production mode use
foreach for uniformity. Move dataset printing to a utility method
add a8e0ad9 [BEAM-8470] Improve Pardo translation performance: avoid
calling a filter transform when there is only one output tag
add c5b8799 [BEAM-8470] Use "sparkMaster" in local mode to obtain number
of shuffle partitions + spotless apply
add 5ce206b [BEAM-8470] Wrap Beam Coders into Spark Encoders using
ExpressionEncoder: serialization part
add 750037f [BEAM-8470] Wrap Beam Coders into Spark Encoders using
ExpressionEncoder: deserialization part
add 69aebb1 [BEAM-8470] type erasure: spark encoders require a Class<T>,
pass Object and cast to Class<T>
add 02a0f01 [BEAM-8470] Fix scala Product in Encoders to avoid
StackOverflow
add a0706d7 [BEAM-8470] Conform to spark ExpressionEncoders: pass
classTags, implement scala Product, pass children from within the
ExpressionEncoder, fix visibilities
add 3a87d37 [BEAM-8470] Add a simple Spark-native test for wrapping Beam
coders into Spark Encoders
add cca4034 [BEAM-8470] Fix code generation in Beam coder wrapper
add 8e341a1 [BEAM-8470] Lazy init coder because coder instance cannot be
interpolated by catalyst
add f1a555e [BEAM-8470] Fix warning in coder construction by reflection
add 482bcf6 [BEAM-8470] Fix ExpressionEncoder generated code: typos, try
catch, fqcn
add 44274ca [BEAM-8470] Fix getting the output value in code generation
add e2a134c [BEAM-8470] Fix beam coder lazy init using reflection: use
.class + try catch + cast
add cfff6f7 [BEAM-8470] Remove lazy init of beam coder because there is
no generic way of instantiating a beam coder
add 72ba1ea [BEAM-8470] Remove example code
add 83421c2 [BEAM-8470] Fix equal and hashcode
add dc5b243 [BEAM-8470] Fix generated code: uniform exception catching,
fix parentheses and variable declarations
add 6b31168 [BEAM-8470] Add an assert of equality in the encoders test
add ba33722 [BEAM-8470] Apply spotless and checkstyle and add javadocs
add 4055d11 [BEAM-8470] Wrap exceptions in UserCoderExceptions
add 2daed76 [BEAM-8470] Make Encoders expressions serializable
add e8463ce [BEAM-8470] Catch Exception instead of IOException because
some coders do not throw IOException at all (e.g. VoidCoder)
add 5ebc1e9 [BEAM-8470] Apply new Encoders to CombinePerKey
add 7e5b6df [BEAM-8470] Apply new Encoders to Read source
add cb82036 [BEAM-8470] Improve performance of source: the mapper already
calls windowedValueCoder.decode, no need to call it also in the Spark encoder
add 163b157 [BEAM-8470] Ignore long time failing test:
SparkMetricsSinkTest
add a7629eb [BEAM-8470] Apply new Encoders to Window assign translation
add 54fd760 [BEAM-8470] Apply new Encoders to AggregatorCombiner
add f1f2163 [BEAM-8470] Create a Tuple2Coder to encode scala tuple2
add fbda17a [BEAM-8470] Apply new Encoders to GroupByKey
add de92516 [BEAM-8470] Apply new Encoders to Pardo. Replace Tuple2Coder
with MultiOutputCoder to deal with multiple outputs, for use in the Spark
Encoder for DoFnRunner
add f180b4d [BEAM-8470] Apply spotless, fix typo and javadoc
add fbb8355 [BEAM-8470] Use beam encoders also in the output of the
source translation
add 2e1ed4b [BEAM-8470] Remove unneeded cast
add 9e7f118 [BEAM-8470] Fix: Remove generic hack of using object. Use
actual Coder encodedType in Encoders
add a0e6ca4 [BEAM-8470] Remove Encoders based on kryo now that we call
Beam coders in the runner
This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version. This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:
 * -- * -- B -- O -- O -- O (620a27a)
            \
             N -- N -- N   refs/heads/spark-runner_structured-streaming (a0e6ca4)
You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.
Any revisions marked "omit" are not gone; other references still
refer to them. Any revisions marked "discard" are gone forever.
No new revisions were added by this update.
Summary of changes: