zentol commented on PR #1: URL: https://github.com/apache/flink-connector-mongodb/pull/1#issuecomment-1422714527
I pushed a few commits to get the PR into a mergable state. 1) I removed the mockito dependency from the sample splitter test. Mockito usage is heavily discouraged (see Flink code contribution guide), and instead I slightly refactored the code to make this work. 2) Splitter "implementations" are now just static methods since singletons don't provide a benefit. 3) TestLoggerExtension service entry was removed, because it is implicitly activated by pulling in flink-test-utils-junit. 4) Refactored the MongoSampleSplitterTest and fixed some issues with the sampling. The current test was too much of a re-implementation of the internal logic, which was highlighted by it not detecting problems in how the sampling and merging was done. The current code used `N = numPartitions * numSamplesPerPartition` samples. Seems correct on first glance, but these `N` samples actually split the input into `N + 1` partitions. ``` Samples: 0 1 2 3 (4) Implicit bounds: - + (min/max) Partitions: |-|-|-|-|-| (5) ``` This resulted in an uneven distribution of data across splits. For example, if we want to create 2 partitions, with 2 samples per partition, then we'd like to get this: (- == Min ; + == Max ; x == Sample #x) ``` Samples: - 0 1 2 + Partitions: |---|---| ``` Meanwhile, the current code did this: ``` Samples: - 1 2 + Partitions: |-|---| ``` The replacement of the first/last sampling with Min/Max was removed because it made it very difficult to reason about the code, as we actually changed the number of partitions. Let's say we want 2 partitions with 2 samples each => 4 samples. Before replacement: ``` Samples: 0 1 2 3 (4) Implicit bounds: - + (min/max) Partitions: |-|-|-|-|-| (5) ``` After replacement: ``` Samples: 1 2 (2) Implicit bounds: - + (min/max) Partitions: |---|-|---| (3) ``` Instead we now ask for N-1 samples, and use min/max as additional points: ``` Samples: 0 1 2 (3) Implicit bounds: - + (min/max) Partitions: |-|-|-|-| (4) Splits: |---|---| ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
