zentol commented on PR #1:
URL: 
https://github.com/apache/flink-connector-mongodb/pull/1#issuecomment-1422714527

   I pushed a few commits to get the PR into a mergable state.
   
   1) I removed the mockito dependency from the sample splitter test. Mockito 
usage is heavily discouraged (see Flink code contribution guide), and instead I 
slightly refactored the code to make this work.
   2) Splitter "implementations" are now just static methods since singletons 
don't provide a benefit.
   3) TestLoggerExtension service entry was removed, because it is implicitly 
activated by pulling in flink-test-utils-junit.
   
   4) Refactored the MongoSampleSplitterTest and fixed some issues with the 
sampling.
   
   The current test was too much of a re-implementation of the internal logic, 
which was highlighted by it not detecting problems in how the sampling and 
merging was done.
   
   The current code used `N = numPartitions * numSamplesPerPartition` samples. 
Seems correct on first glance, but these `N` samples actually split the input 
into `N + 1` partitions.
   
   ```
   Samples:           0 1 2 3   (4)
   Implicit bounds: -         + (min/max)
   Partitions:      |-|-|-|-|-| (5)
   ```
   
   This resulted in an uneven distribution of data across splits.
   
   For example, if we want to create 2 partitions, with 2 samples per 
partition, then we'd like to get this:
   
   (- == Min ; + == Max ; x == Sample #x)
   
   ```
   Samples:    - 0 1 2 +
   Partitions: |---|---|
   ```
   
   Meanwhile, the current code did this:
   ```
   
   Samples:    - 1 2 +
   Partitions: |-|---|
   ```
   
   The replacement of the first/last sampling with Min/Max was removed because 
it made it very difficult to reason about the code, as we actually changed the 
number of partitions.
   
   Let's say we want 2 partitions with 2 samples each => 4 samples.
   
   Before replacement:
   
   ```
   Samples:           0 1 2 3   (4)
   Implicit bounds: -         + (min/max)
   Partitions:      |-|-|-|-|-| (5)
   ```
   
   
   After replacement:
   ```
   
   Samples:             1 2     (2)
   Implicit bounds: -         + (min/max)
   Partitions:      |---|-|---| (3)
   ```
   
   
   Instead we now ask for N-1 samples, and use min/max as additional points:
   
   ```
   Samples:           0 1 2   (3)
   Implicit bounds: -         + (min/max)
   Partitions:      |-|-|-|-| (4)
   Splits:          |---|---|
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to