Github user superbobry commented on the issue:
https://github.com/apache/spark/pull/19369
I've looked at the allocation profile of the sample; it does indeed contain
strings from other sources, so the heap histogram is not representative.
However, the allocations from `BlockManagerMasterEndpoint` are gone.
The test failures are due to an invalid test, "deterministic node selection",
in `BlockManagerReplicationBehavior`. The test checks that any
`BlockReplicationPolicy` is deterministic and strictly monotonic, meaning that
the locations chosen for 4x replication of some block are a strict superset of
the 3x and 2x locations. This holds for neither
`BasicBlockReplicationPolicy` nor
`RandomBlockReplicationPolicy`. Both call `getSampleIds`, and for the
monotonicity to hold given a fixed random seed `r`, the result of `getSampleIds(n,
m, r)` would have to be a subset of `getSampleIds(n, m + 1, r)`. This would be true if
the method implemented a partial Knuth shuffle, which performs the same sequence of
`m` `nextInt` calls in both cases (sketched below). For Floyd's algorithm, however, this is false,
because the sequence of calls to `nextInt` depends on `m`. Specifically, for
`getSampleIds(n, m, r)` it would start with
```
r.nextInt(n - m + 1)
...
```
while for `getSampleIds(n, m + 1, r)` the first call would be
```
r.nextInt(n - (m + 1) + 1)
...
```
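To make the difference concrete, here is a minimal sketch of both sampling strategies (hypothetical helper names, simplified, and not the actual Spark code): Floyd's algorithm starts its loop at `n - m`, so the argument of the very first `nextInt` call changes with `m`, while a partial Knuth (Fisher-Yates) shuffle draws `nextInt(n)`, `nextInt(n - 1)`, ... regardless of how many ids are eventually taken.
```scala
import scala.collection.mutable
import scala.util.Random

// Sketch of Floyd's sampling (0-based indices): the loop starts at n - m,
// so the very first draw is r.nextInt(n - m + 1) and shifts when m changes.
def floydSample(n: Int, m: Int, r: Random): Set[Int] = {
  val sample = mutable.Set.empty[Int]
  for (j <- n - m until n) {
    val t = r.nextInt(j + 1)             // first iteration: r.nextInt(n - m + 1)
    sample += (if (sample.contains(t)) j else t)
  }
  sample.toSet
}

// Sketch of a partial Knuth (Fisher-Yates) shuffle: the i-th draw is always
// r.nextInt(n - i), independent of m, so for a fixed seed the first m picks
// are a prefix of the first m + 1 picks.
def partialKnuthSample(n: Int, m: Int, r: Random): List[Int] = {
  val ids = Array.range(0, n)
  for (i <- 0 until m) {
    val j = i + r.nextInt(n - i)         // same call sequence for any m > i
    val tmp = ids(i); ids(i) = ids(j); ids(j) = tmp
  }
  ids.take(m).toList
}
```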
Magically, hashing `"test_" + id` happened to produce a case where the
monotonicity property was satisfied. The switch to the auto-generated `hashCode`
has broken the magic.
I replaced the implementation of `getSampleIds` with
```scala
private def getSampleIds(n: Int, m: Int, r: Random): List[Int] = {
  // Shuffle the full id range once and take a prefix: with the same seed,
  // the m-element sample is always a prefix of the (m + 1)-element sample.
  r.shuffle(List.range(0, n)).take(m)
}
```
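With the shuffle-based version, the same seed yields the same permutation, so the smaller sample is a prefix (and hence a subset) of the larger one. A hypothetical check with an arbitrary seed and sizes (illustrative only, not part of the PR, and ignoring that the method is `private`) would be:
```scala
import scala.util.Random

val seed = 42L  // arbitrary seed for illustration
// Two Random instances with the same seed produce the same shuffle.
val smaller = getSampleIds(10, 2, new Random(seed)).toSet
val larger  = getSampleIds(10, 3, new Random(seed)).toSet
assert(smaller.subsetOf(larger))  // the 2-element sample is a prefix of the 3-element one
```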