GitHub user SlavikBaranov commented on a diff in the pull request:
https://github.com/apache/spark/pull/6763#discussion_r32276177
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
@@ -278,7 +279,7 @@ object OpenHashSet {
val INVALID_POS = -1
val NONEXISTENCE_MASK = 0x80000000
- val POSITION_MASK = 0xEFFFFFF
+ val POSITION_MASK = 0x1FFFFFFF
--- End diff ---
It's easy to make it support 2^30 capacity, but supporting 2^31 would
require some hacks. In JDK8 the maximum array size is 2^31 - 1, so we'd need
to store the item with hashCode 2^31 - 1 somewhere else, which would require
an additional check and probably affect performance.
As far as I remember, in JDK6 the maximum array size is either 2^31 - 4 or
2^31 - 5, so JDK6 support would require some additional work.
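For context on why the exact mask value matters: the position and the
NONEXISTENCE flag are packed into a single Int, and the old constant
0xEFFFFFF is missing bit 24, so any position with that bit set (which starts
happening once the table grows past 2^24 slots, i.e. roughly 12M items at
the default load factor) is silently corrupted. A rough standalone
illustration (not the actual OpenHashSet code, the names below are mine):

    object PositionMaskDemo {
      val NONEXISTENCE_MASK = 0x80000000  // sign bit: "key was not already present"
      val OLD_MASK = 0xEFFFFFF            // bit 24 is missing
      val NEW_MASK = 0x1FFFFFFF           // bits 0..28 set: positions up to 2^29 - 1

      def main(args: Array[String]): Unit = {
        val pos = 1 << 24                      // a slot that only exists once capacity > 2^24
        val packed = pos | NONEXISTENCE_MASK   // how an addWithoutResize-style result is packed
        println(packed & OLD_MASK)             // 0        -- position lost with the old mask
        println(packed & NEW_MASK)             // 16777216 -- recovered with the new mask
      }
    }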
I see the following possibilities:
1. Leave the fix as is
2. Update it to support capacity 2^30
3. Make it support 2^31 with some hacks
4. Make it support an even larger capacity by splitting the value storage
into several arrays.
IMO, the second option is the most reasonable, since a 1B max capacity is
definitely better than 500M (a rough sketch of what that might involve is
below). :)
On the other hand, options 3 & 4 look like overkill: due to the distributed
nature of Spark, it's usually not necessary to collect more than a billion
items on a single machine, even when working with multi-billion-item
datasets.
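For what it's worth, option 2 could be as small as capping the capacity and
widening the mask accordingly. A rough sketch of the constants involved (my
naming, values are an assumption, not a committed patch):

    object Capacity1Shl30Sketch {
      val MAX_CAPACITY      = 1 << 30          // largest power-of-two table an Int-indexed array can hold
      val NONEXISTENCE_MASK = 0x80000000       // still the sign bit
      val POSITION_MASK     = MAX_CAPACITY - 1 // 0x3FFFFFFF: covers positions 0 .. 2^30 - 1

      def requireValidCapacity(initialCapacity: Int): Unit =
        require(initialCapacity <= MAX_CAPACITY,
          s"Can't make capacity bigger than $MAX_CAPACITY elements")
    }

The grow/rehash path would then also need a guard so the table never tries
to double itself past MAX_CAPACITY.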