Github user SlavikBaranov commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6763#discussion_r32276177
  
    --- Diff: 
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala ---
    @@ -278,7 +279,7 @@ object OpenHashSet {
     
       val INVALID_POS = -1
       val NONEXISTENCE_MASK = 0x80000000
    -  val POSITION_MASK = 0xEFFFFFF
    +  val POSITION_MASK = 0x1FFFFFFF
    --- End diff ---
    
    It's easy to make it support 2^30 capacity, but supporting 2^31 would 
require some hacks. In JDK8 the maximum array size is 2^31 - 1, so we'd 
need to store the item with hashCode 2^31 - 1 somewhere else, and that 
additional check would probably affect performance.
    As I remember, in JDK6 the max array size is either 2^31 - 4 or 
2^31 - 5, so supporting JDK6 would require some additional work as well.
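    To make the limits concrete, here's a minimal standalone Scala sketch 
(not the OpenHashSet code) of why a power-of-two Int capacity stops 
working at 2^31:

        object CapacityLimits {
          def main(args: Array[String]): Unit = {
            val positionMask = 0x1FFFFFFF  // the fixed mask: 2^29 - 1
            println(positionMask + 1)      // 536870912  = 2^29, max capacity after this fix
            println(1 << 30)               // 1073741824 = 2^30, still a legal array length
            println(1 << 31)               // overflows to Int.MinValue (-2147483648)
            println(Int.MaxValue)          // 2147483647 = 2^31 - 1, the JDK8 array-length cap
          }
        }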
    
    I see the following possibilities:
     1. Leave the fix as is.
     2. Update it to support capacity 2^30 (see the sketch right after 
this list).
     3. Make it support 2^31 with some hacks.
     4. Make it support an even larger capacity by splitting value storage 
into several arrays (sketched at the end of this comment).
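    For option 2, the change would amount to roughly a one-bit-wider mask 
(the exact constant below is my assumption, not code from the PR):

        val POSITION_MASK = 0x3FFFFFFF  // 2^30 - 1: positions use bits 0-29
        // NONEXISTENCE_MASK = 0x80000000 only needs bit 31, so the masks
        // stay disjoint and max capacity becomes 2^30 = 1073741824 (~1B).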
    
    IMO, the second option is the most reasonable, since a 1B max capacity 
is definitely better than 500M. :)
    On the other hand, options 3 & 4 look like overkill: due to the 
distributed nature of Spark, it's usually not necessary to collect more 
than a billion items on a single machine, even when working with 
multi-billion-item datasets.
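    For completeness, option 4 would look something like the chunked array 
below. This is a hypothetical sketch with invented names, not a proposed 
implementation:

        // Split value storage across several Int-indexed arrays so that
        // logical indices beyond 2^31 - 1 become addressable via a Long.
        class ChunkedLongArray(chunkSizeBits: Int, numChunks: Int) {
          private val chunkSize = 1 << chunkSizeBits
          private val chunks = Array.fill(numChunks)(new Array[Long](chunkSize))

          // High bits of the Long index pick the chunk, low bits the slot.
          def apply(i: Long): Long =
            chunks((i >> chunkSizeBits).toInt)((i & (chunkSize - 1)).toInt)

          def update(i: Long, v: Long): Unit =
            chunks((i >> chunkSizeBits).toInt)((i & (chunkSize - 1)).toInt) = v
        }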


