Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7480#discussion_r35150847
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 ---
    @@ -148,3 +148,81 @@ private[joins] object HashedRelation {
         }
       }
     }
    +
    +
    +/**
    + * A HashedRelation for UnsafeRow, which is backed by BytesToBytesMap that 
maps the key into a
    + * sequence of values.
    + *
    + * TODO(davies): use BytesToBytesMap
    + */
    +private[joins] final class UnsafeHashedRelation(
    +    private var hashTable: JavaHashMap[UnsafeRow, 
CompactBuffer[UnsafeRow]])
    +  extends HashedRelation with Externalizable {
    +
    +  def this() = this(null)  // Needed for serialization
    +
    +  override def get(key: InternalRow): CompactBuffer[InternalRow] = {
    +    val unsafeKey = key.asInstanceOf[UnsafeRow]
    +    // Thanks to type eraser
    --- End diff --
    
    Not to be pedantic, but this is more of a variance issues than an erasure 
issue, right?  I think the problem is that Scala doesn't know that 
`CompactBuffer[UnsafeRow]` is a subtype of `CompactBuffer[InternalRow]`.  Since 
`CompactBuffer` is mutable, I don't think that it's [technically 
safe](https://en.wikipedia.org/wiki/Covariance_and_contravariance_(computer_science)#Arrays)
 to have `CompactBuffer` be non-invariant.
    
    Given our use, though, I think it's totally safe to do the cast here 
because we know that the caller will never try to mutate the returned buffer.
    
    To avoid this cast, one trick that we can use is to declare our `hashTable` 
as `hashTable: JavaHashMap[UnsafeRow, _ <: Seq[UnsafeRow]]`, update 
`HashedRelation.get()` to return a `Seq[InternalRow]`, and update 
`HashJoin.hashJoin()` to declare `currentHashMatches` as `Seq[InternalRow]`.  
This hides the mutability of the buffers from clients of `HashedRelation` and 
allows us to work around the variance issue.
    
    This adds a small amount of complexity, though, and may not be worth 
fixing, but I just wanted to point out the option.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to