Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21118
  
    @cloud-fan, let me clarify what I'm getting at here.
    
    It appears that Spark makes at least one copy of the data to unsafe when 
reading any Parquet row. If the projection includes partition columns, then 
Spark makes a second copy to unsafe. Making two copies of every row read from 
Parquet is therefore fairly common, even when the plan doesn't need the data to 
be unsafe.
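
    To make the double copy concrete, here is a minimal sketch of the pattern I 
mean. It is not the actual Parquet read path (the helper name and signature are 
made up), but it shows one unsafe projection for the file data followed by a 
second unsafe projection when partition columns are appended:

    ```scala
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper, not the real reader code: appends partition values
    // to every row read from a file.
    def appendPartitionColumns(
        rows: Iterator[InternalRow],
        dataSchema: StructType,
        partitionSchema: StructType,
        partitionValues: InternalRow): Iterator[InternalRow] = {
      // Copy #1: each row from the file is projected to unsafe.
      val toUnsafe = UnsafeProjection.create(dataSchema)
      // Copy #2: appending partition columns projects the joined row to unsafe again.
      val appendPartitions = UnsafeProjection.create(
        StructType(dataSchema.fields ++ partitionSchema.fields))
      val joined = new JoinedRow()
      rows.map { row =>
        appendPartitions(joined(toUnsafe(row), partitionValues))
      }
    }
    ```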
    
    Most of the operators I've been looking at -- including codegen operators -- 
support `InternalRow`. If we can get rid of **two copies of every row**, then we 
should look into it. It is even more important if it also means fewer headaches 
for implementers of the V2 API. If it doesn't matter to Spark internals whether 
an implementation produces `InternalRow` or `UnsafeRow`, then we don't need an 
extra trait, `SupportsScanUnsafeRow`, or any code in Spark to handle it. Simpler 
APIs are a good thing.
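
    To illustrate why that simplifies the API, here is a hypothetical sketch of 
a row-by-row reader that emits plain `InternalRow`. The class and method names 
are invented and only mirror the shape of a v2 reader, not the real interfaces:

    ```scala
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    import org.apache.spark.unsafe.types.UTF8String

    // Hypothetical reader shape (names invented), emitting InternalRow directly.
    class SimpleRowReader(records: Iterator[(Int, String)]) {
      private var current: InternalRow = _

      def next(): Boolean = {
        if (records.hasNext) {
          val (id, name) = records.next()
          // No UnsafeProjection and no extra trait are needed if Spark accepts
          // InternalRow from sources.
          current = new GenericInternalRow(Array[Any](id, UTF8String.fromString(name)))
          true
        } else {
          false
        }
      }

      def get(): InternalRow = current
    }
    ```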
    
    What we need to find out is:
    1. Is there a strong expectation (i.e., a design requirement) that data 
sources produce `UnsafeRow`? (This is why I'm asking @marmbrus, but I'm not 
sure who the best person to ask is.)
    2. How many places actually depend on `UnsafeRow` rather than `InternalRow`? 
I've found one and you pointed out another, but we could update the joiner fairly 
easily and can probably update other places, too (see the sketch after this 
list).
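
    For the places that really do need unsafe rows today, one low-risk way to 
update them (a sketch under my assumptions, not a patch for any particular 
operator) is to accept `InternalRow` and copy to unsafe only when the incoming 
row isn't already an `UnsafeRow`:

    ```scala
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
    import org.apache.spark.sql.types.StructType

    // Build the projection once and reuse it; copy only when the consumer
    // really needs an UnsafeRow.
    def toUnsafe(schema: StructType): InternalRow => UnsafeRow = {
      val project = UnsafeProjection.create(schema)
      row => row match {
        case unsafe: UnsafeRow => unsafe  // already unsafe: no extra copy
        case other => project(other)      // generic or joined rows get one copy here
      }
    }
    ```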
    
    This is something we should look into now because it could be a good speed 
improvement for queries that use nested data (and don't use the vectorized read 
path). In addition, it will only get harder to remove needless dependence on 
`UnsafeRow` later.

