Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21118

@cloud-fan, let me clarify what I'm getting at here.

It appears that Spark makes at least one copy of data to unsafe when reading any Parquet row. If the projection includes partition columns, then Spark makes a second copy to unsafe. Two copies of every row read from Parquet is a fairly common occurrence, even if the plan doesn't need the data to be unsafe. Most of the operators I've been looking at -- including codegen operators -- support `InternalRow`. If we can get rid of **two copies of every row**, then we should look into it.

It is even more important if it reduces the headache for implementers of the V2 API. If it doesn't matter whether an implementation produces `InternalRow` or `UnsafeRow` to Spark internals, then we don't need an extra trait, `SupportsScanUnsafeRow`, or any code to handle it in Spark. Simpler APIs are a good thing.

What we need to find out is:

1. Is there a strong expectation (i.e., a design requirement) that data sources produce `UnsafeRow`? (This is why I'm asking @marmbrus, but I'm not sure who the best person to ask is.)
2. How many places actually depend on `UnsafeRow` and not `InternalRow`? I've found one and you pointed out another. But we could update the joiner fairly easily and can probably update other places, too.

This is something we should look into now because it has the potential to be a good speed improvement for queries that use nested data (and don't use the vectorized read path). In addition, it will only get harder to remove needless dependence on `UnsafeRow` later.
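For concreteness, here is a minimal Scala sketch of the "second copy" being discussed: rows coming out of a Parquet reader are joined with the constant partition values and then run through an unsafe projection, which materializes another unsafe copy of every row. The helper name `appendPartitionColumns` and its signature are hypothetical and only loosely modeled on Spark's file-source read path; the point is to show where the extra copy happens.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{JoinedRow, UnsafeProjection}
import org.apache.spark.sql.types.StructType

// Hypothetical helper illustrating the second copy: each data row from the
// Parquet reader is joined with the (constant) partition values and then
// projected to unsafe format.
def appendPartitionColumns(
    dataSchema: StructType,
    partitionSchema: StructType,
    rows: Iterator[InternalRow],
    partitionValues: InternalRow): Iterator[InternalRow] = {
  val joined = new JoinedRow()
  // UnsafeProjection.create builds a projection that rewrites the joined row
  // into UnsafeRow -- this is the copy that an InternalRow-based plan would
  // not strictly need.
  val toUnsafe = UnsafeProjection.create(
    StructType(dataSchema.fields ++ partitionSchema.fields))
  rows.map(dataRow => toUnsafe(joined(dataRow, partitionValues)))
}
```

If downstream operators accepted `InternalRow` directly, the `toUnsafe` step here (and the equivalent conversion inside the reader itself) could be dropped, which is the saving the comment is arguing for.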