While moving the new data source API to InternalRow, I noticed a few odd
things:

   - Spark scans always produce UnsafeRow, but that data is passed around
   as InternalRow with explicit casts.
   - Operators expect InternalRow and nearly all codegen works with
   InternalRow (I’ve tested this with quite a few queries).
   - Spark uses unchecked casts from InternalRow to UnsafeRow in places
   <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L254>,
   assuming that data will be unsafe, even though that isn’t what the type
   system guarantees (a rough sketch of the pattern follows this list).

To me, it looks like the idea was to code SQL operators against the abstract
InternalRow so the implementation could be swapped out, but we ended up with
a general assumption that rows will always be unsafe. This is the worst of
both options: we can’t actually rely on everything working with InternalRow,
but code must still use it until that becomes inconvenient and an unchecked
cast gets inserted.

The main question I want to answer is this: *what data format should SQL
use internally?* What was the intent when building catalyst?

The v2 data source API depends on the answer, but I also found that the
unsafe assumption introduces a significant performance penalty in Parquet
(and probably other formats). A quick check on one of our tables showed a 6%
performance hit caused by unnecessary copies from InternalRow to UnsafeRow.
So if we can guarantee that all operators support InternalRow, then there is
an easy performance win that also simplifies the v2 data source API.
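
For illustration, this is roughly where the extra copy shows up. The names
here (schema, toUnsafe) are hypothetical and the real code path is the
Parquet read, but the shape is the same: rows that already satisfy
InternalRow get run through an UnsafeProjection just to meet the unsafe
assumption.

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
    import org.apache.spark.sql.types._

    // Hypothetical schema, only to show where the per-row copy happens.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)))

    def toUnsafe(rows: Iterator[InternalRow]): Iterator[InternalRow] = {
      val projectToUnsafe = UnsafeProjection.create(schema)
      // each row is copied field-by-field into unsafe format, even when the
      // downstream operator only needs the InternalRow interface
      rows.map(projectToUnsafe)
    }

If operators are guaranteed to handle InternalRow, that projection can be
dropped, which is the win measured above.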

rb
-- 
Ryan Blue
Software Engineer
Netflix
