bersprockets opened a new pull request #35090:
URL: https://github.com/apache/spark/pull/35090


   ### What changes were proposed in this pull request?
   
   When deserializing an ORC struct, reuse the result row when possible.
   
   When the struct's parent is itself a struct, it is safe to reuse the result 
row, since some code (an ancestor of the parent, or some downstream processing) 
has already copied the parent's result row, and therefore our result row.
   
   When the parent is a map or array, it is not safe to reuse the result row, 
because each element of the parent will point to the same result row. In this 
case, we must copy the result row before updating the parent.
   
   Note that it is safe to reuse the result row even when the parent is a 
struct but the grandparent is a map or array, because the parent will copy its 
result row for each element of the grandparent (and thus copy our result row).
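   
   The three cases above can be illustrated with a minimal sketch. This is 
not the code in this PR: `StructDeserializer`, `deserialize_array`, and the use 
of plain Python lists as mutable "result rows" are all hypothetical stand-ins, 
and `copy.deepcopy` stands in for a row copy that also copies nested rows.

```python
import copy


class StructDeserializer:
    """Hypothetical stand-in: turns a raw struct (tuple) into a mutable row (list)."""

    def __init__(self, num_fields):
        # The result row is allocated once and overwritten on every call.
        self.result_row = [None] * num_fields

    def deserialize(self, raw):
        for i, value in enumerate(raw):
            self.result_row[i] = value
        return self.result_row  # the same list object every time


def deserialize_array(raw_elements, element_deserializer, copy_elements):
    """Parent is an array: every element needs its own row."""
    out = []
    for raw in raw_elements:
        row = element_deserializer.deserialize(raw)
        # Copy before updating the parent, unless the caller opts out.
        out.append(list(row) if copy_elements else row)
    return out


raw = [(1, "a"), (2, "b")]

# Unsafe: both array elements alias the single reused row.
aliased = deserialize_array(raw, StructDeserializer(2), copy_elements=False)

# Safe: the row is copied before it is stored in the parent array.
copied = deserialize_array(raw, StructDeserializer(2), copy_elements=True)

# Parent is a struct, grandparent is an array: the child row is reused
# without a copy, and the parent's per-element copy covers the child too.
child = StructDeserializer(2)
parent_row = [None]
nested = []
for raw_parent in [((1, "a"),), ((2, "b"),)]:
    parent_row[0] = child.deserialize(raw_parent[0])  # no copy here
    nested.append(copy.deepcopy(parent_row))  # parent copies itself and the child
```

   In this sketch `aliased` ends up holding the last element twice, while 
`copied` and `nested` keep each element's values distinct.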
   
   ### Why are the changes needed?
   This change speeds up deserialization of records containing ORC structs.
   
   | struct size | master | pr    | improvement |
   | ----------- | ------ | ----- | ----------- |
   | 10 fields   | 1463   | 296   | 4.94 times  |
   | 100 fields  | 11818  | 2391  | 4.94 times  |
   | 300 fields  | 38938  | 8061  | 4.8 times   |
   | 600 fields  | 86735  | 30505 | 2.8 times   |
   
   The benchmark results for the PR are included in the PR. The benchmark 
results for master (with the new benchmark added) are 
[here](https://issues.apache.org/jira/secure/attachment/13038214/master_results.txt).
   
   
   
   ### Does this PR introduce _any_ user-facing change?
    No.
   
   
   ### How was this patch tested?
   
   - New unit tests to check correctness (since sometimes the result row is 
reused, and sometimes it is not).
   - New benchmarks.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


