[GitHub] [beam] robertwb commented on pull request #23194: Batch encoding and decoding of schema data.

GitBox Fri, 30 Sep 2022 12:01:30 -0700


robertwb commented on PR #23194:
URL: https://github.com/apache/beam/pull/23194#issuecomment-1263914988


   > I wonder if we should prefer arrow arrays as the default column type 
though? That way we can specialize strings as well. I think this is OK as-is 
though since it's not used anywhere and it's effectively a proof of concept. We 
can experiment with arrow/strings later.
   
   I think numpy is a lower common denominator than arrow, but this framework 
certainly lends itself to registration of column encoders of various types, 
including ragged, justified, or packed arrow string types. (Technically this 
currently works for anything supporting the buffer protocol, 
https://docs.python.org/3/c-api/buffer.html , though the explicit lookup on the 
dtype would need to be generalized.)
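
To illustrate the registration idea, here is a minimal sketch of a dtype-keyed registry of column encoders. It is not the code in this PR: `register_column_encoder`, `encode_column`, and `decode_column` are hypothetical names, and the fixed-width encoder just dumps the raw buffer in native byte order.

```python
import numpy as np

# Hypothetical registry (not a Beam API): dtype -> (encode, decode).
_COLUMN_ENCODERS = {}


def register_column_encoder(dtype, encode, decode):
    """Associate an encode/decode pair with a numpy dtype."""
    _COLUMN_ENCODERS[np.dtype(dtype)] = (encode, decode)


def encode_column(column):
    """Encode a whole column at once using the encoder for its dtype."""
    arr = np.asarray(column)
    encode, _ = _COLUMN_ENCODERS[arr.dtype]
    return encode(arr)


def decode_column(dtype, data):
    """Decode bytes back into a column of the given dtype."""
    _, decode = _COLUMN_ENCODERS[np.dtype(dtype)]
    return decode(data)


def _encode_fixed_width(arr):
    # memoryview() works on any object exposing the buffer protocol,
    # which is why this path is not tied to numpy specifically.
    return memoryview(np.ascontiguousarray(arr)).tobytes()


def _make_fixed_width_decoder(dtype):
    # np.frombuffer returns a read-only view over the bytes (no copy).
    return lambda data: np.frombuffer(data, dtype=dtype)


for _dtype in (np.int64, np.float64):
    register_column_encoder(
        _dtype, _encode_fixed_width, _make_fixed_width_decoder(_dtype))
```

An arrow string encoder would presumably be registered the same way under its own key, which is where the dtype lookup would have to be generalized.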
   
   And, yes, the primary intent is when batch DoFns are adjacent to cross-SDK 
boundaries of some type. 
   
   Currently, I don't think decoding a batch and then exploding it saves 
anything over decoding individual rows (or vice versa), but the former could be 
a win if we had a row representation that's simply an index into a batch 
(though that might simply push some of the cost to attribute access).
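
A rough sketch of what "a row that's simply an index into a batch" could look like. `BatchRowView` and `explode` are hypothetical names, and the batch is assumed to be a plain dict of equal-length numpy columns rather than whatever representation Beam would actually use.

```python
import numpy as np


class BatchRowView:
    """A row that stores only a reference to the batch and a position."""

    __slots__ = ("_columns", "_index")

    def __init__(self, columns, index):
        self._columns = columns
        self._index = index

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for field names.
        # Private names are rejected to avoid recursion if a slot is unset.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            # The per-row cost moves here: one dict lookup plus one
            # array index on every attribute access.
            return self._columns[name][self._index]
        except KeyError:
            raise AttributeError(name) from None


def explode(columns):
    """Yield one lightweight row view per element of the batch."""
    n = len(next(iter(columns.values()), ()))
    return (BatchRowView(columns, i) for i in range(n))
```

Exploding then allocates only a small object per row and copies no values, at the price of the lookup in `__getattr__` each time a field is read.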


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
