robertwb commented on PR #23194: URL: https://github.com/apache/beam/pull/23194#issuecomment-1263914988
> I wonder if we should prefer arrow arrays as the default column type though? That way we can specialize strings as well. I think this is OK as-is though since it's not used anywhere and it's effectively a proof of concept. We can experiment with arrow/strings later.

I think numpy is a lower common denominator than arrow, but this framework certainly lends itself to registering column encoders of various types, including ragged, justified, or packed arrow string types. (Technically this currently works for anything supporting the buffer protocol, https://docs.python.org/3/c-api/buffer.html, though the explicit lookup on the dtype would need to be generalized.)

And, yes, the primary intent is for when batch DoFns are adjacent to cross-SDK boundaries of some type. Currently, I don't think there are savings to be had from decoding a batch and then exploding it versus decoding individual rows (or vice versa), but the former could be a win if we had a row representation that's simply an index into a batch (though that might simply push some of the cost to attribute access).
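
To make the "row as an index into a batch" idea concrete, here is a minimal sketch. It is not code from this PR: `Batch`, `RowView`, and `explode` are hypothetical names, and the batch is assumed to be nothing more than a dict of equal-length numpy columns. Exploding copies no values; each field is read from its column only when the attribute is accessed, which is where the cost moves to.

```python
import numpy as np


class Batch:
    """Hypothetical columnar batch: a dict of equal-length numpy columns."""

    def __init__(self, columns):
        self.columns = {name: np.asarray(col) for name, col in columns.items()}
        lengths = {len(col) for col in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.size = lengths.pop() if lengths else 0

    def explode(self):
        """Yield one lightweight row view per element; no values are copied."""
        return (RowView(self, i) for i in range(self.size))


class RowView:
    """Hypothetical row that stores only (batch, index)."""

    __slots__ = ("_batch", "_index")

    def __init__(self, batch, index):
        self._batch = batch
        self._index = index

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for column names.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            # The per-field cost is paid here, at attribute access time.
            return self._batch.columns[name][self._index]
        except KeyError:
            raise AttributeError(name) from None


batch = Batch({"id": [1, 2, 3], "score": [0.5, 0.25, 1.0]})
rows = list(batch.explode())
total = sum(row.score for row in rows if row.id != 2)
```

Whether this wins depends on how many fields a DoFn actually touches per row: a row-wise DoFn that reads every attribute pays a dict lookup plus an array index per field, which may cost as much as decoding the row eagerly.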
