dmkozh commented on pull request #13739: URL: https://github.com/apache/beam/pull/13739#issuecomment-781761183
IIUC the elements just get pickled, at least for DirectRunner (I don't think this changes between runners, though). I agree that in theory there is no need for more than a few bytes. In practice we have test_stream.WindowedValueHolder, which holds a WindowedValue, which in turn holds a list of windows (I suppose it can have multiple elements), a timestamp (which may have a different type), and pane info. Most of the space is taken by the type and field names. I was able to cut some of this, but still got 400+ bytes per record. Ultimately, I lack the knowledge/docs on the possible values of WindowedValues to optimize this further: I'm not sure how to come up with a schema for all the windowing information.

In the end, I still think it's useful to strip windowing information when it's not needed, even if it were much smaller. Interactive analysis with windowing seems a bit exotic outside of the tutorials, and any user who doesn't need windowing could gain a significant performance boost.
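To make the size argument concrete, here is a minimal sketch that compares the pickled size of a bare value with the same value wrapped in windowing metadata. It uses only the standard library. The nested dict is a hypothetical stand-in whose keys mimic field names; it is not Beam's actual WindowedValueHolder or WindowedValue layout, and `wrap_with_windowing`, `strip_windowing`, and `pickled_size` are illustrative names, so the byte counts it prints will not match the 400+ figure above.

```python
import pickle


def wrap_with_windowing(value):
    """Build a hypothetical windowed record (NOT Beam's real classes).

    The dict keys play the role of the type/field names that get
    serialized along with every record.
    """
    return {
        "windowed_value": {
            "value": value,
            "timestamp_micros": 0,
            "windows": [{"start_micros": 0, "end_micros": 60_000_000}],
            "pane_info": {
                "is_first": True,
                "is_last": True,
                "timing": "UNKNOWN",
                "index": 0,
                "nonspeculative_index": 0,
            },
        }
    }


def strip_windowing(record):
    """Drop the windowing metadata and keep only the payload."""
    return record["windowed_value"]["value"]


def pickled_size(obj):
    """Size in bytes of one object pickled on its own."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


values = list(range(1000))

# Each record is pickled separately, so the field names are repeated per record.
wrapped_total = sum(pickled_size(wrap_with_windowing(v)) for v in values)
bare_total = sum(
    pickled_size(strip_windowing(wrap_with_windowing(v))) for v in values
)

print(f"with windowing: {wrapped_total / len(values):.0f} bytes/record")
print(f"value only:     {bare_total / len(values):.0f} bytes/record")
```

Because each record is pickled independently, the metadata keys are paid for on every record, which is the overhead that stripping the windowing information would avoid.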
