kaori-seasons opened a new issue, #1919: URL: https://github.com/apache/fury/issues/1919
### Feature Request In the python benchmark test, the time-consuming benchmark of each serialization is as follows:  This is the code location where the performance is relatively high when we use pyflink, which is mainly consumed in pickle encoding and decoding.   At present, our company's stock of historical data is 13 million, and each message is between 60kb and 75kb. After discussions with the pyflink community, it is recommended to use pemja for cross-language calls without using beam. In this way, python's native pickle serialization is very slow 测试方法: - 以天(一个点位每秒一条数据,一天共86400条)为单位,进行不同的数据量测试 - 分别测试 3 个算子、5 个算子和 10 个算子情况下的性能 - 对比都带有 output_type 和不都带 output_type 参数的性能  - 每个算子参数带有 output_type 参数的测试代码:  - 每个算子都不带 output_type 参数的测试代码:  其中 output_type 定义了传输数据每个字段类型,定义方式如下图:  a) 测试三个算子 时长 | 带有 output_type 耗时(秒) | 没有 output_type耗时(秒) | 提升效率 -- | -- | -- | -- 1 天 | 9.456 | 9.973 | 5.18% 3 天 | 14.532 | 18.187 | 20.10% 5 天 | 28.911 | 38.786 | 25.46% 7 天 | 34.397 | 51.691 | 33.46% b) 测试5个算子 时长 | 带有 output_type 耗时(秒) | 没有 output_type耗时(秒) | 提升效率 -- | -- | -- | -- 1 天 | 9.971 | 10.401 | 4.13% 3 天 | 20.308 | 25.744 | 21.12% 5 天 | 30.166 | 40.305 | 25.16% 7 天 | 40.340 | 54.405 | 25.85% c) 测试10个算子 时长 | 带有 output_type 耗时(秒) | 没有 output_type耗时(秒) | 提升效率 -- | -- | -- | -- 1 天 | 11.468 | 12.130 | 5.45% 3 天 | 23.697 | 31.121 | 23.85% 5 天 | 38.015 | 49.508 | 23.21% 7 天 | 48.859 | 65.140 | 24.99% Judging from the test results, explicitly specifying the output_type parameter in PyFlink DataStream can significantly improve serialization performance, especially when the amount of data is large and there are many operators, the improvement effect is more obvious. Using output_type can reduce the overhead of serialization and deserialization, reduce the calculation of type inference, and thus improve performance. But now, obviously we hope that fury can improve this situation. Does @chaokunyang have any good suggestions? ### Is your feature request related to a problem? Please describe _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
