xy720 opened a new issue #3877: URL: https://github.com/apache/incubator-doris/issues/3877
Currently, broker load supports reading Parquet files from remote storage, and soon we will use Parquet as the intermediate output format in spark load. However, because a Parquet file keeps its metadata in separate places (file metadata, column metadata, page headers, ...), the broker reader has to seek frequently to read the data, which results in a large number of RPCs. In a cross-data-center scenario, that many RPCs incur a huge network cost. The table below shows the large gap in load time.

|Cross data center|RPC count|Load time|Data size|
|----|----|----|----|
|No|15014|60 s|560 MB|
|Yes|16817|2 h|560 MB|
|No|169766|8 min|5.8 GB|
|Yes|150476|14 h|5.8 GB|
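To make the gap concrete, the measurements above can be turned into a rough average load time per RPC. This is only a back-of-the-envelope sketch: the names `MEASUREMENTS`, `ms_per_rpc`, and `slowdown` are made up for illustration, and it assumes that load time is dominated by RPC round trips, so the result is an upper bound on the per-RPC cost rather than a measured latency.

```python
# Rows copied from the table in this issue:
# (cross_center, rpc_count, load_seconds, data_size_label)
MEASUREMENTS = [
    (False, 15014, 60, "560m"),
    (True, 16817, 2 * 3600, "560m"),
    (False, 169766, 8 * 60, "5.8G"),
    (True, 150476, 14 * 3600, "5.8G"),
]


def ms_per_rpc(rpc_count, load_seconds):
    """Average load time per RPC in milliseconds (upper bound on RPC cost)."""
    return load_seconds * 1000.0 / rpc_count


def slowdown(size_label):
    """Ratio of cross-center load time to same-center load time for one data size."""
    times = {cross: secs for cross, _, secs, label in MEASUREMENTS
             if label == size_label}
    return times[True] / times[False]


if __name__ == "__main__":
    for cross, rpcs, secs, label in MEASUREMENTS:
        where = "cross-center" if cross else "same-center"
        print(f"{label:>5} {where:<12} {ms_per_rpc(rpcs, secs):8.1f} ms per RPC")
    for label in ("560m", "5.8G"):
        print(f"{label:>5} slowdown: {slowdown(label):.0f}x")
```

Under that assumption, the RPC count is about the same in both settings, but each RPC goes from a few milliseconds of load time within one data center to several hundred milliseconds across data centers, which is where the roughly 100x difference in load time comes from.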
