[GitHub] [incubator-doris] xy720 opened a new issue #3877: [optimize] Optimize spark load/broker load reading parquet format file

GitBox Mon, 15 Jun 2020 10:59:16 -0700


xy720 opened a new issue #3877:
URL: https://github.com/apache/incubator-doris/issues/3877



   Currently, broker load supports reading parquet files from remote storage, and soon we 
will use the parquet format as the intermediate output in spark load.
   
   But because the metadata of a parquet file (file meta/column meta/page header...) 
is stored in separate places, the broker reader needs to seek frequently to get data, 
which leads to a lot of RPCs. A large number of RPCs leads to huge network 
costs in the cross-data-center scenario.
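   
   To illustrate why many small seeks turn into many RPCs, here is a minimal sketch in Python. It is not Doris or broker code: `CountingRemoteFile`, `ReadAheadReader`, `scan_pages`, the 8 KiB page size and the 32-byte header size are all hypothetical and chosen only for illustration. It counts one RPC per remote read and compares reading each page header and page body separately with reading through a simple read-ahead buffer.

```python
class CountingRemoteFile:
    """Hypothetical stand-in for a remote file: every read() counts as one RPC."""

    def __init__(self, size):
        self.size = size
        self.rpc_count = 0

    def read(self, offset, length):
        self.rpc_count += 1
        end = min(offset + length, self.size)
        return bytes(max(0, end - offset))  # zero-filled placeholder payload


class ReadAheadReader:
    """Hypothetical wrapper that fetches at least buffer_size bytes per RPC."""

    def __init__(self, remote, buffer_size):
        self.remote = remote
        self.buffer_size = buffer_size
        self.buf_start = 0
        self.buf = b""

    def read(self, offset, length):
        buf_end = self.buf_start + len(self.buf)
        # Refill only when the requested range is not fully buffered.
        if offset < self.buf_start or offset + length > buf_end:
            self.buf_start = offset
            self.buf = self.remote.read(offset, max(length, self.buffer_size))
        rel = offset - self.buf_start
        return self.buf[rel:rel + length]


def scan_pages(reader, file_size, page_size=8192, header_size=32):
    """Read a small header and then the body of every page (assumed layout)."""
    for page_start in range(0, file_size, page_size):
        reader.read(page_start, header_size)
        reader.read(page_start + header_size, page_size - header_size)


FILE_SIZE = 1 << 20  # 1 MiB, illustrative only

direct = CountingRemoteFile(FILE_SIZE)
scan_pages(direct, FILE_SIZE)

buffered_remote = CountingRemoteFile(FILE_SIZE)
scan_pages(ReadAheadReader(buffered_remote, 64 * 1024), FILE_SIZE)

print("RPCs without buffering:", direct.rpc_count)
print("RPCs with 64 KiB read-ahead:", buffered_remote.rpc_count)
```

   In this toy layout every page costs two RPCs when read directly, while the read-ahead wrapper pays one RPC per 64 KiB window. Real parquet access patterns are less regular, so the numbers only show the direction of the effect.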
   
   The table below shows the large gap in load time between the two cases.
   
   |cross-center|RPC count|load time|data size|
   |----|----|----|----|
   |No|15014|60s|560m|
   |Yes|16817|2h|560m|
   |No|169766|8min|5.8G|
   |Yes|150476|14h|5.8G|
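   
   As a rough reading of the table, the load time can be divided by the RPC count to get an average wall-clock cost per RPC. The sketch below does this in Python; the helper name `ms_per_rpc` is hypothetical. Since the load time also includes work other than RPCs, the result is only an upper bound on the average RPC latency.

```python
# Rows copied from the table: (cross_center, rpc_count, load_seconds, data_size)
rows = [
    (False, 15014, 60, "560m"),
    (True, 16817, 2 * 3600, "560m"),
    (False, 169766, 8 * 60, "5.8G"),
    (True, 150476, 14 * 3600, "5.8G"),
]


def ms_per_rpc(rpc_count, load_seconds):
    """Average load time per RPC in milliseconds (upper bound on RPC latency)."""
    return load_seconds * 1000.0 / rpc_count


for cross_center, rpc_count, load_seconds, data_size in rows:
    label = "cross-center" if cross_center else "same-center"
    print(f"{data_size:>5} {label:<12} {ms_per_rpc(rpc_count, load_seconds):8.1f} ms/RPC")
```

   The RPC counts are similar with and without cross-center access, so the gap comes from the cost of each RPC: a few milliseconds in the same data center versus several hundred milliseconds across data centers.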
   
   


