599166320 commented on issue #15177: URL: https://github.com/apache/druid/issues/15177#issuecomment-1771999731
I reviewed historical issues, and it's possible that #12191, #8959, and this issue fall into the same category. The problem is particularly severe when ingesting schemaless data, where it leads to task failures and data loss.

I tried increasing parallelism and reducing the size of individual segments, but in practice this is hard to control, especially with schemaless data, since every column has a different cardinality and size. In one scenario, even though the data written per column is modest under normal conditions, an exceptional case that stores a large amount of stack-trace information can produce an extremely large stack column.

If we set the per-segment row limit very low just to prevent buffer overflows, we end up with a large number of segment files and a sharp growth in metadata, which in turn severely degrades server performance and the scheduling performance of the storage cluster.
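For context, the workaround I tried amounts to capping segment size via the dynamic partitioning limits in the ingestion spec. A minimal sketch of the relevant `tuningConfig` for a parallel batch task (the values are illustrative, not recommendations):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 1000000,
      "maxTotalRows": 20000000
    },
    "maxRowsInMemory": 100000,
    "maxBytesInMemory": 100000000
  }
}
```

Lowering `maxRowsPerSegment` does avoid the overflow, but it multiplies the segment count, which is exactly the metadata-growth tradeoff described above: a row-count limit cannot account for rows whose individual columns (like a stack column) are unexpectedly large.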
