marin-ma commented on issue #11397: URL: https://github.com/apache/incubator-gluten/issues/11397#issuecomment-3791805106
In spark hive partition write, sort is added in file `V1Writes` to sort records by the partition key before writing. If the next record has a different partition key, the current writer is closed and a new writer is created. Therefore, only one writer is alive at a time during the write. In Gluten, this sort is removed by rule `RemoveNativeWriteFilesSortAndProject`. Because Velox writer creates a new writer for each partition key, and holds all writers in a vector until the write task ends. In this case, there's no benefit to sort the data, but holding too many `Writer` instances in each task can consume too much memory and may result in OOM. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
