MaxNevermind commented on PR #1273: URL: https://github.com/apache/parquet-mr/pull/1273#issuecomment-2053772198
@wgtmac @ConeyLiu FYI, I wanted to share an idea: we can use binary copy on the right side and skip the read-write step for the right columns. That is possible if the number of rows and their ordering within row groups are the same on the left and on the right. I believe related ideas were discussed in this PR. I originally doubted that a user could produce files in such a fashion, but this week I’ve found a way to do that in Spark 3.3+. I can create a utility class ([similar to the one from above](https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d)) that takes the input files, runs transformations on them, and then writes the results in a way that preserves the original input files’ names, row group row counts, and ordering. The benefits are, of course, speed, because of the binary copy, and simplicity, as we don't need RightColumnWriter.

I will try to validate my idea next week and get back with the result. If I’m able to do that, I suggest simplifying this PR, or maybe splitting the effort into two: a simpler version for plain binary copy, and a second one based on the current state of this PR.
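
To make the precondition concrete, here is a minimal sketch of the row-group check it implies. It assumes the per-row-group row counts have already been read from each file's footer (in parquet-mr these would presumably come from `BlockMetaData.getRowCount()`). The class and method names are hypothetical and are not part of this PR. Row counts alone cannot prove that rows appear in the same order within a row group, so the ordering still has to be guaranteed by whatever wrote the files.

```java
import java.util.List;

/** Hypothetical helper for illustration only; not part of parquet-mr. */
final class RowGroupAlignment {
    private RowGroupAlignment() {}

    /**
     * Returns true when both files have the same number of row groups and
     * the row groups at each position hold the same number of rows.
     * This is a necessary condition for binary copy of the right side,
     * not a sufficient one: row ordering is not checked here.
     */
    static boolean sameRowGroupLayout(List<Long> leftRowCounts, List<Long> rightRowCounts) {
        if (leftRowCounts.size() != rightRowCounts.size()) {
            return false;
        }
        for (int i = 0; i < leftRowCounts.size(); i++) {
            // Compare boxed values with equals, not ==.
            if (!leftRowCounts.get(i).equals(rightRowCounts.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could run this check first and fall back to the read-write path whenever it returns false.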
