Re: [PR] [WIP][Proposal] PARQUET-2430: Add parquet joiner [parquet-mr]

via GitHub Sat, 13 Apr 2024 15:00:54 -0700


MaxNevermind commented on PR #1273:
URL: https://github.com/apache/parquet-mr/pull/1273#issuecomment-2053772198


   @wgtmac @ConeyLiu 
   
   fyi
   Wanted to share an idea: we can use binary copy on the right side and skip 
the read-write for right columns. That is possible if the number of rows and 
their ordering in row groups are the same on the left and on the right. I 
believe related ideas were discussed in this PR. I originally doubted that a 
user would be able to produce files in such a fashion, but this week I’ve found 
a way to do that in Spark 3.3+. I can create a utility class ([similar to the 
one from 
above](https://gist.github.com/MaxNevermind/0feaaf380520ca34c2637027ef349a7d)) 
that takes the input files, runs transformations on them, and then writes the 
results in a way that preserves the original input files’ names, row group row 
counts and ordering. The benefits are of course speed, because of binary copy, 
and simplicity, as we don't need RightColumnWriter. I will try to validate my 
idea next week and get back with the result. If I’m able to do that, I suggest 
simplifying this PR or maybe splitting the effort into two: a simpler version 
that relies on plain binary copy and a second one based on the current state of 
this PR.
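
To make the precondition concrete, here is a minimal sketch of the alignment check a binary-copy path would need before stitching right-side column chunks next to left-side ones. It is not code from this PR: the function name `first_misaligned_row_group` is hypothetical, and the per-row-group row counts are assumed to have already been read from each file's footer and passed in as plain lists.

```python
from typing import Optional, Sequence


def first_misaligned_row_group(
    left_row_counts: Sequence[int],
    right_row_counts: Sequence[int],
) -> Optional[int]:
    """Return the index of the first row group that does not line up,
    or None when both files have identical row-group layouts.

    Hypothetical helper: each input is the list of row counts per row
    group, in file order, taken from the file footer.
    """
    # Compare row groups pairwise, in file order.
    for index, (left, right) in enumerate(zip(left_row_counts, right_row_counts)):
        if left != right:
            return index
    # All shared row groups match; a differing number of row groups
    # means the first extra row group has no counterpart.
    if len(left_row_counts) != len(right_row_counts):
        return min(len(left_row_counts), len(right_row_counts))
    return None


def can_binary_copy(
    left_row_counts: Sequence[int],
    right_row_counts: Sequence[int],
) -> bool:
    """True when row-group counts allow a binary copy of right columns."""
    return first_misaligned_row_group(left_row_counts, right_row_counts) is None
```

Note that this only validates row counts. Row ordering inside each row group cannot be checked from footer metadata, so it would still have to be guaranteed by whatever produced the files.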


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


