[GitHub] [iceberg] HeartSaVioR commented on pull request #1774: [iceberg-1746] Implement spark fanout writer

GitBox Mon, 16 Nov 2020 23:17:07 -0800


HeartSaVioR commented on pull request #1774:
URL: https://github.com/apache/iceberg/pull/1774#issuecomment-728736862



   OK, now I understand what "fanout" means. I didn't know about the Flink 
implementation. Thanks.
   
   I see the concern about the overall number of output files, but if I 
understand correctly, using the fanout writer would produce the same number of 
output files. It just eliminates the need for a "local sort", at the cost of 
keeping multiple files open for write at the same time. To get the best result 
for the number of output files, we still need to repartition by partition, 
regardless of whether the fanout writer is used.
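   
   To make this concrete, here is a toy Python sketch. It is not Iceberg or 
Spark code: `clustered_write`, `fanout_write`, `repartition` and the sample 
rows are hypothetical names and data made up for illustration. It models a 
"file" as an in-memory list and only counts how many files each task produces 
and how many are open at once.

```python
from collections import defaultdict


def clustered_write(rows):
    """One open file at a time; rows must already be grouped by partition."""
    files = []        # one entry per output file: (partition, values)
    closed = set()    # partitions whose file has already been closed
    current = None
    peak_open = 0
    for partition, value in rows:
        if partition != current:
            if partition in closed:
                # This is why the non-fanout path needs a local sort first.
                raise ValueError(f"partition {partition!r} seen after its file was closed")
            if current is not None:
                closed.add(current)
            current = partition
            files.append((partition, []))
            peak_open = 1
        files[-1][1].append(value)
    return len(files), peak_open


def fanout_write(rows):
    """Keeps one open file per partition seen so far; input order does not matter."""
    open_files = {}
    peak_open = 0
    for partition, value in rows:
        open_files.setdefault(partition, []).append(value)
        peak_open = max(peak_open, len(open_files))
    return len(open_files), peak_open


def total_files(tasks):
    """Total output files when every task writes its own files."""
    return sum(fanout_write(rows)[0] for rows in tasks)


def repartition(tasks):
    """Simplified shuffle: each partition value ends up in exactly one task."""
    by_partition = defaultdict(list)
    for rows in tasks:
        for row in rows:
            by_partition[row[0]].append(row)
    return list(by_partition.values())


# Hypothetical rows for a single task: (partition value, payload).
task_rows = [("2020-11-16", 1), ("2020-11-17", 2), ("2020-11-16", 3), ("2020-11-18", 4)]

# Same number of files either way; only the number of open files differs.
print(clustered_write(sorted(task_rows)))  # (3, 1) after a local sort
print(fanout_write(task_rows))             # (3, 3) without any sort

# Without a repartition, two tasks both write a file for 2020-11-16.
unshuffled = [task_rows[:2], task_rows[2:]]
print(total_files(unshuffled))               # 4
print(total_files(repartition(unshuffled)))  # 3
```

   In this model the choice of writer changes only the peak number of open 
files per task, while the total file count is determined by how rows are 
distributed across tasks.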
   
   Another question: is it better to have the flag in the table properties, or 
to have the option on the Spark Iceberg sink? The actual concern would be 
predicting how many files need to be open together for write. This depends 
heavily on the partition cardinality of the output, which might depend on the 
characteristics of the data, but might also be "query dependent", as when we 
consider batch vs. streaming. I'm not maintaining an Iceberg table at 
production scale, so I can't say. Probably @aokolnychyi would have some insight 
on this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


