puchengy commented on PR #7430:
URL: https://github.com/apache/iceberg/pull/7430#issuecomment-1526936388

   Hi @aokolnychyi, 
   
   > I am not going to oppose a SQL config but I don't think we should rely on 
an internal SQL property for built-in file sources.
   
   Trying to understand your stance here. Do you mean you are fine with the 
current change but against making it correlate to 
`spark.sql.files.maxPartitionBytes`? If so, I am fine with that.
   
   > Can we identify exact scenarios when the default split size performs 
poorly and check if we can solve the underlying problem?
   
   I can share two scenarios. They don't really lead to poor performance, but 
they made our platform team's life harder ("harder" meaning they made the 
migration work more challenging).
   
   (1) As mentioned above, when a SparkSQL job used to consume a Hive table 
with a large "spark.sql.files.maxPartitionBytes" value (for example, 1GB), 
changing the underlying table to Iceberg (which defaults to a 128MB split 
size) immediately increases the split count by 8x (in theory). This increases 
driver memory consumption and can cause the job driver to OOM. 
   
   (2) We have strict SLAs with our customers, which usually means that when 
we make a change to a SparkSQL job, we want to make sure the output stays the 
same (the number of files and the size of each file). In the case of an 
Iceberg migration, when the source table is changed from Hive to Iceberg, the 
change in split count directly increases the SparkSQL job's output file count 
by 8x (in theory). While we could further make the case that the increase is 
acceptable, it enlarges the surface of the work and thus slows down innovation.
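
   The reason the split count leaks into the output is that a map-only job 
writes one file per input partition; a minimal sketch (table and output path 
are made up):

   ```scala
   // A map-only pipeline: read -> filter -> write, with no shuffle in between.
   // Spark creates one task per input split, and each task writes its own
   // output file, so output file count ~= input split count.
   spark.read
     .table("db.events")                  // 8x more splits -> 8x more partitions
     .filter("event_date = '2023-04-01'") // narrow transformation, keeps partitioning
     .write
     .mode("overwrite")
     .parquet("/tmp/events_out")

   // A coalesce/repartition before the write (or AQE coalescing after a
   // shuffle) would decouple output file count from input split count.
   ```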



