andreAmorimF commented on PR #55689:
URL: https://github.com/apache/spark/pull/55689#issuecomment-4501395404

   > @andreAmorimF I don't think this should be added as a Spark Connect API
   > for a couple of reasons:
   >
   > * Spark Connect is supposed to be engine agnostic. Leaking execution
   >   details into the API is not really desirable.
   > * AFAICT the only reason why this would be needed is because you want to
   >   modify parallelism at some stage in the plan. At the end of the day this
   >   should be an engine problem, and we should try to fix it there.
   
   Hi @hvanhovell, thanks for the reply. I think the API already takes
execution details into account (e.g. `repartition`), and I wonder whether that
API will also be removed eventually. So far, one great thing about using Spark
for me has been the ability to tune this close to the client, and I don't see
any current work streams that would do the same on the engine side.
   
   Would you be willing to accept contributions in this direction? I think
giving the engine some heuristics for estimating the partitions of a query,
via configuration parameters (e.g. MAX_SIZE_OF_INPUTS or MIN_SIZE_OF_INPUTS),
could be a start toward what I want.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

