CodingCat commented on pull request #3663: URL: https://github.com/apache/iceberg/pull/3663#issuecomment-986365646
Hi @kbendick and @rdblue, thank you very much for the review! IIUC, besides the concern about the number of threads (which I will address in code), both of your comments relate to the scope of HadoopTable. Let me try to address them here; I'd love to hear whether this makes sense to you.

**First, this PR is only about multi-threading, not multiple processes.** The reason multi-threaded operation matters is that it is much more implicit than multiple processes: users can easily tell that more than one Spark application is running in the cluster, but they do not necessarily realize that multiple threads are appending to the same table from within a single Spark application. That is exactly our case: an event-driven append workload in which each event is processed by a Spark job triggered from a separate thread. So, as @kbendick said, this is a best-effort improvement, and as @rdblue said, the multi-process limitation remains.

Second, regarding @kbendick's concern about **how to prevent this from becoming an endless effort to bring HadoopTable up to the capabilities of the other catalog implementations, i.e. eventually making HadoopTable support distributed commits**: I think this is a valid concern about the community's resource allocation and the project's roadmap. My personal opinion is that we should find a way to fail the application explicitly (in this PR or a follow-up, depending on the workload) when multiple processes are found committing. That would serve not only as a hard guard but would also prevent users who didn't read the docs from silently losing data.

Third, regarding @rdblue's suggestion about **making HadoopTableOperations support distributed commits by leveraging DynamoLockManager**: I am hesitant to go down this path for now; please do let me know if I have misunderstood anything. Essentially, it would make iceberg-core depend on iceberg-aws, which introduces a circular dependency as well as a counter-intuitive architecture with a hard-coded component from AWS.
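To make the multi-threaded scenario concrete, here is a minimal, hedged sketch of the optimistic-commit pattern the PR hardens. It uses only the JDK (an `AtomicInteger` stands in for the table's version pointer; the class and method names are illustrative, not Iceberg's API) and also shows why, when each thread commits once, a retry limit equal to the thread count is always enough: every failed compare-and-set means some other thread committed, and each of the other N-1 threads can win at most once.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stdlib analogy to optimistic table commits; not Iceberg code.
public class ConcurrentCommitSketch {
  // Stand-in for the table's current version (e.g. the version-hint file).
  static final AtomicInteger version = new AtomicInteger(0);

  // One optimistic commit: read the current version, try to advance it by one.
  static void commit(int maxRetries) {
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
      int base = version.get(); // "refresh" the table state
      if (version.compareAndSet(base, base + 1)) {
        return; // commit succeeded
      }
      // Another thread committed first; retry against the new version.
    }
    throw new IllegalStateException("commit failed after " + maxRetries + " retries");
  }

  public static void main(String[] args) throws Exception {
    int threads = 8;
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch done = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
      // Retry limit == thread count: a thread can lose the race at most
      // threads - 1 times when each thread commits exactly once.
      pool.submit(() -> { commit(threads); done.countDown(); });
    }
    done.await();
    pool.shutdown();
    System.out.println("successful commits: " + version.get());
  }
}
```

With 8 threads each committing once, all 8 commits succeed within the bounded retries, which is the intuition behind keeping the retry limit <= the thread count rather than an arbitrary large number.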
Lastly, as I said in the background info of the PR, companies like ours have been operating in such an environment for a while. There was a period when every customer was encouraged to write to Parquet paths instead of registering tables in HMS, which has left hundreds of our data applications talking to the file system directly instead of to HMS. The effort here is also about reaching functional parity with Delta Lake.

Regarding the action items:
1. I will definitely address the thread-count issue. The retry limit of 1000 and the thread count are conservative numbers and easy to lower; in theory, the retry limit can be <= the thread count.
2. If you all agree, I will try to find a way to fail applications explicitly when there are distributed commits to a HadoopTable. I will share my findings, and depending on the complexity, we can decide how to move forward.

Love to hear your feedback!
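For action item 2, one conceivable mechanism for failing explicitly is an exclusive marker file, since `java.nio.file.Files.createFile` is specified to check-and-create atomically. This is only a hedged illustration of the idea on a local file system (class, method, and file names are invented for this sketch; HDFS and object-store semantics differ, and this is not necessarily the approach the PR will take):

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: fail fast when another process appears to be committing.
public class ProcessGuardSketch {
  // Files.createFile is atomic: exactly one caller can create the guard file.
  static void acquireOrFail(Path guard) {
    try {
      Files.createFile(guard);
    } catch (FileAlreadyExistsException e) {
      throw new IllegalStateException("explicit failure: another process is committing", e);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("table-metadata");
    Path guard = dir.resolve("commit.lock"); // hypothetical guard-file name
    acquireOrFail(guard); // the first "process" wins
    try {
      acquireOrFail(guard); // a second committer fails loudly instead of losing data
    } catch (IllegalStateException expected) {
      System.out.println(expected.getMessage());
    } finally {
      Files.deleteIfExists(guard); // release the guard after the commit
    }
  }
}
```

The point of the sketch is the failure mode: the second committer gets an explicit exception rather than silently clobbering the first one's commit.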
