CodingCat commented on pull request #3663:
URL: https://github.com/apache/iceberg/pull/3663#issuecomment-986365646


   Hi @kbendick and @rdblue, thank you very much for the review! IIUC, besides the concern about too many threads (which I will address in code), your comments are both related to the scope of HadoopTable. Let me try to address them here; I would love to hear whether this makes sense to you.
   
   **First, this PR is only about multi-threading, not multiple processes.** The reason we care about multi-threaded operation is that it is much more implicit than multiple processes. Users can easily see that more than one Spark application is running in the cluster, but they do not necessarily realize that multiple threads are appending to the same table from within the same Spark application. That is exactly our case: an event-driven append workload where each event is processed by a Spark job triggered from a separate thread. So, as @kbendick said, this is best-effort work, and as @rdblue said, the multiple-process limitation is still there.
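   
   To make that scenario concrete, here is a minimal sketch of several threads inside one JVM appending to the same Hadoop-backed table. This is not code from the PR; the table location, thread count, and data-file writer are made up for illustration.
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.hadoop.HadoopTables;
   
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   
   public class ConcurrentAppendSketch {
     public static void main(String[] args) {
       // Hypothetical table location, for illustration only.
       Table table = new HadoopTables(new Configuration())
           .load("hdfs://warehouse/db/events");
   
       ExecutorService pool = Executors.newFixedThreadPool(8);
       for (int i = 0; i < 8; i++) {
         pool.submit(() -> {
           DataFile file = writeDataFileSomehow();  // placeholder for the actual write
           // Each thread races on the same Hadoop table metadata; without the
           // retry behaviour discussed in this PR, losing threads surface a
           // CommitFailedException instead of retrying.
           table.newAppend().appendFile(file).commit();
         });
       }
       pool.shutdown();
     }
   
     private static DataFile writeDataFileSomehow() {
       throw new UnsupportedOperationException("illustration only");
     }
   }
   ```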
   
   Second, regarding @kbendick's concern about **how to prevent this from growing into an endless effort to bring HadoopTable up to the capability of other catalog implementations, i.e. eventually making HadoopTable support distributed commits**: I think this is a valid concern about the community's resource allocation and the project's roadmap. My personal opinion is that we should find some way to explicitly fail the application (in this PR or a follow-up, depending on the workload) when multiple processes are found committing, which not only serves as a hard guard but also protects users who did not read the docs from silently losing data.
   
   Third, regarding @rdblue's suggestion of **making HadoopTableOperations support distributed commits by leveraging DynamoLockManager**: I am hesitant to go down this path for now, and please do let me know if I misunderstood anything. Essentially, we would be making iceberg-core depend on iceberg-aws, which would introduce a circular dependency as well as a counter-intuitive architecture with a hardcoded AWS-specific component...
   
   Lastly, as I said in the background info of the PR, setups like ours have been operating this way for a while. There was a period when every customer was encouraged to write to a Parquet path instead of registering a table in HMS, which has left hundreds of our data applications talking to the file system directly instead of going through HMS. The effort here is also about reaching functional parity with Delta Lake.
   
   
   
   Regarding the action items:
   
   1. I will definitely address the thread-count issue; the retry limit of 1000 and the thread count are conservative numbers that are easy to lower. In theory, the retry limit can be <= the thread count (see the sketch after this list).
   2. If you all agree, I will try to find a way to fail applications explicitly when there are distributed commits against a HadoopTable... I will share my findings, and depending on the complexity, we can decide how to move forward.
   
   Looking forward to your feedback!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to