cnauroth opened a new pull request, #3722:
URL: https://github.com/apache/hive/pull/3722

   ### What changes were proposed in this pull request?
   
   Improve logging in `DagUtils#localizeResource` to clarify the root cause when 
localizing a resource fails.
   
   ### Why are the changes needed?
   
   While creating a Tez session, `DagUtils#localizeResource` is responsible for 
copying the client's hive-exec.jar into HDFS (`hive.jar.directory`). This 
process can be triggered from multiple threads concurrently, in which case one 
thread performs the copy while the others wait, polling for arrival of the 
destination file.
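
   The hand-off described above can be pictured with a minimal sketch. This is not the code in `DagUtils`; it assumes the local file system instead of HDFS, and the class `LocalizeSketch`, the `.inprogress` marker file, and the `pollMillis`/`maxPolls` parameters are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalizeSketch {
    /**
     * Copies src to dest unless another caller has already claimed the copy.
     * Hypothetical simplification: a marker file decides who the writer is.
     */
    static void localize(Path src, Path dest, long pollMillis, int maxPolls)
            throws IOException, InterruptedException {
        Path marker = dest.resolveSibling(dest.getFileName() + ".inprogress");
        try {
            // createFile is atomic, so exactly one caller becomes the writer.
            Files.createFile(marker);
        } catch (FileAlreadyExistsException e) {
            // Someone else is (or was) the writer: poll for the destination file.
            for (int i = 0; i < maxPolls; i++) {
                if (Files.exists(dest) && Files.size(dest) == Files.size(src)) {
                    return;
                }
                Thread.sleep(pollMillis);
            }
            throw new IOException("Timed out waiting for " + dest);
        }
        // Writer path: copy to a temporary name, then publish with a rename.
        Path tmp = dest.resolveSibling(dest.getFileName() + ".tmp");
        Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

   Note that in this sketch a polling caller can fail for reasons of its own (for example, `Files.size` throwing because the destination was tampered with), independently of whether the writer succeeded.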
   
   If an `IOException` occurs during this process, the code assumes that the 
thread attempting the write failed, and all other threads abort. No information 
about the underlying `IOException` is logged. Instead, the log states "previous 
writer likely failed to write." In some cases, though, the `IOException` can 
occur on a polling thread for reasons unrelated to the writing thread. For 
example, in a production incident, the actual root cause was that an external 
process had corrupted the copy of hive-exec.jar in `hive.jar.directory`, 
causing the file length validation check in `DagUtils#checkPreExisting` to 
fail. Because the logs said nothing about this, the problem was much harder to 
troubleshoot.
   
   This patch clarifies the logging by stating that a failure on the writing 
thread is just one possible reason for the error. It also logs the exception 
stack trace to make it easier to find the real root cause. This is a patch I 
ran to help recover from the production incident.
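
   As a rough illustration of the idea (not the actual diff), the sketch below logs the caught exception together with a message that names a failed writer as only one possible cause. It uses `java.util.logging` so that it runs standalone; the class `LocalizeLoggingSketch`, the methods `checkLength` and `waitForResource`, and the message wording are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LocalizeLoggingSketch {
    private static final Logger LOG =
            Logger.getLogger(LocalizeLoggingSketch.class.getName());

    /** Hypothetical stand-in for a file length validation check. */
    static void checkLength(Path src, Path dest) throws IOException {
        long expected = Files.size(src);
        long actual = Files.size(dest);
        if (expected != actual) {
            throw new IOException("Length mismatch for " + dest
                    + ": expected " + expected + " bytes, found " + actual);
        }
    }

    /** Returns true if dest passes validation; logs the real cause otherwise. */
    static boolean waitForResource(Path src, Path dest) {
        try {
            checkLength(src, dest);
            return true;
        } catch (IOException e) {
            // Passing the exception as the last argument makes the logger
            // record its message and stack trace alongside the summary.
            LOG.log(Level.SEVERE, "Could not localize " + dest
                    + ". A failed previous writer is one possible cause;"
                    + " see the logged exception for details.", e);
            return false;
        }
    }
}
```

   With a destination file whose length differs from the source, `waitForResource` returns `false` and the log carries the length-mismatch message and stack trace rather than only a guess about the writer.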
   
   ### Does this PR introduce _any_ user-facing change?
   
   There is no behavior change, but the logging output changes.
   
   ### How was this patch tested?
   
   This patch was deployed as part of resolving the production incident that I 
mentioned. I was also able to create a reproduction in a test environment by 
externally overwriting the hive-exec.jar in `hive.jar.directory` to simulate 
the production incident.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

