Allow for output inspection in realtime; perhaps in log files, but somewhere?
-----------------------------------------------------------------------------
Key: HIVE-492
URL: https://issues.apache.org/jira/browse/HIVE-492
Project: Hadoop Hive
Issue Type: Wish
Components: Logging
Reporter: Adam Kramer
Many queries take a long time to complete, and then fail (either because the
job fails or because the output data is not what was desired).
This is almost always traceable to an error in a mapper or a reducer, which we
can track down in several ways, most often by running the query piece by piece
and seeing where the "wrong" output first appears. This process is
time-consuming and puts considerable load on the system (e.g., repeating big
queries while trying to debug transformers or syntax). The problem is worse
when a single query uses multiple transforms and several mapreduce steps.
One way to reduce the debugging overhead would be to expose actual output
through some logging mechanism. Specifically, I mean to have EVERY mapper
and/or reducer write the first five lines of its output to some user-readable
file. This would let a user see what each part of the system is doing, and
potentially locate the user error from ONE failed query statement.
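To make the idea concrete, here is a minimal sketch of the "first five lines"
capture, written as a wrapper a transform script could put around its own
output. It is not an existing Hive feature: the function name tee_head, the
sample sink, and the default limit of 5 are all assumptions made for
illustration.

```python
import io
from typing import Iterable, Iterator, TextIO


def tee_head(records: Iterable[str], sample: TextIO, limit: int = 5) -> Iterator[str]:
    """Yield every record unchanged, copying the first `limit` of them to `sample`.

    `sample` is a hypothetical user-readable sink (for example, an open log file).
    """
    for index, record in enumerate(records):
        if index < limit:
            # Keep one record per line in the sample, whatever the input looked like.
            sample.write(record if record.endswith("\n") else record + "\n")
        yield record


if __name__ == "__main__":
    # Stand-in for a mapper's output stream; a real script would write to stdout.
    sample_log = io.StringIO()
    rows = (f"key{i}\tvalue{i}\n" for i in range(1000))
    emitted = sum(1 for _ in tee_head(rows, sample_log))
    print(f"emitted {emitted} rows, sampled {len(sample_log.getvalue().splitlines())}")
```

Because the wrapper is a generator, the full output still streams through
untouched; only the first few records are duplicated into the sample sink.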
Of course, 5 lines from each of 20000 mappers and 300 reducers is a lot of
overhead; making the sampling user-configurable and/or estimated beforehand
(for example, at least 5 lines from at least 5 mappers and at least 5
reducers) would be fine, as would auto-deleting these output logs after some
timeframe (a day, perhaps).
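The two knobs mentioned here, limiting how many tasks write samples and
expiring old sample files, could look roughly like the sketch below. The
names sampled_tasks and prune_samples, the even-spacing selection rule, and
the one-day default are assumptions chosen for illustration, not existing Hive
or Hadoop settings.

```python
import time
from pathlib import Path
from typing import List, Optional, Set


def sampled_tasks(total_tasks: int, wanted: int = 5) -> Set[int]:
    """Pick up to `wanted` task indices spread evenly over range(total_tasks)."""
    if total_tasks <= wanted:
        return set(range(total_tasks))
    step = total_tasks / wanted
    return {int(i * step) for i in range(wanted)}


def prune_samples(directory: Path,
                  max_age_seconds: float = 86400.0,
                  now: Optional[float] = None) -> List[Path]:
    """Delete files in `directory` last modified more than `max_age_seconds` ago.

    Returns the paths that were removed. `now` can be injected for testing.
    """
    now = time.time() if now is None else now
    removed = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

With 20000 mappers and the defaults above, only 5 tasks would write samples, so
that stage would produce 25 sample lines at 5 lines per task.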