Allow for output inspection in realtime; perhaps in log files, but somewhere?
-----------------------------------------------------------------------------

                 Key: HIVE-492
                 URL: https://issues.apache.org/jira/browse/HIVE-492
             Project: Hadoop Hive
          Issue Type: Wish
          Components: Logging
            Reporter: Adam Kramer


Many queries run for a long time and only then fail, either because the job 
itself errors out or because the output data is not what was intended.

The cause is almost always an error in a mapper or a reducer. We can track it 
down in several ways, most often by running the query piece by piece and seeing 
where the "wrong" output first appears. This process is time-consuming and puts 
a considerable load on the system (e.g., repeating big queries while debugging 
transformers or syntax). The problem gets worse when a single query uses 
multiple transforms and several mapreduce steps.

One way to reduce the debugging overhead would be to expose actual output 
through some logging mechanism. Specifically, I mean having EVERY mapper and/or 
reducer write its first five lines of output to some user-readable file. This 
would let a user see what each part of the system is doing and potentially 
locate the user error from ONE failed query statement.
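
As a rough illustration of the idea, and not something Hive provides today, a 
transform script could copy its own first few output lines to stderr, which 
Hadoop normally keeps in the per-task logs. The names `tee_head`, `transform` 
and `run`, the "SAMPLE" prefix, and the default of five lines are all 
assumptions made for this sketch.

```python
import io
import sys
from typing import Iterable, Iterator, TextIO


def tee_head(rows: Iterable[str], limit: int = 5,
             log: TextIO = sys.stderr) -> Iterator[str]:
    """Yield every row unchanged, copying the first `limit` rows to `log`."""
    for index, row in enumerate(rows):
        if index < limit:
            # Prefix the copy so sampled rows are easy to find in a task log.
            log.write("SAMPLE\t" + row.rstrip("\n") + "\n")
        yield row


def transform(line: str) -> str:
    # Placeholder for the user's real mapper or reducer logic (hypothetical).
    return line.upper()


def run(source: Iterable[str], sink: TextIO, log: TextIO,
        limit: int = 5) -> None:
    """Apply `transform` to each input line and sample the first outputs."""
    outputs = (transform(line) for line in source)
    for row in tee_head(outputs, limit, log):
        sink.write(row)


if __name__ == "__main__":
    # In-memory demo; a real transform script would call
    # run(sys.stdin, sys.stdout, sys.stderr) instead.
    demo_input = io.StringIO("alpha\nbeta\ngamma\n")
    run(demo_input, sys.stdout, sys.stderr, limit=2)
```

Because the sample is taken from the script's own output stream, the full 
output still reaches the next stage untouched; only the logged copy is 
truncated.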

Of course, 5 lines * (20000 mappers + 300 reducers) is a lot of overhead; 
making the amount user-configurable and/or estimated beforehand (at least 5 
lines from at least 5 mappers and at least 5 reducers) would be fine, as would 
having these output logs auto-delete after some timeframe (a day, perhaps).
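
To make the overhead trade-off concrete, here is a small sketch of how such 
settings might be modelled. Everything in it is hypothetical: `SampleConfig`, 
its field names, and the policy of sampling the lowest-numbered tasks are 
assumptions for illustration, not existing Hive options.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleConfig:
    """Hypothetical user-facing settings; none of these exist in Hive."""
    lines_per_task: int = 5
    max_mappers: Optional[int] = 5    # None means "sample every mapper"
    max_reducers: Optional[int] = 5   # None means "sample every reducer"
    retention_hours: float = 24.0     # sample logs older than this are deleted

    def expired(self, age_hours: float) -> bool:
        """Return True when a sample log is old enough to auto-delete."""
        return age_hours >= self.retention_hours


def sampled_tasks(total: int, cap: Optional[int]) -> int:
    """Number of tasks that would actually write a sample."""
    return total if cap is None else min(total, cap)


def estimate_lines(cfg: SampleConfig, mappers: int, reducers: int) -> int:
    """Estimate, before the job runs, how many sample lines it would write."""
    tasks = (sampled_tasks(mappers, cfg.max_mappers)
             + sampled_tasks(reducers, cfg.max_reducers))
    return cfg.lines_per_task * tasks


def should_sample(task_index: int, cap: Optional[int]) -> bool:
    """Sample the lowest-numbered tasks until the cap is reached."""
    return cap is None or task_index < cap
```

Under this model, a job with 20000 mappers and 300 reducers would write 101,500 
sample lines if every task were sampled, since each task contributes its own 
five lines, and only 50 lines with the caps of 5 mappers and 5 reducers.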

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.