All:

I'm using Hive as a back end for a multilevel aggregation reporting
system. There is a core fact table that currently holds several million
rows, but will quickly hit billions. A web service accepts user
queries, turns them into Hive queries, dynamically builds Hive tables
containing the aggregations commonly requested by front end users if
they don't already exist, and eventually returns the results. This web
service is async: it initially gives the UI a ticket, and the UI must
then poll for results. These "summary" tables, which usually have 1
million rows or fewer, are then queried instead of the raw fact table.

In many cases, very simple WHERE clauses are applied to the summary
tables, which causes an MR job to be spawned. In cases where we know the
row count will be low, it would be very nice to avoid the MR overhead
and simply apply the filter inline and stream the results back
immediately. By simple, I mean a query containing only equality criteria
with no nested conditional groups, no GROUP BY clause, and no
complicated UDFs. The goal would be to reduce the time to begin
streaming results at the expense of data locality and at the cost of
centralizing query execution.
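
To make the "simple" case concrete, here is a rough Python sketch of
what I mean by applying the filter inline. It is only an illustration,
not anything that exists in Hive: the function name stream_matches is
made up, the equality criteria are assumed to be ANDed together, and
the table is assumed to be stored as text using Hive's default Ctrl-A
field delimiter.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption: the summary table is stored as delimited text using
# Hive's default field delimiter (Ctrl-A).
FIELD_DELIM = "\x01"


def stream_matches(
    lines: Iterable[str],
    columns: List[str],
    equals: Dict[str, str],
) -> Iterator[Dict[str, str]]:
    """Yield rows whose columns equal the requested values.

    Hypothetical helper: only flat, ANDed equality criteria are
    supported, and values are compared as raw strings.
    """
    # Resolve column names to positions once, before scanning.
    wanted = {columns.index(name): value for name, value in equals.items()}
    for line in lines:
        fields = line.rstrip("\n").split(FIELD_DELIM)
        # Rows are yielded as soon as they match, so the caller can
        # start streaming results without waiting for the full scan.
        if all(fields[pos] == value for pos, value in wanted.items()):
            yield dict(zip(columns, fields))
```

Because this is a generator, the first matching row is available to
the caller as soon as it is read, which is the whole point: no job
setup, just a scan of a few blocks.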

The summary tables generally have a block count of only 3 or so, so the
degree of parallelism being sacrificed is minimal (compared to true back
end queries that have no user waiting on results). I know Hive's stated
goals indicate that low latency isn't in the cards, but I don't consider
this to be low latency in the RDBMS sense. I'm thinking of reducing the
time until results start streaming from a number of minutes to seconds.

Obviously this is somewhat specialized, but I wanted to get people's
ideas on a keyword that could indicate this type of execution plan, its
feasibility within the code base, and whether this belongs in Hive at
all. The alternative, building the results via Hive and then having
another (albeit simple) query layer comb through the files, is
cumbersome to maintain. The notion that one could do something within
Hive is appealing because then, based on a row count threshold, Hive
could resort to doing an MR job if necessary even if you think you're
querying a small table. Or, maybe Hive could fail with some indication
that the centralized query exceeds the configured threshold.
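
The threshold behavior described above could look roughly like the
following Python sketch. All of the names here (choose_plan,
ThresholdExceeded, fallback_to_mr) are hypothetical and only illustrate
the decision; how Hive would obtain the row count estimate is left
open.

```python
class ThresholdExceeded(Exception):
    """Raised when an inline query would scan more rows than allowed."""


def choose_plan(estimated_rows: int, threshold: int,
                fallback_to_mr: bool = True) -> str:
    """Pick an execution strategy for a simple filter query.

    Hypothetical sketch: returns "inline" for small tables, and
    otherwise either falls back to "mapreduce" or fails, depending on
    configuration.
    """
    if estimated_rows <= threshold:
        return "inline"
    if fallback_to_mr:
        return "mapreduce"
    raise ThresholdExceeded(
        f"estimated {estimated_rows} rows exceeds threshold {threshold}"
    )
```

Either outcome works for my use case; the important part is that the
caller does not have to know the table size up front.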

I wanted to throw this out to the list before filing a feature request
in JIRA for obvious reasons.

Thoughts? Comments?
-- 
Eric Sammer
[email protected]
http://esammer.blogspot.com
