All: I'm using Hive as a back end for a multilevel aggregation reporting system. There is a core fact table that is currently multiple millions of rows, but will quickly hit billions. A web service accepts user queries, turns them into Hive queries, dynamically builds Hive tables containing different aggregations that are commonly requested by front end users if they don't exist, and eventually returns the results. This web service is async, initially giving the UI a ticket which then must poll for results. These "summary" tables, which are usually 1 million or less in row count, are then queried instead of the raw fact table.
In many cases, very simple WHERE clauses are applied to the summary tables which causes a MR job to be spawned. In cases where we know the row count will be low, it would be very nice to avoid the MR overhead and simply apply the filter inline and stream the results back immediately. By simple, I mean a query containing only equality criteria with no nested conditional groups, no GROUP BY clause, and no complicated UDFs. The goal would be to reduce the time to begin streaming results at the expensive of data locality and the centralization of the query execution. The summary tables generally have a block count of only 3 or so, so the degree of parallelism being sacrificed is minimal (compared to true back end queries that have no user waiting on results). I know Hive's stated goals indicate that low latency isn't in the cards, but I don't consider this to be low latency in the RDBMS sense. I'm thinking of reducing the start of result streaming from a number of minutes to seconds. Obviously this is somewhat specialized, but I wanted to get people's ideas on a keyword that can indicate this type of execution plan, the feasibility within the code base, and the suitability of this being within Hive's court. The alternatives of building the results via Hive and then having another (albeit simple) query layer to comb through the files is cumbersome to maintain. The notion that one could do something within Hive is appealing because then, based on a threshold of row count, you could resort to doing a MR job if necessary even if you think you're querying a small table. Or, maybe Hive can fail with some indication that centralized query exceeds the configured threshold. I wanted to throw this out to the list before filing a feature request in JIRA for obvious reasons. Thoughts? Comments? -- Eric Sammer [email protected] http://esammer.blogspot.com
