There is already a JIRA for it: https://issues.apache.org/jira/browse/HIVE-887
Please add your thoughts to that JIRA.
Zheng

On Mon, Jan 18, 2010 at 10:19 AM, Raghu Murthy <[email protected]> wrote:
> Currently, only SELECT * with a partition predicate does not create an MR
> job. The client directly streams the files corresponding to the
> table/partition. I think there is a JIRA to also do simple projections and
> filters in the client.
>
> For now, if you know which queries are small, you could run MR locally on
> the client by providing the following options before running the query.
>
> set mapred.job.tracker=local;
> set mapred.local.dir=/tmp;
>
> On 1/18/10 10:10 AM, "Eric Sammer" <[email protected]> wrote:
>
> > All:
> >
> > I'm using Hive as a back end for a multilevel aggregation reporting
> > system. There is a core fact table that is currently multiple millions
> > of rows, but will quickly hit billions. A web service accepts user
> > queries, turns them into Hive queries, dynamically builds Hive tables
> > containing different aggregations that are commonly requested by front
> > end users if they don't exist, and eventually returns the results. This
> > web service is async, initially giving the UI a ticket which it then must
> > poll for results. These "summary" tables, which are usually 1 million
> > rows or fewer, are then queried instead of the raw fact table.
> >
> > In many cases, very simple WHERE clauses are applied to the summary
> > tables, which causes an MR job to be spawned. In cases where we know the
> > row count will be low, it would be very nice to avoid the MR overhead
> > and simply apply the filter inline and stream the results back
> > immediately. By simple, I mean a query containing only equality criteria
> > with no nested conditional groups, no GROUP BY clause, and no
> > complicated UDFs. The goal would be to reduce the time to begin
> > streaming results at the expense of data locality and the
> > centralization of the query execution.
> >
> > The summary tables generally have a block count of only 3 or so, so the
> > degree of parallelism being sacrificed is minimal (compared to true back
> > end queries that have no user waiting on results). I know Hive's stated
> > goals indicate that low latency isn't in the cards, but I don't consider
> > this to be low latency in the RDBMS sense. I'm thinking of reducing the
> > start of result streaming from a number of minutes to seconds.
> >
> > Obviously this is somewhat specialized, but I wanted to get people's
> > ideas on a keyword that could indicate this type of execution plan, the
> > feasibility within the code base, and the suitability of this being
> > within Hive's court. The alternative of building the results via Hive
> > and then having another (albeit simple) query layer comb through the
> > files is cumbersome to maintain. The notion that one could do something
> > within Hive is appealing because then, based on a threshold of row
> > count, you could resort to doing an MR job if necessary even if you
> > think you're querying a small table. Or, maybe Hive could fail with
> > some indication that the centralized query exceeds the configured
> > threshold.
> >
> > I wanted to throw this out to the list before filing a feature request
> > in JIRA for obvious reasons.
> >
> > Thoughts? Comments?
> > --
> > Eric Sammer
> > [email protected]
> > http://esammer.blogspot.com
> >

--
Yours,
Zheng
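
For illustration, a minimal sketch of the two client-side paths discussed in
the thread, as entered at the Hive CLI. The table, partition, and column
names (summary_daily, dt, country) are hypothetical, and dt is assumed to be
a partition column; only the two "set" options come from the thread itself.

  -- SELECT * restricted by a partition predicate: the client streams the
  -- partition's files directly, so no map-reduce job is launched
  -- (assumes dt is a partition column of the hypothetical summary_daily table)
  SELECT * FROM summary_daily WHERE dt = '2010-01-17';

  -- For other small queries, run the map-reduce job locally on the client
  -- using the options quoted above, then issue the filtered query
  set mapred.job.tracker=local;
  set mapred.local.dir=/tmp;

  SELECT * FROM summary_daily WHERE dt = '2010-01-17' AND country = 'US';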
