On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain <nj...@fb.com> wrote:

> That would be difficult. The % done can be estimated from the data already
> read.
>

I'm confused. Wouldn't one minus the ratio of the maximum size of the data
remaining to the maximum size of the original query's input give a
reasonable approximation of the amount of work done?
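For concreteness, the estimate I have in mind looks something like the
sketch below (bytesRead/totalInputBytes are made-up names for
illustration, not actual Hive counters):

// Progress as the fraction of the input already consumed.
public final class ProgressEstimate {

  // Fraction of work done, approximated as bytes read over total input,
  // i.e. 1 - (bytes remaining / total bytes).
  public static double fractionDone(long bytesRead, long totalInputBytes) {
    if (totalInputBytes <= 0) {
      return 0.0; // nothing is known about the input yet
    }
    return (double) bytesRead / (double) totalInputBytes;
  }

  public static void main(String[] args) {
    // e.g. 950 MB consumed out of a 1 GB scan => 95% done
    System.out.printf("%.0f%% done%n",
        100 * fractionDone(950000000L, 1000000000L));
  }
}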


>
> It might be simpler to have a check like: if the query isn't done in
> the first 5 seconds of running locally, you switch to mapreduce.
>
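If I understand the proposal, it amounts to something like the sketch
below (runLocally/runOnMapReduce are hypothetical stand-ins for the real
entry points, not actual Hive code):

import java.util.List;
import java.util.concurrent.*;

public final class LocalModeFallback {

  public static List<String> run(Callable<List<String>> runLocally,
                                 Callable<List<String>> runOnMapReduce)
      throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<List<String>> local = pool.submit(runLocally);
    try {
      // Give the local run five seconds; if it finishes, use its result.
      return local.get(5, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      // Otherwise kill it and start over on mapreduce, throwing away
      // whatever partial progress the local run had made.
      local.cancel(true);
      return runOnMapReduce.call();
    } finally {
      pool.shutdownNow();
    }
  }
}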

There are three problems I see:
  * If the query is 95% done at 5 seconds, it is a shame to kill it and
start over at 0% on mapreduce with a much longer latency. (Instead of
spending an additional 0.25 seconds, you spend an additional 60+.)
  * You can't print anything until you know whether you are going to kill
the local run or not. (The mapreduce results might come back in a
different order...) With user-facing programs, it is much better to start
printing early rather than late, since it gives the user faster feedback.
  * It makes how a query will run unpredictable, which makes it very hard
to build applications on top of Hive.

Do those make sense?
