Hi Ted. While you might not be able to use this to predict the query itself, my thought was to use it at the file/block level to keep more copies of the data that the queries would be processing, similar to how a traditional RDBMS caches tables. It might be a better discussion to have on the Hadoop list, though, as it wouldn't be specific to Drill.
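Combining the block-level idea above with the reactive route Ian describes downthread (score a block by its response time, regularized by the number of replicas it is already on), a minimal policy sketch might look like the following. All function names and thresholds here are hypothetical illustrations, not part of Drill or HDFS:

```python
# Sketch of a reactive replication policy: a block whose recent response
# times are high relative to its current replica count earns extra copies,
# and a cooled-down block falls back to the base replication factor.
# base_replicas / max_replicas / ms_per_extra_copy are invented knobs.

def target_replicas(mean_response_ms, current_replicas,
                    base_replicas=3, max_replicas=10,
                    ms_per_extra_copy=200.0):
    """Score = mean response time divided by the replica count (the
    regularization), with each ms_per_extra_copy of scored latency
    earning one more copy, clamped to [base_replicas, max_replicas]."""
    score = mean_response_ms / current_replicas
    extra = int(score // ms_per_extra_copy)
    return max(base_replicas, min(max_replicas, base_replicas + extra))
```

A monitor could periodically feed this per-block latency stats and apply the result via the filesystem's replication API; the clamp doubles as the recovery mechanism Worthy mentions, since the target decays back to the base factor once the block cools off.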
Regards,
Ian

On Sep 13, 2012, at 2:32 AM, Ted Dunning <[email protected]> wrote:

> I would like to point out that neither of these options (both good) would
> affect query processing, because replication is far too slow to help at
> query time.
>
> In another life, I found that we could predict the popularity of video items
> using only the very early life history of the items. Similarly, I have had
> good success predicting first-weekend and total lifetime revenue for a
> movie based on the first 3 hours on opening night. These are very
> different domains, but I would think that data assets might be subject to
> the same flash-crowd effects and thus be somewhat predictable given early
> interest.
>
> Seasonality and similar effects are also clearly visible in real customers.
> For instance, it is common for traffic summaries to be very popular for
> the first week and then have a popularity bump on the month, quarter, and
> annual anniversaries.
>
> On Tue, Sep 11, 2012 at 9:46 PM, Ian Holsman <[email protected]> wrote:
>
>> I don't know of any papers offhand, but I would think you could go down
>> two routes: a predictive trend algorithm to 'guess' which blocks could get
>> hot based on seasonal traffic, and a reactive one based on response time
>> regularized by the #replicas it is on.
>>
>> Sent from my iPhone
>>
>> On 12/09/2012, at 2:21 PM, Worthy LaFollette <[email protected]> wrote:
>>
>>> As Ian explained downthread, the paper gave two examples. The first was
>>> static seeding of duplicates; the second was dynamic, with a suggestion
>>> of a monitor which seeds additional copies based on some algorithm in
>>> response to "hot" queries (China being the topic of the example given).
>>> I am curious if anyone is aware of any papers about this second part.
>>> I can almost see a cost model where the query measures the overall cost
>>> of a query (latency, risk of latency?) and then generates copies in
>>> response.
>>> Part of this, of course, would be a recovery mechanism which removes
>>> these extra copies.
>>>
>>> W-
>>>
>>> On Tue, Sep 11, 2012 at 9:31 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> What do you mean by selective replication?
>>>>
>>>> On Tue, Sep 11, 2012 at 7:23 PM, Worthy LaFollette <[email protected]> wrote:
>>>>
>>>>> Very good paper. Am curious now about the strategies for selective
>>>>> replication, which, if done right, would make the query generation
>>>>> more efficient. Do you know of any papers on that subject?
>>>>>
>>>>> On Tue, Sep 11, 2012 at 1:37 PM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> Headed into Thursday's meetup, this paper by Jeff Dean provides a
>>>>>> very good description of strategies for getting fast response times
>>>>>> with variable-quality infrastructure.
>>>>>>
>>>>>> http://research.google.com/people/jeff/latency.html
>>>>>>
>>>>>> The key point here is that it is very important to have asynchronous
>>>>>> queries with a cancel. Above that level, there needs to be a simple
>>>>>> strategy for pushing second versions of queries out to the workers
>>>>>> and canceling defunct or redundant queries.

--
Ian Holsman
[email protected]
http://doitwithdata.com.au
PH: +61-400-988-964
Skype: iholsman
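Ted's point about asynchronous queries with a cancel (the "backup request" idea from the Dean paper) can be sketched in a few lines. This is an illustrative toy, not Drill code: `query_worker` is an invented stand-in for a real per-worker RPC, and the delays simulate network and processing time.

```python
import asyncio

# Sketch of hedged queries with a cancel: issue the query to one worker,
# and if it hasn't answered within the hedge deadline, push a second copy
# of the query to another worker. Whichever finishes first wins, and the
# redundant query is cancelled.

async def query_worker(worker, delay, result):
    await asyncio.sleep(delay)          # simulated network/processing time
    return (worker, result)

async def hedged_query(primary, backup, hedge_after=0.05):
    tasks = [asyncio.ensure_future(primary)]
    done, pending = await asyncio.wait(tasks, timeout=hedge_after)
    if not done:                        # primary is slow: hedge with a backup
        tasks.append(asyncio.ensure_future(backup))
        done, pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:                # cancel the now-redundant query
        task.cancel()
    return done.pop().result()

async def main():
    # The primary worker takes 0.5 s; the backup answers in 0.01 s,
    # so the hedged copy wins and the slow primary is cancelled.
    return await hedged_query(
        query_worker("w1", 0.5, "rows"),
        query_worker("w2", 0.01, "rows"))

print(asyncio.run(main()))
```

If the primary answers inside the hedge deadline, the backup is never sent at all, which keeps the extra load small in the common case.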
