Hi Ted. While you might not be able to use this to predict the query itself, my thought was to use it at the file/block level to keep more copies of the data that the queries would be processing, similar to how a traditional RDBMS caches tables. It might be a better discussion to have on the Hadoop list, though, as it wouldn't be specific to Drill.
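Combining the block-level idea above with the reactive route Ian describes downthread (score a block by its response time, regularized by the number of replicas it is already on), a minimal policy sketch might look like the following. All function names and thresholds here are hypothetical illustrations, not part of Drill or HDFS:

```python
# Sketch of a reactive replication policy: a block whose recent response
# times are high relative to its current replica count earns extra copies,
# and a cooled-down block falls back to the base replication factor.
# base_replicas / max_replicas / ms_per_extra_copy are invented knobs.

def target_replicas(mean_response_ms, current_replicas,
                    base_replicas=3, max_replicas=10,
                    ms_per_extra_copy=200.0):
    """Score = mean response time divided by the replica count (the
    regularization), with each ms_per_extra_copy of scored latency
    earning one more copy, clamped to [base_replicas, max_replicas]."""
    score = mean_response_ms / current_replicas
    extra = int(score // ms_per_extra_copy)
    return max(base_replicas, min(max_replicas, base_replicas + extra))
```

A monitor could periodically feed this per-block latency stats and apply the result via the filesystem's replication API; the clamp doubles as the recovery mechanism Worthy mentions, since the target decays back to the base factor once the block cools off.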
Regards,
Ian

On Sep 13, 2012, at 2:32 AM, Ted Dunning <[email protected]> wrote:

> I would like to point out that neither of these options (both good) would
> affect query processing, because replication is far too slow to help at
> query time.
>
> In another life, I found that we could predict the popularity of video items
> using only the very early life history of the items. Similarly, I have had
> good success predicting first-weekend and total lifetime revenue for a
> movie based on the first 3 hours on opening night. These are very
> different domains, but I would think that data assets might be subject to
> the same flash-crowd effects and thus be somewhat predictable given early
> interest.
>
> Seasonality and similar effects are also clearly visible in real customers.
> For instance, it is common for traffic summaries to be very popular for
> the first week and then have a popularity bump on the month, quarter, and
> annual anniversaries.
>
> On Tue, Sep 11, 2012 at 9:46 PM, Ian Holsman <[email protected]> wrote:
>
>> I don't know of any papers offhand, but I would think you could go down
>> two routes: a predictive trend algorithm to 'guess' which blocks could get
>> hot based on seasonal traffic, and a reactive one based on response time
>> regularized by the #replicas it is on.
>>
>> Sent from my iPhone
>>
>> On 12/09/2012, at 2:21 PM, Worthy LaFollette <[email protected]> wrote:
>>
>>> As Ian explained downthread, the paper gave two examples. The first was
>>> static seeding of duplicates; the second was dynamic, with a suggestion
>>> of a monitor which seeds additional copies based on some algorithm in
>>> response to "hot" queries (China being the topic of the example given).
>>> I am curious if anyone is aware of any papers about this second part.
>>> I can almost see a cost model where the query measures the overall cost
>>> of a query (latency, risk of latency?) and then generates copies in
>>> response.
>>> Part of this, of course, would be a recovery mechanism which removes
>>> these extra copies.
>>>
>>> W-
>>>
>>> On Tue, Sep 11, 2012 at 9:31 PM, Ted Dunning <[email protected]> wrote:
>>>
>>>> What do you mean by selective replication?
>>>>
>>>> On Tue, Sep 11, 2012 at 7:23 PM, Worthy LaFollette <[email protected]> wrote:
>>>>
>>>>> Very good paper. Am curious now about the strategies for selective
>>>>> replication, which, if done right, would make the query generation
>>>>> more efficient. Do you know of any papers on that subject?
>>>>>
>>>>> On Tue, Sep 11, 2012 at 1:37 PM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> Headed into Thursday's meetup, this paper by Jeff Dean provides a
>>>>>> very good description of strategies for getting fast response times
>>>>>> with variable-quality infrastructure.
>>>>>>
>>>>>> http://research.google.com/people/jeff/latency.html
>>>>>>
>>>>>> The key point here is that it is very important to have asynchronous
>>>>>> queries with a cancel. Above that level, there needs to be a simple
>>>>>> strategy for pushing second versions of queries out to the workers
>>>>>> and canceling defunct or redundant queries.

--
Ian Holsman
[email protected]
http://doitwithdata.com.au
PH: +61-400-988-964
Skype: iholsman
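Ted's point about asynchronous queries with a cancel (the "backup request" idea from the Dean paper) can be sketched in a few lines. This is an illustrative toy, not Drill code: `query_worker` is an invented stand-in for a real per-worker RPC, and the delays simulate network and processing time.

```python
import asyncio

# Sketch of hedged queries with a cancel: issue the query to one worker,
# and if it hasn't answered within the hedge deadline, push a second copy
# of the query to another worker. Whichever finishes first wins, and the
# redundant query is cancelled.

async def query_worker(worker, delay, result):
    await asyncio.sleep(delay)          # simulated network/processing time
    return (worker, result)

async def hedged_query(primary, backup, hedge_after=0.05):
    tasks = [asyncio.ensure_future(primary)]
    done, pending = await asyncio.wait(tasks, timeout=hedge_after)
    if not done:                        # primary is slow: hedge with a backup
        tasks.append(asyncio.ensure_future(backup))
        done, pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:                # cancel the now-redundant query
        task.cancel()
    return done.pop().result()

async def main():
    # The primary worker takes 0.5 s; the backup answers in 0.01 s,
    # so the hedged copy wins and the slow primary is cancelled.
    return await hedged_query(
        query_worker("w1", 0.5, "rows"),
        query_worker("w2", 0.01, "rows"))

print(asyncio.run(main()))
```

If the primary answers inside the hedge deadline, the backup is never sent at all, which keeps the extra load small in the common case.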
