Oops, meant to include a reference as an example of streaming algorithms: https://github.com/clearspring/stream-lib
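
As a rough illustration of the local top-N-then-merge idea quoted below, here is a minimal Python sketch. It is not Drill code: the function names `local_top_n` and `merge_top_n` and the per-node sample data are made up for the example, and it only shows the general shape of the approximation.

```python
from collections import Counter

def local_top_n(rows, n):
    """Count keys seen on one node and keep only the n most frequent."""
    return Counter(rows).most_common(n)

def merge_top_n(partials, k):
    """Sum the surviving per-node counts and return the overall top k."""
    merged = Counter()
    for partial in partials:
        for key, count in partial:
            merged[key] += count
    return merged.most_common(k)

# Hypothetical values of attr1 as seen on three nodes.
node_a = ["x"] * 5 + ["y"] * 3 + ["z"] * 1
node_b = ["y"] * 3 + ["z"] * 2 + ["x"] * 1
node_c = ["z"] * 6 + ["x"] * 2 + ["w"] * 1
nodes = (node_a, node_b, node_c)

# Approximate: each node ships only its local top 2, then we merge.
approx = merge_top_n([local_top_n(rows, n=2) for rows in nodes], k=2)

# Exact: count everything in one place (what a full shuffle would give).
exact = Counter(node_a + node_b + node_c).most_common(2)

print("approx:", approx)
print("exact: ", exact)
```

With this data the approximate and exact answers agree on which keys are in the top 2, but the approximate counts come out slightly low because each node discarded its tail. A larger local N narrows that gap at the cost of shipping more rows.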
On Fri, Apr 5, 2013 at 8:34 AM, Jacques Nadeau <[email protected]> wrote:

> The current thinking is that there will be an approximate query flag.
> This will be useful in situations where parallel approximations can be
> made. The simplest example is you want a top 10 group by attr1. You can
> do a local top N group by attr1 and then merge those results. While not
> exactly right, it can be statistically accurate based on the right choice
> of N. There are also parallel approximations for other things such as
> median using streaming algorithms. The goal is for Drill to be able to use
> these approximation algorithms in a processing tree for more queries. In
> the case that a user needs exact results, full shuffle/aggregations will
> still need to be done. They will still benefit from avoiding the various
> MapReduce barriers and requirements for persistence between stages.
>
> J
>
> On Thu, Apr 4, 2013 at 10:31 PM, devansh kumar <[email protected]> wrote:
>
>> Hi,
>>
>> I understood what you said about using SUM and COUNT to calculate
>> AVERAGE.
>> But as I understand it, this will work very well with distributed
>> operations. What about operations like median?
>>
>> Also, I wanted to ask how the query will be broken up in the execution
>> engine.
>> I have gone through the Apache Drill documentation and also the Google
>> Dremel paper, and I am still confused about how multiple levels of
>> aggregation will be created inside one tree.
>>
>> Thanks!
>>
>> ________________________________
>> From: devansh kumar <[email protected]>
>> To: Andrew Brust <[email protected]>;
>> "[email protected]" <[email protected]>;
>> "[email protected]" <[email protected]>
>> Sent: Friday, April 5, 2013 10:18 AM
>> Subject: Re: Basic queries regarding Apache Drill working
>>
>> Hi,
>>
>> As Andrew asked, how will average work without a Reduce operation
>> present?
>> Can you explain more about how the data will be aggregated?
>>
>> ________________________________
>> From: Andrew Brust <[email protected]>
>> To: "[email protected]" <[email protected]>;
>> devansh kumar <[email protected]>
>> Sent: Thursday, April 4, 2013 8:00 PM
>> Subject: RE: Basic queries regarding Apache Drill working
>>
>> Still not sure I follow (and pardon what must be a very rudimentary
>> misunderstanding on my part) how you get an average across a data set if
>> the data is split across nodes. With MapReduce, the reducer can get it
>> because all data for a given key is kept to one node. How would this work
>> with Drill?
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:[email protected]]
>> Sent: Thursday, April 4, 2013 9:27 AM
>> To: [email protected]; devansh kumar
>> Subject: Re: Basic queries regarding Apache Drill working
>>
>> On Thu, Apr 4, 2013 at 12:27 PM, devansh kumar <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I am new and am trying to understand how Apache Drill works, but I
>> > have a few queries.
>> > Can anyone help me understand these things?
>> >
>> > 1.
>> > I am trying to understand if the execution engine is going to break up
>> > the data.
>>
>> Normally the data will already have been broken up across a cluster.
>>
>> > What will happen if I am trying to do an aggregation operation like
>> > AVERAGE?
>> > How will that work?
>>
>> Yes.
>>
>> > I have seen operations such as SUM and COUNT.
>> > What will the query execution tree look like in the case of an AVERAGE?
>>
>> It will look exactly like a SUM or COUNT except that two numbers will be
>> accumulated instead of one.
>>
>> > 2.
>> > Is the resource model optimized compared to MapReduce?
>>
>> Yes. This will happen because multiple levels of aggregation can be done
>> in one tree without the barrier between map and reduce
>> imposed by the MapReduce structure.
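
Ted's point that AVERAGE looks like SUM or COUNT with two numbers accumulated instead of one can be sketched in a few lines of Python. This is only an illustration under assumed names (`partial_avg`, `combine`, `finalize`) and made-up per-node data, not how Drill actually implements it. Each node reduces its local values to a (sum, count) pair, and the pairs can be merged at any level of the tree because the merge is associative.

```python
def partial_avg(values):
    """Leaf step: reduce one node's local values to a (sum, count) pair."""
    return (sum(values), len(values))

def combine(a, b):
    """Merge two (sum, count) pairs. Associative, so usable at any tree level."""
    return (a[0] + b[0], a[1] + b[1])

def finalize(pair):
    """Root step: turn the accumulated (sum, count) into the average."""
    total, count = pair
    return total / count if count else None

# Hypothetical column values held on three different nodes.
leaves = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]

# Level 0: every node aggregates locally.
partials = [partial_avg(values) for values in leaves]

# Level 1: an intermediate node merges the first two partial results.
mid = combine(partials[0], partials[1])

# Level 2: the root merges what is left and computes the final answer.
root = combine(mid, partials[2])
avg = finalize(root)

print(root, avg)
```

Averaging the per-node averages directly would give the wrong answer when nodes hold different numbers of rows, which is why both the sum and the count are carried up the tree. Median has no such small mergeable state, which is why the approximate streaming approaches come up for it.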
