You might start with:

https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/data-stream-mgr.h
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/data-stream-sender.h
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/runtime/data-stream-recvr.h
https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/exchange-node.h

"Volcano: an extensible and parallel query evaluation system":
http://digitalcommons.ohsu.edu/cgi/viewcontent.cgi?article=1191&context=csetech

"Impala: A Modern, Open-Source SQL Engine for Hadoop":
http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf
http://www.cidrdb.org/cidr2015/Slides/28_CIDR15_Slides_Paper28.pdf

Speaking for myself, I would like to see and understand more about your
multi-query modifications (design documents, benchmarks, code). This will
affect how I feel about (a) how Impala benefits and (b) whether any changes
are sufficiently risky to justify separate branching.

On Thu, Mar 17, 2016 at 5:29 AM, Jim Apple <[email protected]> wrote:

> +cc:[email protected]
>
> On Wed, Mar 16, 2016 at 10:38 PM, 林言 <[email protected]> wrote:
>
>> We know that each plan fragment has only one destination node in Impala.
>> Now we want to send the intermediate results of this fragment to more than
>> one destination node. But we're only familiar with the data structures and
>> execution flow in the frontend. So we wonder what we should modify in
>> the thrift and backend to make it work.
>> Can you share some design documents, so we can learn more design details of
>> Impala?
>> If you are interested in multi-query adaptation in Impala, would you like
>> to work with us in a new branch of Impala?
>>
>>
>> ------------------------------
>> Yan Lin
>>
>>
>> *From:* Jim Apple <[email protected]>
>> *Date:* 2016-03-17 01:06
>> *To:* Impala Dev <[email protected]>
>> *CC:* bbbbaai <[email protected]>
>> *Subject:* Re: About Cooperating For A Better Impala
>> I'm sure everyone will be delighted to have more communication and
>> cooperation, including reading the papers and the code. Can you share those
>> today, or is that part of the "puzzle" of "sharing intermediate results"?
>> Is there anything we can do to help with your puzzlement?
>>
>> On Wednesday, March 16, 2016 at 12:11:05 AM UTC-7, 林言 wrote:
>>>
>>> Dear Sir/Madam:
>>> Hello! I am Yan Lin, a master's candidate at ZheJiang
>>> University (CHN) in the laboratory "PCL" (http://percom.zju.edu.cn/). Our
>>> lab has done a lot of work on Impala, as follows:
>>> 1. We proposed an Impala query optimization method
>>> based on bushy trees and an IMPROVED-MCCHYP algorithm [1], and we
>>> implemented our method and algorithm in Impala.
>>> 2. We proposed a replication-selection based scheduling
>>> algorithm and implemented it in Impala [2].
>>> 3. Some of my colleagues are now developing a simulator of
>>> Impala called ImpalaSim and writing the corresponding paper [3].
>>> Recently, we have focused on multi-query optimization, which
>>> exploits
>>> common sub-expressions of batched queries and improves efficiency. We
>>> have modified some source code, and the modified Impala can already
>>> execute multiple queries in the same query context. But we are still
>>> puzzled about sharing intermediate results. We hope for more
>>> communication and cooperation in every aspect. We all want a better
>>> Impala.
>>> Thank you for your attention! Hope to hear from you soon!
>>>
>>>
>>> Yours Sincerely,
>>>
>>>
>>> Yan Lin
>>>
>>> Reference:
>>> [1] Bushy Tree and Improved-McCHyp Algorithm Based Impala Query Optimization
>>> [2] Replication-Selection based Scheduling for Impala Parallel Query Execution
>>> [3] ImpalaSim: Discrete Event Simulation Platform for Impala System
>>> ------------------------------
>>> Yan Lin
