I still am not sure we are doing this the ideal way. I am not a believer in a commit-then-review branch.
This issue is an example: https://issues.apache.org/jira/browse/HIVE-5108

I ask myself these questions: Does this currently work? Are there tests?
If so, which ones are broken? How does the patch fix them without tests to
validate?

Having a commit-then-review branch just seems subversive to our normal
process, and a quick shortcut to avoid being bothered with writing tests or
involving anyone else.

On Mon, Aug 5, 2013 at 1:54 PM, Alan Gates <ga...@hortonworks.com> wrote:

> On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:
>
> > Also watched http://www.ustream.tv/recorded/36323173
> >
> > I definitely see the win in being able to stream inter-stage output.
> >
> > I see some cases where small intermediate results can be kept "In
> > memory". But I was somewhat under the impression that the map reduce
> > spill settings kept stuff in memory, isn't that what spill settings are?
>
> No. MapReduce always writes shuffle data to local disk. And intermediate
> results between MR jobs are always persisted to HDFS, as there's no other
> option. When we talk of being able to keep intermediate results in memory
> we mean getting rid of both of these disk writes/reads when appropriate
> (meaning not always, there's a trade off between speed and error handling
> to be made here, see below for more details).
>
> > There are a few bullet points that came up repeatedly that I do not
> > follow:
> >
> > Something was said to the effect of "Container reuse makes X faster".
> > Hadoop has jvm reuse. Not following what the difference is here? Not
> > everyone has a 10K node cluster.
>
> Sharing JVMs across users is inherently insecure (we can't guarantee what
> code the first user left behind that may interfere with later users). As I
> understand container re-use in Tez it constrains the re-use to one user for
> security reasons, but still avoids additional JVM start up costs.
> But this is a question that the Tez guys could answer better on the Tez
> lists (d...@tez.incubator.apache.org)
>
> > "Joins in map reduce are hard" Really? I mean some of them are I guess,
> > but the typical join is very easy. Just shuffle by the join key. There
> > were not really enough low-level details here saying why joins are
> > better in tez.
>
> Join is not a natural operation in MapReduce. MR gives you one input and
> one output. You end up having to bend the rules to have multiple inputs.
> The idea here is that Tez can provide operators that naturally work with
> joins and other operations that don't fit the one input/one output model
> (eg unions, etc.).
>
> > "Choosing the number of maps and reduces is hard" Really? I do not find
> > it that hard, I think there are times when it's not perfect but I do not
> > find it hard. The talk did not really offer anything here technical on
> > how tez makes this better other than that it could make it better.
>
> Perhaps manual would be a better term here than hard. In our experience
> it takes quite a bit of engineer trial and error to determine the optimal
> numbers. This may be ok if you're going to invest the time once and then
> run the same query every day for 6 months. But obviously it doesn't work
> for the ad hoc case. Even in the batch case it's not optimal because every
> once in a while an engineer has to go back and re-optimize the query to
> deal with changing data sizes, data characteristics, etc. We want the
> optimizer to handle this without human intervention.
>
> > The presentations mentioned streaming data, how do two nodes stream data
> > between tasks and how is it reliable? If the sender or receiver dies
> > does the entire process have to start again?
>
> If the sender or receiver dies then the query has to be restarted from
> some previous point where data was persisted to disk.
> The idea here is that speed vs error recovery trade offs should be made by
> the optimizer. If the optimizer estimates that a query will complete in 5
> seconds it can stream everything and if a node fails it just re-runs the
> whole query. If it estimates that a particular phase of a query will run
> for an hour it can choose to persist the results to HDFS so that in the
> event of a failure downstream the long phase need not be re-run. Again we
> want this to be done automatically by the system so the user doesn't need
> to control this level of detail.
>
> > Again one of the talks implied there is a prototype out there that
> > launches hive jobs into tez. I would like to see that, it might answer
> > more questions than a power point, and I could profile some common
> > queries.
>
> As mentioned in a previous email afaik Gunther's pushed all these changes
> to the Tez branch in Hive.
>
> Alan.
>
> > Random late night thoughts over,
> > Ed
> >
> > On Tue, Jul 30, 2013 at 12:02 AM, Edward Capriolo
> > <edlinuxg...@gmail.com> wrote:
> >
> >> At ~25:00
> >>
> >> "There is a working prototype of hive which is using tez as the targeted
> >> runtime"
> >>
> >> Can I get a look at that code? Is it on github?
> >>
> >> Edward
> >>
> >> On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <ga...@hortonworks.com>
> >> wrote:
> >>
> >>> Answers to some of your questions inlined.
> >>>
> >>> Alan.
> >>>
> >>> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
> >>>
> >>>> There are some points I want to bring up. First, I am on the PMC. Here
> >>>> is something I find relevant:
> >>>>
> >>>> http://www.apache.org/foundation/how-it-works.html
> >>>>
> >>>> ------------------------------
> >>>>
> >>>> The role of the PMC from a Foundation perspective is oversight.
> >>>> The main role of the PMC is not code and not coding - but to ensure
> >>>> that all legal issues are addressed, that procedure is followed, and
> >>>> that each and every release is the product of the community as a
> >>>> whole. That is key to our litigation protection mechanisms.
> >>>>
> >>>> Secondly the role of the PMC is to further the long term development
> >>>> and health of the community as a whole, and to ensure that balanced
> >>>> and wide scale peer review and collaboration does happen. Within the
> >>>> ASF we worry about any community which centers around a few
> >>>> individuals who are working virtually uncontested. We believe that
> >>>> this is detrimental to quality, stability, and robustness of both code
> >>>> and long term social structures.
> >>>>
> >>>> --------------------------------
> >>>>
> >>>> https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> >>>>
> >>>> -------------------------------------
> >>>>
> >>>> All other decisions happen on the dev list, discussions on the private
> >>>> list are kept to a minimum.
> >>>>
> >>>> "If it didn't happen on the dev list, it didn't happen" - which leads
> >>>> to:
> >>>>
> >>>> a) Elections of committers and PMC members are published on the dev
> >>>> list once finalized.
> >>>>
> >>>> b) Out-of-band discussions (IRC etc.) are summarized on the dev list
> >>>> as soon as they have impact on the project, code or community.
> >>>>
> >>>> ---------------------------------
> >>>>
> >>>> https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let
> >>>> their be Tez" has not been +1'ed by any committer. It was never
> >>>> discussed on the dev or the user list (as far as I can tell).
> >>>
> >>> As all JIRA creations and updates are sent to dev@hive, creating a JIRA
> >>> is de facto posting to the list.
> >>>
> >>>> As a PMC member I feel we need more discussion on Tez on the dev list
> >>>> along with a wiki-fied design document. Topics of discussion should
> >>>> include:
> >>>
> >>> I talked with Gunther and he's working on posting a design doc on the
> >>> wiki. He has a PDF on the JIRA but he doesn't have write permissions
> >>> yet on the wiki.
> >>>
> >>>> 1) What is tez?
> >>>
> >>> In Hadoop 2.0, YARN opens up the ability to have multiple execution
> >>> frameworks in Hadoop. Hadoop apps are no longer tied to MapReduce as
> >>> the only execution option. Tez is an effort to build an execution
> >>> engine that is optimized for relational data processing, such as Hive
> >>> and Pig.
> >>>
> >>> The biggest change here is to move away from only Map and Reduce as
> >>> processing options and to allow alternate combinations of processing,
> >>> such as map -> reduce -> reduce or tasks that take multiple inputs or
> >>> shuffles that avoid sorting when it isn't needed.
> >>>
> >>> For a good intro to Tez, see Arun's presentation on it at the recent
> >>> Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> >>> http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212 )
> >>>
> >>>> 2) How is tez different from oozie, http://code.google.com/p/hop/,
> >>>> http://cs.brown.edu/~backman/cmr.html , and other DAG and or streaming
> >>>> map reduce tools/frameworks? Why should we use this and not those?
> >>>
> >>> Oozie is a completely different thing. Oozie is a workflow engine and a
> >>> scheduler. Its core competencies are the ability to coordinate
> >>> workflows of disparate job types (MR, Pig, Hive, etc.) and to schedule
> >>> them. It is not intended as an execution engine for apps such as Pig
> >>> and Hive.
> >>>
> >>> I am not familiar with these other engines, but the short answer is
> >>> that Tez is built to work on YARN, which works well for Hive since it
> >>> is tied to Hadoop.
> >>>
> >>>> 3) When can we expect the first tez release?
> >>>
> >>> I don't know, but I hope sometime this fall.
> >>>
> >>>> 4) How much effort is involved in integrating hive and tez?
> >>>
> >>> Covered in the design doc.
> >>>
> >>>> 5) Who is ready to commit to this effort?
> >>>
> >>> I'll let people speak for themselves on that one.
> >>>
> >>>> 6) can we expect this work to be done in one hive release?
> >>>
> >>> Unlikely. Initial integration will be done in one release, but as Tez
> >>> is a new project I expect it will be adding features in the future that
> >>> Hive will want to take advantage of.
> >>>
> >>>> In my opinion we should not start any work on this tez-hive until
> >>>> these questions are answered to the satisfaction of the hive
> >>>> developers.
> >>>
> >>> Can we change this to "not commit patches"? We can't tell willing
> >>> people not to work on it.
> >>>
> >>>> On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo
> >>>> <edlinuxg...@gmail.com> wrote:
> >>>>
> >>>>> >> The Hive bylaws,
> >>>>> >> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out
> >>>>> >> what votes are needed for what. I don't see anything there about
> >>>>> >> needing 3 +1s for a branch. Branching would seem to fall under
> >>>>> >> code change, which requires one vote and a minimum length of 1
> >>>>> >> day.
> >>>>>
> >>>>> You could argue that all you need is one +1 to create a branch, but
> >>>>> this is more than a branch. If you are talking about something that
> >>>>> is:
> >>>>> 1) going to cause major re-factoring of critical pieces of hive like
> >>>>> ExecDriver and MapRedTask
> >>>>> 2) going to be very disruptive to the efforts of other committers
> >>>>> 3) something that may be a major architectural change
> >>>>>
> >>>>> Getting the project on board with the idea is a good idea.
> >>>>>
> >>>>> Now I want to point something out.
> >>>>> Here are some recent initiatives in hive:
> >>>>>
> >>>>> 1) At one point there was a big initiative to "support oracle"; after
> >>>>> the initial work, there are patches in Jira and no one seems to care
> >>>>> about oracle support.
> >>>>> 2) Another such decision was this "support windows" one, there are
> >>>>> probably 4 windows patches waiting for review.
> >>>>> 3) I still have no clue what the official hadoop1, hadoop2, hadoop
> >>>>> 0.23 support perspective is, but every couple weeks we get another
> >>>>> jira about something not working/testing on one of those versions,
> >>>>> seems like several builds are broken.
> >>>>> 4) Hive-storage handler, after the initial implementation no one
> >>>>> cares to review any other storage handler implementation, 3 patches
> >>>>> there or more, could not even find anyone willing to review the
> >>>>> cassandra storage handler I spent months on.
> >>>>> 5) OCR, Vectorization
> >>>>> 6) Windowing: committed, numerous check-style violations.
> >>>>>
> >>>>> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active
> >>>>> committers. We are spread very thin, and embarking on another side
> >>>>> project not involved with core hive seems like the wrong direction at
> >>>>> the moment.
> >>>>>
> >>>>> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <ga...@hortonworks.com>
> >>>>> wrote:
> >>>>>
> >>>>>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> >>>>>>
> >>>>>>> I have started to see several refactoring patches around tez.
> >>>>>>> https://issues.apache.org/jira/browse/HIVE-4843
> >>>>>>>
> >>>>>>> This is the only mention on the hive list I can find with tez:
> >>>>>>> "Makes sense. I will create the branch soon.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ashutosh
> >>>>>>>
> >>>>>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner
> >>>>>>> <ghagleit...@hortonworks.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660,
> >>>>>>>> design doc has already been uploaded - any feedback will be much
> >>>>>>>> appreciated). This will be a fair amount of work that will take
> >>>>>>>> time to stabilize/test. I'd like to propose creating a branch in
> >>>>>>>> order to be able to do this incrementally and collaboratively. In
> >>>>>>>> order to progress rapidly with this, I would also like to go
> >>>>>>>> "commit-then-review".
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Gunther.
> >>>>>>>> "
> >>>>>>>
> >>>>>>> These refactorings are largely destructive to a number of bug fixes
> >>>>>>> and language improvements in hive. The language improvements and
> >>>>>>> bug fixes have been sitting in Jira for quite some time now, marked
> >>>>>>> patch-available and waiting for review.
> >>>>>>>
> >>>>>>> There are a few things I want to point out:
> >>>>>>> 1) Normally we create design docs in our wiki (which it is not)
> >>>>>>> 2) Normally when the change is significantly complex we get
> >>>>>>> multiple committers to comment on it (which we did not)
> >>>>>>> On point 2 no one -1'd the branch, but this is really something
> >>>>>>> that should have required a +1 from 3 committers.
> >>>>>>
> >>>>>> The Hive bylaws,
> >>>>>> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out
> >>>>>> what votes are needed for what. I don't see anything there about
> >>>>>> needing 3 +1s for a branch. Branching would seem to fall under code
> >>>>>> change, which requires one vote and a minimum length of 1 day.
> >>>>>>
> >>>>>>> I for one am not completely sold on Tez.
> >>>>>>> http://incubator.apache.org/projects/tez.html.
> >>>>>>> "directed-acyclic-graph of tasks for processing data" this
> >>>>>>> description sounds like many things which have never become
> >>>>>>> popular. One to think of is oozie "Oozie Workflow jobs are Directed
> >>>>>>> Acyclical Graphs (DAGs) of actions.". I am sure I can find a number
> >>>>>>> of libraries/frameworks that make this same claim. In general I do
> >>>>>>> not feel like we have done our homework and pre-requisites to
> >>>>>>> justify all this work. If we have done the homework, I am sure that
> >>>>>>> it has not been communicated and accepted by hive developers at
> >>>>>>> large.
> >>>>>>
> >>>>>> A request for better documentation on Tez and a project road map
> >>>>>> seems totally reasonable.
> >>>>>>
> >>>>>>> If we have a branch, why are we also committing on trunk? Scanning
> >>>>>>> through the tez doc I keep finding language like "minimal changes
> >>>>>>> to the planner", yet there are ALREADY lots of large changes going
> >>>>>>> on!
> >>>>>>>
> >>>>>>> Really none of the above would bother me except for the fact that
> >>>>>>> these "minimal changes" are causing many "patch available"
> >>>>>>> ready-for-review bugs and core hive features to need to be rebased.
> >>>>>>>
> >>>>>>> I am sure I have mentioned this before, but I have to spend 12+
> >>>>>>> hours to test a single patch on my laptop. A few days ago I was
> >>>>>>> testing a new core hive feature. After all the tests passed and
> >>>>>>> before I was able to commit, someone unleashed a tez patch on trunk
> >>>>>>> which caused the thing I was testing for 12 hours to need to be
> >>>>>>> rebased.
> >>>>>>>
> >>>>>>> I'm not cool with this. Next time that happens to me I will
> >>>>>>> seriously consider reverting the patch. Bug fixes and new hive
> >>>>>>> features are more important to me than integrating with incubator
> >>>>>>> projects.
> >>>>>>
> >>>>>> (With my Apache member hat on) Reverting patches that aren't
> >>>>>> breaking the build is considered very bad form in Apache. It does
> >>>>>> make sense to request that when people are going to commit a patch
> >>>>>> that will break many other patches they first give a few hours of
> >>>>>> notice so people can say something if they're about to commit
> >>>>>> another patch and avoid your fate of needing to rerun the tests. The
> >>>>>> other thing is we need to get the automated build of patches working
> >>>>>> on Hive so committers aren't forced to run all of the tests
> >>>>>> themselves. We are working on it, but we're not there yet.
> >>>>>>
> >>>>>> Alan.