Re: Tez branch and tez based patches

Ashutosh Chauhan Wed, 17 Jul 2013 17:44:08 -0700

On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:


>
> "In my opinion we should limit the amount of tez related optimizations to
> and trunk" Refactoring that cleans up code is good, but as you have pointed
> out there won't be a tez release until sometime this fall, and this branch
> will be open for an extended period of time. Thus code cleanups and other
> tez related refactoring does not need to be disruptive to trunk.


I agree that Tez-specific changes need not go into trunk. But general
refactoring and code cleanup need to happen on trunk as and when someone
is willing to work on them. We have to continually improve our code
quality. Code maintainability and readability are a priority. Without them,
code quality suffers, and new contributors are discouraged from contributing
because the code is unnecessarily complicated. SemanticAnalyzer is an
11K-line class. We need to simplify it. A patch like HIVE-4811, which
tackles that, is a welcome change. The exec package is convoluted: it mixes
up runtime operators and the drivers for the runtime. A patch that cleans
this up is welcome because it makes that piece of code much easier to read
and reason about. HIVE-4825 is another example, which improves the
modularity of the code. Contributors who are exposed to Hive for the first
time will find the code easier to follow.

Rather than being disruptive to trunk, these patches are constructive for
it, and I am glad people are choosing to work on them. Tez or no Tez, Hive
is better off with these patches.

Thanks,
Ashutosh



>  On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <ga...@hortonworks.com>
> wrote:
>
> > Answers to some of your questions inlined.
> >
> > Alan.
> >
> > On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
> >
> > > There are some points I want to bring up. First, I am on the PMC. Here
> > > is something I find relevant:
> > >
> > > http://www.apache.org/foundation/how-it-works.html
> > >
> > > ------------------------------
> > >
> > > The role of the PMC from a Foundation perspective is oversight. The
> > > main role of the PMC is not code and not coding - but to ensure that
> > > all legal issues are addressed, that procedure is followed, and that
> > > each and every release is the product of the community as a whole.
> > > That is key to our litigation protection mechanisms.
> > >
> > > Secondly the role of the PMC is to further the long term development
> > > and health of the community as a whole, and to ensure that balanced
> > > and wide scale peer review and collaboration does happen. Within the
> > > ASF we worry about any community which centers around a few
> > > individuals who are working virtually uncontested. We believe that
> > > this is detrimental to quality, stability, and robustness of both code
> > > and long term social structures.
> > >
> > > --------------------------------
> > >
> > >
> > > https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> > >
> > > -------------------------------------
> > >
> > > All other decisions happen on the dev list, discussions on the private
> > > list are kept to a minimum.
> > >
> > > "If it didn't happen on the dev list, it didn't happen" - which leads
> > > to:
> > >
> > > a) Elections of committers and PMC members are published on the dev
> > > list once finalized.
> > >
> > > b) Out-of-band discussions (IRC etc.) are summarized on the dev list as
> > > soon as they have impact on the project, code or community.
> > > ---------------------------------
> > >
> > > https://issues.apache.org/jira/browse/HIVE-4660 ironically titled "Let
> > > their be Tez" has not been +1'ed by any committer. It was never
> > > discussed on the dev or the user list (as far as I can tell).
> >
> > As all JIRA creations and updates are sent to dev@hive, creating a JIRA
> > is de facto posting to the list.
> >
> > >
> > > As a PMC member I feel we need more discussion on Tez on the dev list
> > > along with a wiki-fied design document. Topics of discussion should
> > > include:
> >
> > I talked with Gunther and he's working on posting a design doc on the
> > wiki.  He has a PDF on the JIRA but he doesn't have write permissions yet
> > on the wiki.
> >
> > >
> > > 1) What is tez?
> > In Hadoop 2.0, YARN opens up the ability to have multiple execution
> > frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as the
> > only execution option.  Tez is an effort to build an execution engine
> > that is optimized for relational data processing, such as Hive and Pig.
> >
> > The biggest change here is to move away from only Map and Reduce as
> > processing options and to allow alternate combinations of processing,
> > such as map -> reduce -> reduce or tasks that take multiple inputs or
> > shuffles that avoid sorting when it isn't needed.
> >
> > For a good intro to Tez, see Arun's presentation on it at the recent
> > Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> > http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212)
> > >
> > > 2) How is tez different from oozie, http://code.google.com/p/hop/,
> > > http://cs.brown.edu/~backman/cmr.html , and other DAG and/or streaming
> > > map reduce tools/frameworks? Why should we use this and not those?
> >
> > Oozie is a completely different thing.  Oozie is a workflow engine and a
> > scheduler.  Its core competencies are the ability to coordinate
> > workflows of disparate job types (MR, Pig, Hive, etc.) and to schedule
> > them.  It is not intended as an execution engine for apps such as Pig
> > and Hive.
> >
> > I am not familiar with these other engines, but the short answer is that
> > Tez is built to work on YARN, which works well for Hive since it is tied
> > to Hadoop.
> > >
> > > 3) When can we expect the first tez release?
> > I don't know, but I hope sometime this fall.
> >
> > >
> > > 4) How much effort is involved in integrating hive and tez?
> > Covered in the design doc.
> >
> > >
> > > 5) Who is ready to commit to this effort?
> > I'll let people speak for themselves on that one.
> >
> > >
> > > 6) can we expect this work to be done in one hive release?
> > Unlikely.  Initial integration will be done in one release, but as Tez is
> > a new project I expect it will be adding features in the future that Hive
> > will want to take advantage of.
> >
> > >
> > > In my opinion we should not start any work on this tez-hive until these
> > > questions are answered to the satisfaction of the hive developers.
> >
> > Can we change this to "not commit patches"?  We can't tell willing people
> > not to work on it.
> > >
> > >
> > > On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> > >
> > >>
> > >> >> The Hive bylaws,
> > >> >> https://cwiki.apache.org/confluence/display/Hive/Bylaws , lay out
> > >> >> what votes are needed for what.  I don't see anything there about
> > >> >> needing 3 +1s for a branch.  Branching would seem to fall under code
> > >> >> change, which requires one vote and a minimum length of 1 day.
> > >>
> > >> You could argue that all you need is one +1 to create a branch, but
> > >> this is more than a branch. If you are talking about something that is:
> > >> 1) going to cause major re-factoring of critical pieces of hive like
> > >> ExecDriver and MapRedTask
> > >> 2) going to be very disruptive to the efforts of other committers
> > >> 3) something that may be a major architectural change
> > >>
> > >> Getting the project on board with the idea is a good idea.
> > >>
> > >> Now I want to point something out. Here are some recent initiatives in
> > >> hive:
> > >>
> > >> 1) At one point there was a big initiative to "support oracle". After
> > >> the initial work, there are patches in Jira and no one seems to care
> > >> about oracle support.
> > >> 2) Another such decision was the "support windows" one; there are
> > >> probably 4 windows patches waiting for review.
> > >> 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
> > >> support perspective is, but every couple of weeks we get another jira
> > >> about something not working/testing on one of those versions; it seems
> > >> like several builds are broken.
> > >> 4) Hive storage handler: after the initial implementation no one cares
> > >> to review any other storage handler implementation. There are 3
> > >> patches there or more, and I could not even find anyone willing to
> > >> review the cassandra storage handler I spent months on.
> > >> 5) OCR, Vectorization
> > >> 6) Windowing: committed, numerous check-style violations.
> > >>
> > >> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers.
> > >> We are spread very thin, and embarking on another side project not
> > >> involved with core hive seems like the wrong direction at the moment.
> > >>
> > >>
> > >> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <ga...@hortonworks.com>
> > >> wrote:
> > >>
> > >>>
> > >>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> > >>>
> > >>>> I have started to see several refactoring patches around tez.
> > >>>> https://issues.apache.org/jira/browse/HIVE-4843
> > >>>>
> > >>>> This is the only mention on the hive list I can find with tez:
> > >>>> "Makes sense. I will create the branch soon.
> > >>>>
> > >>>> Thanks,
> > >>>> Ashutosh
> > >>>>
> > >>>>
> > >>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
> > >>>> ghagleit...@hortonworks.com> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660,
> > >>>>> design doc has already been uploaded - any feedback will be much
> > >>>>> appreciated). This will be a fair amount of work that will take
> > >>>>> time to stabilize/test. I'd like to propose creating a branch in
> > >>>>> order to be able to do this incrementally and collaboratively. In
> > >>>>> order to progress rapidly with this, I would also like to go
> > >>>>> "commit-then-review".
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Gunther.
> > >>>>> "
> > >>>>
> > >>>> These refactorings are largely destructive to a number of bug fixes
> > >>>> and language improvements in hive. Those language improvements and
> > >>>> bug fixes have been sitting in Jira for quite some time now, marked
> > >>>> patch-available and waiting for review.
> > >>>>
> > >>>> There are a few things I want to point out:
> > >>>> 1) Normally we create design docs in our wiki (which it is not)
> > >>>> 2) Normally when the change is significantly complex we get multiple
> > >>>> committers to comment on it (which we did not)
> > >>>> On point 2, no one -1'd the branch, but this is really something
> > >>>> that should have required a +1 from 3 committers.
> > >>>
> > >>> The Hive bylaws,
> > >>> https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what
> > >>> votes are needed for what.  I don't see anything there about
> > >>> needing 3 +1s for a branch.  Branching would seem to fall under code
> > >>> change, which requires one vote and a minimum length of 1 day.
> > >>>
> > >>>>
> > >>>> I for one am not completely sold on Tez.
> > >>>> http://incubator.apache.org/projects/tez.html.
> > >>>> "directed-acyclic-graph of tasks for processing data" this
> > >>>> description sounds like many things which have never become popular.
> > >>>> One to think of is oozie "Oozie Workflow jobs are Directed Acyclical
> > >>>> Graphs (DAGs) of actions.". I am sure I can find a number of
> > >>>> libraries/frameworks that make this same claim. In general I do not
> > >>>> feel like we have done our homework and pre-requisites to justify
> > >>>> all this work. If we have done the homework, I am sure that it has
> > >>>> not been communicated and accepted by hive developers at large.
> > >>>
> > >>> A request for better documentation on Tez and a project road map
> > >>> seems totally reasonable.
> > >>>
> > >>>>
> > >>>> If we have a branch, why are we also committing on trunk? Scanning
> > >>>> through the tez doc, I keep finding language like "minimal changes
> > >>>> to the planner", yet there are ALREADY lots of large changes going on!
> > >>>>
> > >>>> Really none of the above would bother me except for the fact that
> > >>>> these "minimal changes" are causing many "patch available"
> > >>>> ready-for-review bugs and core hive features to need to be rebased.
> > >>>>
> > >>>> I am sure I have mentioned this before, but I have to spend 12+
> > >>>> hours to test a single patch on my laptop. A few days ago I was
> > >>>> testing a new core hive feature. After all the tests passed and
> > >>>> before I was able to commit, someone unleashed a tez patch on trunk
> > >>>> which caused the thing I was testing for 12 hours to need to be
> > >>>> rebased.
> > >>>>
> > >>>>
> > >>>> I'm not cool with this. Next time that happens to me I will seriously
> > >>>> consider reverting the patch. Bug fixes and new hive features are
> > >>>> more important to me than integrating with incubator projects.
> > >>>
> > >>> (With my Apache member hat on)  Reverting patches that aren't
> > >>> breaking the build is considered very bad form in Apache.  It does
> > >>> make sense to request that when people are going to commit a patch
> > >>> that will break many other patches they first give a few hours of
> > >>> notice so people can say something if they're about to commit another
> > >>> patch and avoid your fate of needing to rerun the tests.  The other
> > >>> thing is we need to get the automated build of patches working on
> > >>> Hive so committers are forced to run all of the tests themselves.  We
> > >>> are working on it, but we're not there yet.
> > >>>
> > >>> Alan.
> > >>>
> > >>>
> > >>
> >
> >
>
