Re: Tez branch and tez based patches

Edward Capriolo Fri, 16 Aug 2013 06:14:30 -0700

I still am not sure we are doing this the ideal way. I am not a believer in
a commit-then-review branch.


This issue is an example.

https://issues.apache.org/jira/browse/HIVE-5108

I ask myself these questions:
Does this currently work? Are there tests? If so, which ones are broken? How
does the patch fix them without tests to validate it?

Having a commit-then-review branch just seems subversive to our normal
process, and a quick shortcut to avoid being bothered with writing tests
or involving anyone else.



On Mon, Aug 5, 2013 at 1:54 PM, Alan Gates <[email protected]> wrote:

>
> On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:
>
> > Also watched http://www.ustream.tv/recorded/36323173
> >
> > I definitely see the win in being able to stream inter-stage output.
> >
> > I see some cases where small intermediate results can be kept "In memory".
> > But I was somewhat under the impression that the map reduce spill settings
> > kept stuff in memory, isn't that what spill settings are?
>
> No.  MapReduce always writes shuffle data to local disk.  And intermediate
> results between MR jobs are always persisted to HDFS, as there's no other
> option.  When we talk of being able to keep intermediate results in memory
> we mean getting rid of both of these disk writes/reads when appropriate
> (meaning not always, there's a trade off between speed and error handling
> to be made here, see below for more details).
>
> >
> > There are a few bullet points that came up repeatedly that I do not
> > follow:
> >
> > Something was said to the effect of "Container reuse makes X faster".
> > Hadoop has jvm reuse. Not following what the difference is here? Not
> > everyone has a 10K node cluster.
>
> Sharing JVMs across users is inherently insecure (we can't guarantee what
> code the first user left behind that may interfere with later users).  As I
> understand container re-use in Tez it constrains the re-use to one user for
> security reasons, but still avoids additional JVM start up costs.  But this
> is a question that the Tez guys could answer better on the Tez lists (
> [email protected])
>
> >
> > "Joins in map reduce are hard" Really? I mean some of them are I guess, but
> > the typical join is very easy. Just shuffle by the join key. There were not
> > really enough low-level details here saying why joins are better in tez.
>
> Join is not a natural operation in MapReduce.  MR gives you one input and
> one output.  You end up having to bend the rules to have multiple
> inputs.  The idea here is that Tez can provide operators that naturally
> work with joins and other operations that don't fit the one input/one
> output model (eg unions, etc.).
>
> >
> > "Choosing the number of maps and reduces is hard" Really? I do not find it
> > that hard. I think there are times when it's not perfect, but I do not find
> > it hard. The talk did not really offer anything technical here on how tez
> > makes this better, other than that it could make it better.
>
> Perhaps manual would be a better term here than hard.  In our experience
> it takes quite a bit of engineer trial and error to determine the optimal
> numbers.  This may be ok if you're going to invest the time once and then
> run the same query every day for 6 months.  But obviously it doesn't work
> for the ad hoc case.  Even in the batch case it's not optimal because every
> once in a while an engineer has to go back and re-optimize the query to
> deal with changing data sizes, data characteristics, etc.  We want the
> optimizer to handle this without human intervention.
>
> >
> > The presentations mentioned streaming data. How do two nodes stream data
> > between tasks, and how is it reliable? If the sender or receiver dies, does
> > the entire process have to start again?
>
> If the sender or receiver dies then the query has to be restarted from
> some previous point where data was persisted to disk.  The idea here is
> that speed vs error recovery trade offs should be made by the optimizer.
>  If the optimizer estimates that a query will complete in 5 seconds it can
> stream everything and if a node fails it just re-runs the whole query.  If
> it estimates that a particular phase of a query will run for an hour it can
> choose to persist the results to HDFS so that in the event of a failure
> downstream the long phase need not be re-run.  Again we want this to be
> done automatically by the system so the user doesn't need to control this
> level of detail.
>
> >
> > Again one of the talks implied there is a prototype out there that launches
> > hive jobs into tez. I would like to see that, it might answer more
> > questions than a PowerPoint, and I could profile some common queries.
>
> As mentioned in a previous email afaik Gunther's pushed all these changes
> to the Tez branch in Hive.
>
> Alan.
>
> >
> > Random late night thoughts over,
> > Ed
> >
> > On Tue, Jul 30, 2013 at 12:02 AM, Edward Capriolo <[email protected]
> >wrote:
> >
> >> At ~25:00
> >>
> >> "There is a working prototype of hive which is using tez as the targeted
> >> runtime"
> >>
> >> Can I get a look at that code? Is it on github?
> >>
> >> Edward
> >>
> >>
> >> On Wed, Jul 17, 2013 at 3:35 PM, Alan Gates <[email protected]>
> wrote:
> >>
> >>> Answers to some of your questions inlined.
> >>>
> >>> Alan.
> >>>
> >>> On Jul 16, 2013, at 10:20 PM, Edward Capriolo wrote:
> >>>
> >>>> There are some points I want to bring up. First, I am on the PMC. Here
> >>> is
> >>>> something I find relevant:
> >>>>
> >>>> http://www.apache.org/foundation/how-it-works.html
> >>>>
> >>>> ------------------------------
> >>>>
> >>>> The role of the PMC from a Foundation perspective is oversight. The
> main
> >>>> role of the PMC is not code and not coding - but to ensure that all
> >>> legal
> >>>> issues are addressed, that procedure is followed, and that each and
> >>> every
> >>>> release is the product of the community as a whole. That is key to our
> >>>> litigation protection mechanisms.
> >>>>
> >>>> Secondly the role of the PMC is to further the long term development
> and
> >>>> health of the community as a whole, and to ensure that balanced and
> wide
> >>>> scale peer review and collaboration does happen. Within the ASF we
> worry
> >>>> about any community which centers around a few individuals who are
> >>> working
> >>>> virtually uncontested. We believe that this is detrimental to quality,
> >>>> stability, and robustness of both code and long term social
> structures.
> >>>>
> >>>> --------------------------------
> >>>>
> >>>>
> >>>
> https://blogs.apache.org/comdev/entry/what_makes_apache_projects_different
> >>>>
> >>>> -------------------------------------
> >>>>
> >>>> All other decisions happen on the dev list, discussions on the private
> >>> list
> >>>> are kept to a minimum.
> >>>>
> >>>> "If it didn't happen on the dev list, it didn't happen" - which leads
> >>> to:
> >>>>
> >>>> a) Elections of committers and PMC members are published on the dev
> list
> >>>> once finalized.
> >>>>
> >>>> b) Out-of-band discussions (IRC etc.) are summarized on the dev list
> as
> >>>> soon as they have impact on the project, code or community.
> >>>> ---------------------------------
> >>>>
> >>>> https://issues.apache.org/jira/browse/HIVE-4660, ironically titled "Let
> >>>> there be Tez", has not been +1'ed by any committer. It was never discussed
> >>>> on the dev or the user list (as far as I can tell).
> >>>
> >>> As all JIRA creations and updates are sent to dev@hive, creating a
> JIRA
> >>> is de facto posting to the list.
> >>>
> >>>>
> >>>> As a PMC member I feel we need more discussion on Tez on the dev list
> >>> along
> >>>> with a wiki-fied design document. Topics of discussion should include:
> >>>
> >>> I talked with Gunther and he's working on posting a design doc on the
> >>> wiki.  He has a PDF on the JIRA but he doesn't have write permissions
> yet
> >>> on the wiki.
> >>>
> >>>>
> >>>> 1) What is tez?
> >>> In Hadoop 2.0, YARN opens up the ability to have multiple execution
> >>> frameworks in Hadoop.  Hadoop apps are no longer tied to MapReduce as
> the
> >>> only execution option.  Tez is an effort to build an execution engine
> that
> >>> is optimized for relational data processing, such as Hive and Pig.
> >>>
> >>> The biggest change here is to move away from only Map and Reduce as
> >>> processing options and to allow alternate combinations of processing,
> such
> >>> as map -> reduce -> reduce or tasks that take multiple inputs or
> shuffles
> >>> that avoid sorting when it isn't needed.
> >>>
> >>> For a good intro to Tez, see Arun's presentation on it at the recent
> >>> Hadoop summit (video http://www.youtube.com/watch?v=9ZLLzlsz7h8 slides
> >>> http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212
> )
> >>>>
> >>>> 2) How is tez different from oozie, http://code.google.com/p/hop/,
> >>>> http://cs.brown.edu/~backman/cmr.html , and other DAG and or
> streaming
> >>> map
> >>>> reduce tools/frameworks? Why should we use this and not those?
> >>>
> >>> Oozie is a completely different thing.  Oozie is a workflow engine and a
> >>> scheduler.  Its core competencies are the ability to coordinate workflows
> >>> of disparate job types (MR, Pig, Hive, etc.) and to schedule them.  It is
> >>> not intended as an execution engine for apps such as Pig and Hive.
> >>>
> >>> I am not familiar with these other engines, but the short answer is
> that
> >>> Tez is built to work on YARN, which works well for Hive since it is
> tied to
> >>> Hadoop.
> >>>>
> >>>> 3) When can we expect the first tez release?
> >>> I don't know, but I hope sometime this fall.
> >>>
> >>>>
> >>>> 4) How much effort is involved in integrating hive and tez?
> >>> Covered in the design doc.
> >>>
> >>>>
> >>>> 5) Who is ready to commit to this effort?
> >>> I'll let people speak for themselves on that one.
> >>>
> >>>>
> >>>> 6) Can we expect this work to be done in one hive release?
> >>> Unlikely.  Initial integration will be done in one release, but as Tez
> is
> >>> a new project I expect it will be adding features in the future that
> Hive
> >>> will want to take advantage of.
> >>>
> >>>>
> >>>> In my opinion we should not start any work on this tez-hive until
> these
> >>>> questions are answered to the satisfaction of the hive developers.
> >>>
> >>> Can we change this to "not commit patches"?  We can't tell willing
> people
> >>> not to work on it.
> >>>>
> >>>>
> >>>> On Mon, Jul 15, 2013 at 9:51 PM, Edward Capriolo <
> [email protected]
> >>>> wrote:
> >>>>
> >>>>>
> >>>>>>> The Hive bylaws, https://cwiki.apache.org/confluence/display/Hive/Bylaws ,
> >>>>>>> lay out what votes are needed for what.  I don't see anything there about
> >>>>>>> needing 3 +1s for a branch.  Branching would seem to fall under code
> >>>>>>> change, which requires one vote and a minimum length of 1 day.
> >>>>>
> >>>>> You could argue that all you need is one +1 to create a branch, but this
> >>>>> is more than a branch. If you are talking about something that is:
> >>>>> 1) going to cause major re-factoring of critical pieces of hive like
> >>>>> ExecDriver and MapRedTask
> >>>>> 2) going to be very disruptive to the efforts of other committers
> >>>>> 3) something that may be a major architectural change
> >>>>>
> >>>>> Getting the project on board with the idea is a good idea.
> >>>>>
> >>>>> Now I want to point something out. Here are some recent initiatives
> in
> >>>>> hive:
> >>>>>
> >>>>> 1) At one point there was a big initiative to "support oracle". After the
> >>>>> initial work, there are patches in Jira, but no one seems to care about
> >>>>> oracle support.
> >>>>> 2) Another such decision was the "support windows" one. There are
> >>>>> probably 4 windows patches waiting for review.
> >>>>> 3) I still have no clue what the official hadoop1, hadoop2, hadoop 0.23
> >>>>> support perspective is, but every couple of weeks we get another jira about
> >>>>> something not working/testing on one of those versions. It seems like
> >>>>> several builds are broken.
> >>>>> 4) Hive storage handler: after the initial implementation no one cares to
> >>>>> review any other storage handler implementation. There are 3 patches there
> >>>>> or more; I could not even find anyone willing to review the cassandra
> >>>>> storage handler I spent months on.
> >>>>> 5) ORC, Vectorization
> >>>>> 6) Windowing: committed, numerous check-style violations.
> >>>>>
> >>>>> We have !!!160+!!! PATCH_AVAILABLE Jira issues. Few active committers. We
> >>>>> are spread very thin, and embarking on another side project not involved
> >>>>> with core hive seems like the wrong direction at the moment.
> >>>>>
> >>>>>
> >>>>> On Mon, Jul 15, 2013 at 8:37 PM, Alan Gates <[email protected]>
> >>> wrote:
> >>>>>
> >>>>>>
> >>>>>> On Jul 13, 2013, at 9:48 AM, Edward Capriolo wrote:
> >>>>>>
> >>>>>>> I have started to see several refactoring patches around tez.
> >>>>>>> https://issues.apache.org/jira/browse/HIVE-4843
> >>>>>>>
> >>>>>>> This is the only mention on the hive list I can find with tez:
> >>>>>>> "Makes sense. I will create the branch soon.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Ashutosh
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Jun 11, 2013 at 7:44 PM, Gunther Hagleitner <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I am starting to work on integrating Tez into Hive (see HIVE-4660,
> >>>>>> design
> >>>>>>>> doc has already been uploaded - any feedback will be much
> >>> appreciated).
> >>>>>>>> This will be a fair amount of work that will take time to
> >>>>>> stabilize/test.
> >>>>>>>> I'd like to propose creating a branch in order to be able to do
> this
> >>>>>>>> incrementally and collaboratively. In order to progress rapidly
> with
> >>>>>> this,
> >>>>>>>> I would also like to go "commit-then-review".
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Gunther.
> >>>>>>>> "
> >>>>>>>
> >>>>>>> These refactorings are largely destructive to a number of bug fixes and
> >>>>>>> language improvements in hive. The language improvements and bug fixes
> >>>>>>> have been sitting in Jira for quite some time now, marked
> >>>>>>> patch-available and waiting for review.
> >>>>>>>
> >>>>>>> There are a few things I want to point out:
> >>>>>>> 1) Normally we create design docs in our wiki (which it is not)
> >>>>>>> 2) Normally when the change is significantly complex we get multiple
> >>>>>>> committers to comment on it (which we did not)
> >>>>>>> On point 2 no one -1'ed the branch, but this is really something that
> >>>>>>> should have required a +1 from 3 committers.
> >>>>>>
> >>>>>> The Hive bylaws,
> >>> https://cwiki.apache.org/confluence/display/Hive/Bylaws, lay out what
> >>> votes are needed for what.  I don't see anything there about
> >>>>>> needing 3 +1s for a branch.  Branching would seem to fall under code
> >>>>>> change, which requires one vote and a minimum length of 1 day.
> >>>>>>
> >>>>>>>
> >>>>>>> I for one am not completely sold on Tez.
> >>>>>>> http://incubator.apache.org/projects/tez.html.
> >>>>>>> "directed-acyclic-graph of tasks for processing data" this
> >>> description
> >>>>>>> sounds like many things which have never become popular. One to
> think
> >>>>>> of is
> >>>>>>> oozie "Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of
> >>>>>>> actions.". I am sure I can find a number of libraries/frameworks
> that
> >>>>>> make
> >>>>>>> this same claim. In general I do not feel like we have done our
> >>> homework
> >>>>>>> and pre-requisites to justify all this work. If we have done the
> >>>>>> homework,
> >>>>>>> I am sure that it has not been communicated and accepted by hive
> >>>>>> developers
> >>>>>>> at large.
> >>>>>>
> >>>>>> A request for better documentation on Tez and a project road map
> seems
> >>>>>> totally reasonable.
> >>>>>>
> >>>>>>>
> >>>>>>> If we have a branch, why are we also committing on trunk? Scanning through
> >>>>>>> the tez doc I keep finding language like "minimal changes to the
> >>>>>>> planner", yet there are ALREADY lots of large changes going on!
> >>>>>>>
> >>>>>>> Really none of the above would bother me except for the fact that these
> >>>>>>> "minimal changes" are causing many "patch available" ready-for-review
> >>>>>>> bugs and core hive features to need to be rebased.
> >>>>>>>
> >>>>>>> I am sure I have mentioned this before, but I have to spend 12+
> >>> hours to
> >>>>>>> test a single patch on my laptop. A few days ago I was testing a
> new
> >>>>>> core
> >>>>>>> hive feature. After all the tests passed and before I was able to
> >>>>>> commit,
> >>>>>>> someone unleashed a tez patch on trunk which caused the thing I was
> >>>>>> testing
> >>>>>>> for 12 hours to need to be rebased.
> >>>>>>>
> >>>>>>>
> >>>>>>> I'm not cool with this. Next time that happens to me I will seriously
> >>>>>>> consider reverting the patch. Bug fixes and new hive features are more
> >>>>>>> important to me than integrating with incubator projects.
> >>>>>>
> >>>>>> (With my Apache member hat on)  Reverting patches that aren't
> breaking
> >>>>>> the build is considered very bad form in Apache.  It does make sense
> >>> to
> >>>>>> request that when people are going to commit a patch that will break
> >>> many
> >>>>>> other patches they first give a few hours of notice so people can
> say
> >>>>>> something if they're about to commit another patch and avoid your
> >>> fate of
> >>>>>> needing to rerun the tests.  The other thing is we need to get the
> >>>>>> automated build of patches working on Hive so committers aren't forced
> >>>>>> to run all of the tests themselves.  We are working on it, but we're
> >>>>>> not there yet.
> >>>>>>
> >>>>>> Alan.
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
>
>
