Ramana,

I think the issue with licenses is mostly resolved. It was discussed that
for TPC-*, since we shall not be redistributing the data-gen software, but
distributing a randomized variant of the data generated by it, we should be
okay to include it as part of our framework. For other datasets, we shall
either provide a copy of their license with our framework, or simply provide
a link for users to download data before they execute.

For now we should focus on having the framework out with minimal cleanup.
In the near future we can work on setting up infrastructure and enhancing the
framework itself.

-Abhishek

On Wed, Aug 5, 2015 at 10:46 AM, Ramana I N <[email protected]> wrote:

> @Jacques, Ted
>
> > in the mean time, we risk patches being merged that have less than complete
> > testing.
>
>
> While I agree with the premise of getting the tests out as soon as possible
> it does not help us achieve anything except transparency. Your statement
> that getting the tests out will increase quality is dependent on someone
> actually being able to run the tests once they have access to it.
>
> Maybe we should focus on making a Jenkins job to run the tests publicly.
> With that in place we can exclude the TPC* datasets as well as the Yelp
> data sets from the framework and avoid licensing issues.
>
> Regards
> Ramana
>
>
> On Tue, Aug 4, 2015 at 11:39 AM, Abhishek Girish <[email protected]>
> wrote:
>
> > We not only re-distribute external data-sets as-is, but also include
> > variants of those (text -> parquet, json, ...). So the challenge here is
> > not simply disabling automatic downloads via the framework and pointing
> > users to manually download the files before running the framework, but
> > also how we will handle tests which require variants of the data sets.
> > It simply isn't practical for users of the framework to (1) download
> > data-gen manually, (2) use specific seed / options before generating
> > data, (3) convert them to parquet, etc., (4) move them to specific
> > locations inside their copy of the framework.
> >
> > Something we'll need to know is how other projects are handling
> > bench-mark & other external datasets.
> >
> > -Abhishek
> >
> > On Tue, Aug 4, 2015 at 11:23 AM, rahul challapalli <[email protected]>
> > wrote:
> >
> > > Thanks for your inputs.
> > >
> > > One issue with just publishing the tests in their current state is that
> > > the framework re-distributes tpch, tpcds, yelp data sets without
> > > requiring the users to accept their relevant licenses. A good number of
> > > tests use these data sets. Any thoughts on how to handle this?
> > >
> > > - Rahul
> > >
> > > On Wed, Jul 29, 2015 at 12:07 AM, Ted Dunning <[email protected]>
> > > wrote:
> > >
> > > > +1.  Get it out there.
> > > >
> > > >
> > > >
> > > > On Tue, Jul 28, 2015 at 10:12 PM, Jacques Nadeau <[email protected]>
> > > > wrote:
> > > >
> > > > > Hey Rahul,
> > > > >
> > > > > My suggestion would be to lower the bar--do the absolute bare
> > > > > minimum to get the tests out there.  For example, simply remove
> > > > > proprietary information and then get it on a public github (whether
> > > > > your personal github or a corporate one).  From there, people can
> > > > > help by submitting pull requests to improve the infrastructure and
> > > > > harness.  Making things easier is something that can be done over
> > > > > time.  For example, we've had offers from a couple different Linux
> > > > > Admins to help on something.  I'm sure that they could help with a
> > > > > number of the items you've identified.  In the mean time, we risk
> > > > > patches being merged that have less than complete testing.
> > > > >
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Mon, Jul 27, 2015 at 2:16 PM, rahul challapalli <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Jacques,
> > > > > >
> > > > > > I am breaking down steps 1,2 & 3 into sub-tasks so we can
> > > > > > add/prioritize these tasks
> > > > > >
> > > > > > Columns: Item #, Task, Sub-Task, Comments, Priority
> > > > > >
> > > > > > 1. *Publish the tests*
> > > > > >    - Remove Proprietary Data & Queries (Priority: 0)
> > > > > >    - Redact Proprietary Data/Queries
> > > > > >    - Move tests into drill repo
> > > > > >      Comments: This requires some refactoring to the framework
> > > > > >      code since the test framework uses a 2-level directory
> > > > > >      structure
> > > > > >    - Organize the tests using a label based approach
> > > > > >      Comments: This involves code changes and moving a lot of
> > > > > >      files. When doing a one time push it might be better to do
> > > > > >      this before publishing the tests?
> > > > > >    - Each suite should be independent
> > > > > >      Comments: Some suites wrongly assume that the data is
> > > > > >      present. They should be identified and fixed
> > > > > >    - Cleanup hardcoded dependencies during data generation
> > > > > >      Comments: Some data-gen scripts have hard-coded references
> > > > > >    - Cleanup downloads
> > > > > >      Comments: The same dataset is being downloaded multiple
> > > > > >      times by different suites
> > > > > >    - Licenses for downloads
> > > > > >      Comments: The framework downloads some files automatically.
> > > > > >      These files are publicly available. However before
> > > > > >      downloading them users need to agree to certain terms. By
> > > > > >      using the framework users might be skipping this step. We
> > > > > >      should look into this
> > > > > >
> > > > > > 2. *Setup a cluster infrastructure to run the pre-commit tests*
> > > > > >
> > > > > > 3. *Local debugging of tests*
> > > > > >    - Add an optional maven target for running tests on a local
> > > > > >      machine
> > > > > >      Comments: Tests can launch an embedded drillbit or they can
> > > > > >      connect to a running drillbit through zookeeper
> > > > > >    - Running suites which require additional setup (hive, hbase
> > > > > >      etc) should be made optional
> > > > > >
> > > > > > 4. *Documentation*
> > > > > >    - Running Tests (options available and also listing the
> > > > > >      assumed defaults)
> > > > > >    - Explaining how tests are organized
> > > > > >    - Process for adding a new suite
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 24, 2015 at 1:40 PM, Jacques Nadeau <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Let's get number one done (tests out there so all community
> > > > > > > members can run them).  Then the whole community can work
> > > > > > > together to solve the rest.
> > > > > > >
> > > > > > > I don't think the base install should include integration test
> > > > > > > execution.  I do think the tests should be in the main repo (as
> > > > > > > opposed to a secondary).
> > > > > > >
> > > > > > > We should strive to ultimately make running these integration
> > > > > > > tests a requirement for merging.  We need to complete all the
> > > > > > > steps before we can impose that.  I should be able to help on
> > > > > > > the global run component and supporting infrastructure.
> > > > > > >
> > > > > > > J
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jacques Nadeau
> > > > > > > CTO and Co-Founder, Dremio
> > > > > > >
> > > > > > > On Fri, Jul 24, 2015 at 1:29 PM, rahul challapalli <
> > > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > > > Ramana,
> > > > > > > >
> > > > > > > > You are right. We are trying to address multiple issues here,
> > > > > > > > but not with a single solution. I am summarizing them
> > > > > > > >
> > > > > > > > 1. Tests should be visible to everyone (Implicit goal)
> > > > > > > > 2. Before applying a patch we should run tests in a clustered
> > > > > > > > environment. Parth had a suggestion(#4) in his original email.
> > > > > > > > 3. Developers should be able to debug the majority of the
> > > > > > > > tests on their local environment. I made a few suggestions
> > > > > > > > above in this regard
> > > > > > > >
> > > > > > > > - Rahul
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Jul 24, 2015 at 10:40 AM, Ramana I N <
> > > > > > > > [email protected]> wrote:
> > > > > > > >
> > > > > > > > > One important thing which we need to be clear on here is
> > > > > > > > > what are we trying to address?
> > > > > > > > >
> > > > > > > > > I feel there are two separate issues here and I do not think
> > > > > > > > > one solution will fit both the issues.
> > > > > > > > >
> > > > > > > > >    1. Allowing developers to run tests on their local box so
> > > > > > > > >    they know the changes they have are not completely wrong.
> > > > > > > > >    2. Allowing transparency in the integration tests process
> > > > > > > > >    which is currently a black box.
> > > > > > > > >
> > > > > > > > > 1 is needed for developers to make changes and have an idea
> > > > > > > > > that their changes are not going to fail tests en masse in
> > > > > > > > > the integration suite. 2 is needed because it's a
> > > > > > > > > prerequisite for changes to be committed.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > > Ramana
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Jul 24, 2015 at 10:28 AM, rahul challapalli <
> > > > > > > > > [email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Ramana,
> > > > > > > > > >
> > > > > > > > > > Let me fill in more details.
> > > > > > > > > >
> > > > > > > > > > 1. Before we accept a patch we want to make sure the tests
> > > > > > > > > > run in a cluster environment. No exceptions here.
> > > > > > > > > > 2. We want the contributors to be able to debug the failing
> > > > > > > > > > tests on their laptops in as many cases as possible. This
> > > > > > > > > > requires:
> > > > > > > > > >         1. Tests should run on top of a local file system.
> > > > > > > > > >         (Tests can launch an embedded drillbit or they can
> > > > > > > > > >         connect to a running drillbit through zookeeper)
> > > > > > > > > >         2. Running suites which require additional setup
> > > > > > > > > >         (hive, hbase etc) should be made optional and
> > > > > > > > > >         sufficient documentation should be provided for
> > > > > > > > > >         enabling and disabling these tests.
> > > > > > > > > > 3. In my opinion making these new tests part of drill would
> > > > > > > > > > make it easier for the developers to debug and run tests
> > > > > > > > > > instead of having a different repository. But as you said
> > > > > > > > > > it might bloat the drill project
> > > > > > > > > >
> > > > > > > > > > - Rahul
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 24, 2015 at 9:42 AM, Ted Dunning <
> > > > > > > > > > [email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > The Hadoop family of projects has some software that
> > > > > > > > > > > integrates a continuous integration system so that every
> > > > > > > > > > > time a JIRA is marked as patch-available, the associated
> > > > > > > > > > > patch attached to the bug will have integration tests run
> > > > > > > > > > > against it.  I believe that there has been some process to
> > > > > > > > > > > use git hashes instead of patches.  The CI results are put
> > > > > > > > > > > back on the JIRA.
> > > > > > > > > > >
> > > > > > > > > > > This is done using a fairly simple set of scripts.  Apache
> > > > > > > > > > > Yetus is just forming as a direct-to-top-level spinoff from
> > > > > > > > > > > Hadoop
> > > > > > > > > > >
> > > > > > > > > > > Proposal is here (don't be fooled by the fact that it looks
> > > > > > > > > > > like an incubation proposal):
> > > > > > > > > > >
> > > > > > > > > > > http://wiki.apache.org/incubator/YetusProposal
> > > > > > > > > > >
> > > > > > > > > > > Early code can be found here (don't guess that this is very
> > > > > > > > > > > real yet).  More links can be found in the proposal.
> > > > > > > > > > >
> > > > > > > > > > > https://github.com/sekikn/pre-yetus/tree/master/precommit/docs
> > > > > > > > > > >
> > > > > > > > > > > The project has not yet been formed and there are no
> > > > > > > > > > > mailing lists or git repo yet.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jul 24, 2015 at 9:25 AM, Ramana I N <
> > > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > As someone who worked on this for a while, including it
> > > > > > > > > > > > as part of drill may bloat drill a bit too much. Also not
> > > > > > > > > > > > a big fan of running against an embedded drillbit. Does
> > > > > > > > > > > > not replicate an actual production use case.
> > > > > > > > > > > >
> > > > > > > > > > > > Additionally, setting up hive hbase and other components
> > > > > > > > > > > > may be painful and unnecessary for most ppl. It would
> > > > > > > > > > > > deter people from ever contributing to drill. We could
> > > > > > > > > > > > spin up in memory hive and hbase but that's similar to an
> > > > > > > > > > > > embedded drill bit. Does not replicate a production
> > > > > > > > > > > > scenario.
> > > > > > > > > > > >
> > > > > > > > > > > > Would prefer the hive way with a central Jenkins server
> > > > > > > > > > > > hosted on aws and accessible to everyone.  Users should be
> > > > > > > > > > > > able to submit a git url and that should be able to deploy
> > > > > > > > > > > > and fire off tests. Should then have a way to easily
> > > > > > > > > > > > communicate failures to contributors and if success notify
> > > > > > > > > > > > the committers to commit the change.
> > > > > > > > > > > >
> > > > > > > > > > > > Ps: if hive's way is open source maybe we can look into
> > > > > > > > > > > > reuse rather than doing it from scratch. Esp the Jenkins
> > > > > > > > > > > > and configuration stuff.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards
> > > > > > > > > > > > Ramana
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thursday, July 23, 2015, Parth Chandra <
> > > > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Drill devs use a set of tests that are not available as
> > > > > > > > > > > > > part of the Apache distribution. These tests are a
> > > > > > > > > > > > > pre-requisite for all commits, but are not available to
> > > > > > > > > > > > > any contributors outside the current devs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This thread is to discuss various options to make these
> > > > > > > > > > > > > tests available.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Assumptions and requirements  -
> > > > > > > > > > > > > 1) A functional test (as opposed to a unit test) needs to
> > > > > > > > > > > > > be closer to the end user environment than a development
> > > > > > > > > > > > > environment. As such, we should be running functional
> > > > > > > > > > > > > tests in a cluster environment, connect using zookeeper
> > > > > > > > > > > > > etc.
> > > > > > > > > > > > > 2) Functional tests will keep increasing in number, get
> > > > > > > > > > > > > more complex and take a longer and longer time to execute
> > > > > > > > > > > > > as we go along.
> > > > > > > > > > > > > 3) Some requirements are:
> > > > > > > > > > > > >     a) We want to be strict in enforcing the pre-commit
> > > > > > > > > > > > > requirements, but not penalize the contributor who has a
> > > > > > > > > > > > > minor fix.
> > > > > > > > > > > > >     b) All parts of the product (especially various
> > > > > > > > > > > > > 'certified' storage plugins like Hive and Hbase) should
> > > > > > > > > > > > > get tested
> > > > > > > > > > > > >     c) It should be easy to debug issues when a test
> > > > > > > > > > > > > fails. Tests should fail deterministically. If a test
> > > > > > > > > > > > > fails, it should always fail and always fail in the same
> > > > > > > > > > > > > way (easier said than done).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Some suggestions -
> > > > > > > > > > > > > 1) Tests should be a top-level maven module within the
> > > > > > > > > > > > > drill project
> > > > > > > > > > > > >         a) We want the integration tests to run as part
> > > > > > > > > > > > > of drill's maven build process
> > > > > > > > > > > > >         b) The build step for the integration-tests
> > > > > > > > > > > > > module would launch an embedded drillbit and run tests
> > > > > > > > > > > > > against it
> > > > > > > > > > > > >         c) The tests will be a separate target so they
> > > > > > > > > > > > > need not be run all the time
> > > > > > > > > > > > > 2) Tests should be divided into multiple suites that are
> > > > > > > > > > > > > based on components. For example a test suite for testing
> > > > > > > > > > > > > datatypes will contain the tests for various datatypes
> > > > > > > > > > > > > including complex types. A contributor or developer can
> > > > > > > > > > > > > then run these tests more frequently as an issue is being
> > > > > > > > > > > > > addressed and run the entire suite only once before
> > > > > > > > > > > > > commit.
> > > > > > > > > > > > > 3) Provide the tests as a hosted service
> > > > > > > > > > > > > 4) Setup a bot to fire the test on an AWS cluster and
> > > > > > > > > > > > > post the results to the JIRA  (Hive does this). Or some
> > > > > > > > > > > > > variant of this idea.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Some questions -
> > > > > > > > > > > > > 1) What do some other projects do?
> > > > > > > > > > > > > 2) Are there any technologies we can leverage that will
> > > > > > > > > > > > > make this easier?
> > > > > > > > > > > > > 3) How do we make it easier to debug failing tests?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please feel free to question the assumptions and
> > > > > > > > > > > > > requirements. Be creative with your suggestions.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Parth
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
