Hi again,

Understood. If later on we (Beam) decide to put this in place, I can help a bit, since this is a subject I like, and it is clear to me that this idea can have immediate benefits (better integration tests and, of course, better IOs/runners).

Ismael

On Thu, Jul 28, 2016 at 10:38 PM, Kenneth Knowles <[email protected]> wrote:

> Hi Ismaël,
>
> I was just talking in general about what any project would want to do. I don't have any specific plans.
>
> Kenn
>
> On Thu, Jul 28, 2016 at 12:53 PM, Ismaël Mejía <[email protected]> wrote:
>
> > Kenneth, this is great news (I am talking about the additional services). I was just discussing with JB the other day how nice it would be to have this kind of test, with the right infrastructure, since we are working on new IOs, e.g. to test certain particular behaviors with Kafka or other systems, how the IOs react to failure, etc.
> >
> > It is nice to know that this can be supported. Any concrete plans for how to make this work? Do you intend to deploy such systems via containers or just have them in some test cluster?
> >
> > As Aljoscha mentions, just Kafka or YARN both need quite a bit of 'extra' dependencies at deploy time.
> >
> > Thanks again for this idea,
> > Ismael.
> >
> > On Thu, Jul 28, 2016 at 6:48 PM, Aljoscha Krettek <[email protected]> wrote:
> >
> > > For Flink, YARN is fine and I guess it's the common denominator for all runners (except DataflowRunner, of course).
> > >
> > > @Kenn IMHO the common deployment is Kafka (running standalone, because it only works that way), which also requires Zookeeper (if I'm not mistaken), and YARN, which all runners should be able to run on.
> > >
> > > On Thu, 28 Jul 2016 at 18:36 Kenneth Knowles <[email protected]> wrote:
> > >
> > > > Presumably we'll eventually also run additional services alongside (like Kafka) to have true integration tests for I/O connectors. What is the common deployment in this case?
> > > >
> > > > On Jul 28, 2016 06:35, "Amit Sela" <[email protected]> wrote:
> > > >
> > > > > So what would be the preferred resource manager to test Flink on?
> > > > >
> > > > > On Thu, Jul 28, 2016, 16:34 Aljoscha Krettek <[email protected]> wrote:
> > > > >
> > > > > > Flink also has a standalone mode.
> > > > > >
> > > > > > On Thu, 28 Jul 2016 at 13:42 Ismaël Mejía <[email protected]> wrote:
> > > > > >
> > > > > > > Good subject. YARN is the de-facto standard, at least from the point of view of the Big Data distributions (Cloudera, Hortonworks, etc.) and cloud offerings (e.g. AWS EMR, Azure HDInsight and Google Dataproc), and given that it is supported by both Spark and Flink I think it is valuable to test the support for YARN. The question is, should the tests be run on 'Standalone OR YARN', or maybe we can have tests for 'Standalone AND YARN'?
> > > > > > >
> > > > > > > Ismael.
> > > > > > >
> > > > > > > On Thu, Jul 28, 2016 at 12:24 PM, Amit Sela <[email protected]> wrote:
> > > > > > >
> > > > > > > > Following a discussion I had with Kenneth and Dan here <https://github.com/apache/incubator-beam/pull/711>, I want to raise the issue of which resource manager we should use for ongoing tests that will run on actual clusters (on top of local/in-mem tests).
> > > > > > > > If we plan to test all runners on all their supported resource managers, great! But I guess this won't be the case, at least not at the beginning.
> > > > > > > >
> > > > > > > > Spark can run its own (Standalone Mode) resource manager, use YARN or use Mesos. According to the latest survey <http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf> by Databricks, Standalone is in the lead (48%), with YARN tailing it (40%), while Mesos looks like the least favourite.
> > > > > > > > For Spark, I'd vote for Standalone as it is the most popular use case, and it avoids the additional complexity of maintaining YARN on this cluster.
> > > > > > > > Having said that, AFAIK Flink is a "first-class" YARN citizen (right?) and I don't know what available resource managers can be used by other runners, so I think runner authors should give their input here.
> > > > > > > >
> > > > > > > > *Summary:*
> > > > > > > > *Spark* - Standalone Mode or YARN (in that order).
> > > > > > > > *Flink* - ?
> > > > > > > > *Others* - ?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Amit
