+1 (non-binding) for Option C

Michael

On Fri, Apr 14, 2017 at 1:50 PM, <[email protected]> wrote:

> +1
>
> On 2017-04-14 17:59, Austin Leahy wrote:
>
>> I think that C is the strong solution; getting the ingest really strong is
>> going to lower barriers to adoption. Doing it in Python will open up the
>> ingest portion of the project to include many more developers.
>>
>> Before it comes up, I would like to throw the following on the pile...
>> Major Python projects (Django/Flask, others) are dropping 2.x support in
>> releases scheduled in the next 6 to 8 months. Hadoop projects in general
>> tend to lag in modern Python support, so let's please build this in 3.5 so
>> that we don't have to immediately expect a rebuild in the pipeline.
>>
>> -Vote C
>>
>> Thanks Nate
>>
>> Austin
>>
>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
>>
>>> I really like option C because it gives a lot of flexibility for ingest
>>> (Python vs. Scala) but still has the robust Spark Streaming backend for
>>> performance.
>>>
>>> Thanks for putting this together Nate.
>>>
>>> Alan
>>>
>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:
>>>
>>> > I agree. We should continue making the existing stack more mature at
>>> > this point. Maybe if we have enough community support we can add
>>> > additional datastores.
>>> >
>>> > Chokha.
>>> >
>>> > On 4/14/17 11:10 AM, [email protected] wrote:
>>> > > Hi Kant,
>>> > >
>>> > > YARN is the standard scheduler in Hadoop. If you're using Hive+Spark,
>>> > > then sure you'll have YARN.
>>> > >
>>> > > Haven't seen any Hive on Mesos so far. As said, Spot is based on a
>>> > > quite standard Hadoop stack and I wouldn't switch too many pieces yet.
>>> > >
>>> > > In most open-source projects you start relying on a well-known stack
>>> > > and then you begin to support other DB backends once it's quite
>>> > > mature. Think of the loads of LAMP apps which haven't been ported away
>>> > > from MySQL yet.
>>> > >
>>> > > In any case, you'll need high-performance SQL + massive storage +
>>> > > machine learning + massive ingestion, and... ATM, that can only be
>>> > > provided by Hadoop.
>>> > >
>>> > > Regards!
>>> > >
>>> > > Kenneth
>>> > >
>>> > > On 2017-04-14 12:56, kant kodali wrote:
>>> > >> Hi Kenneth,
>>> > >>
>>> > >> Thanks for the response. I think you made a case for HDFS; however,
>>> > >> users may want to use S3 or some other FS, in which case they can use
>>> > >> Auxilio (hoping that there are no changes needed within Spot, in which
>>> > >> case I can agree to that). For example, Netflix stores all their data
>>> > >> in S3.
>>> > >>
>>> > >> The distributed SQL query engine, I would say, should be pluggable with
>>> > >> whatever the user may want to use, and there are a bunch of them out
>>> > >> there. Sure, Impala is better than Hive, but what if users are already
>>> > >> using something else like Drill or Presto?
>>> > >>
>>> > >> Personally, I would not assume that users are willing to deploy all of
>>> > >> that and make their existing stack more complicated; at the very least
>>> > >> I would say it is an uphill battle. Things have been changing rapidly
>>> > >> in the big data space, so whatever we think is standard won't be
>>> > >> standard anymore, but importantly there shouldn't be any reason why we
>>> > >> shouldn't be flexible, right?
>>> > >>
>>> > >> Also, I am not sure why only YARN? Why not make that more flexible too,
>>> > >> so users can pick Mesos or standalone?
>>> > >>
>>> > >> I think flexibility is key for wide adoption, rather than a tightly
>>> > >> coupled architecture.
>>> > >>
>>> > >> Thanks!
>>> > >>
>>> > >> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:
>>> > >>
>>> > >>> PS: you need a big data platform to be able to collect all those
>>> > >>> netflows and logs.
>>> > >>>
>>> > >>> Spot isn't intended for SMBs, that's clear; you need loads of data to
>>> > >>> get ML working properly, and somewhere to run those algorithms. That
>>> > >>> is Hadoop.
>>> > >>>
>>> > >>> Regards!
>>> > >>>
>>> > >>> Kenneth
>>> > >>>
>>> > >>> Sent from my Mi phone
>>> > >>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
>>> > >>>
>>> > >>> Hi,
>>> > >>>
>>> > >>> Thanks for starting this thread. Here is my feedback.
>>> > >>>
>>> > >>> I somehow think the architecture is too complicated for wide adoption,
>>> > >>> since it requires installing the following:
>>> > >>>
>>> > >>> HDFS.
>>> > >>> HIVE.
>>> > >>> IMPALA.
>>> > >>> KAFKA.
>>> > >>> SPARK (YARN).
>>> > >>> YARN.
>>> > >>> Zookeeper.
>>> > >>>
>>> > >>> Currently there are way too many dependencies, which discourages a lot
>>> > >>> of users from using it because they have to go through deployment of
>>> > >>> all that required software. I think for wide adoption we should
>>> > >>> minimize the dependencies and have a more pluggable architecture. For
>>> > >>> example, I am not sure why Hive and Impala are both required. Why not
>>> > >>> just use Spark SQL, since it's already a dependency? Or users may want
>>> > >>> to use their own distributed query engine, such as Apache Drill or
>>> > >>> something else. We should be flexible enough to provide that option.
>>> > >>>
>>> > >>> Also, I see that HDFS is used such that collectors can receive file
>>> > >>> paths through Kafka and be able to read a file. How big are these
>>> > >>> files? Do we really need HDFS for this?
>>> > >>> Why not provide more ways to send data, such as sending data directly
>>> > >>> through Kafka, or just leaving it up to the user to specify the file
>>> > >>> location as an argument to the collector process?
>>> > >>>
>>> > >>> Finally, I learnt that generating NetFlow data requires specific
>>> > >>> hardware. This really means Apache Spot is not meant for everyone. I
>>> > >>> thought Apache Spot could be used to analyze the network traffic of any
>>> > >>> machine, but if it requires specific hardware then I think it is
>>> > >>> targeted at a specific group of people.
>>> > >>>
>>> > >>> The real strength of Apache Spot should mainly be just analyzing
>>> > >>> network traffic through ML.
>>> > >>>
>>> > >>> Thanks!
>>> > >>>
>>> > >>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:
>>> > >>>
>>> > >>> > Thanks, Nate,
>>> > >>> >
>>> > >>> > Nate.
>>> > >>> >
>>> > >>> > -----Original Message-----
>>> > >>> > From: Nate Smith [mailto:[email protected]]
>>> > >>> > Sent: Thursday, April 13, 2017 4:26 PM
>>> > >>> > To: [email protected]
>>> > >>> > Cc: [email protected]; [email protected]
>>> > >>> > Subject: Re: [Discuss] - Future plans for Spot-ingest
>>> > >>> >
>>> > >>> > I was really hoping it came through ok,
>>> > >>> > Oh well :)
>>> > >>> > Here’s an image form:
>>> > >>> > http://imgur.com/a/DUDsD
>>> > >>> >
>>> > >>> > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
>>> > >>> > >
>>> > >>> > > The diagram became garbled in the text format.
>>> > >>> > > Could you resend it as a pdf?
>>> > >>> > >
>>> > >>> > > Thanks,
>>> > >>> > > Nate
>>> > >>> > >
>>> > >>> > > -----Original Message-----
>>> > >>> > > From: Nathanael Smith [mailto:[email protected]]
>>> > >>> > > Sent: Thursday, April 13, 2017 4:01 PM
>>> > >>> > > To: [email protected]; [email protected]; [email protected]
>>> > >>> > > Subject: [Discuss] - Future plans for Spot-ingest
>>> > >>> > >
>>> > >>> > > How would you like to see Spot-ingest change?
>>> > >>> > >
>>> > >>> > > A. Continue development on the Python Master/Worker with focus on
>>> > >>> > >    performance / error handling / logging
>>> > >>> > > B. Develop Scala-based ingest to be in line with the code base from
>>> > >>> > >    ingest, ml, to OA (UI to continue being ipython/JS)
>>> > >>> > > C. Python ingest Worker with Scala-based Spark code for
>>> > >>> > >    normalization and input into DB
>>> > >>> > >
>>> > >>> > > Including the high level diagram:
>>> > >>> > >
>>> > >>> > > [ASCII diagram garbled by line wrapping in the quoted text.
>>> > >>> > >  Recoverable labels: Master (A. Python, B. Scala, C. Python);
>>> > >>> > >  Worker (A. Python); Spark Streaming (B. Scala, C. Scala);
>>> > >>> > >  Note: "Running on a worker node in the Hadoop cluster";
>>> > >>> > >  Local FS (Binary/Text Log files); hdfs; Hive / Impala (Parquet)]
>>> > >>> > >
>>> > >>> > > Please let me know your thoughts,
>>> > >>> > >
>>> > >>> > > - Nathanael
>>> > >>> > >

--
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
