Option C is to use Python on the front end of the ingest pipeline and Spark/Scala on the back end.
Option A uses Python workers on the back end; Option B uses all Scala.

-----Original Message-----
From: kant kodali [mailto:[email protected]]
Sent: Friday, April 14, 2017 9:53 AM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:

+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:

I think that C is the strong solution: getting the ingest really solid is going to lower barriers to adoption, and doing it in Python will open up the ingest portion of the project to many more developers.

Before it comes up, I would like to throw the following on the pile: major Python projects (Django, Flask, and others) are dropping 2.x support in releases scheduled over the next 6 to 8 months. Hadoop projects in general tend to lag in modern Python support, so let's please build this on Python 3.5 so that we don't have to immediately expect a rebuild of the pipeline.

-Vote C

Thanks Nate,

Austin

On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest (Python vs. Scala) but still has the robust Spark Streaming back end for performance.

Thanks for putting this together, Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:

I agree. We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.

Chokha.

On 4/14/17 11:10 AM, [email protected] wrote:

Hi Kant,

YARN is the standard scheduler in Hadoop. If you're using Hive + Spark, then sure, you'll have YARN.

I haven't seen any Hive on Mesos so far.
As said, Spot is based on a fairly standard Hadoop stack, and I wouldn't switch too many pieces yet.

In most open-source projects you start by relying on a well-known stack and then begin to support other DB backends once the project is quite mature. Think of the many LAMP apps that still haven't been ported away from MySQL.

In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and at the moment that can only be provided by Hadoop.

Regards!

Kenneth

On 2017-04-14 12:56, kant kodali wrote:

Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other filesystem, in which case they could use Alluxio (hoping that no changes would be needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with whatever the user wants, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else, like Drill or Presto?

Personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least, I would say it is an uphill battle. Things have been changing rapidly in the big data space, so whatever we think is standard won't stay standard; more importantly, there shouldn't be any reason not to be flexible, right?

Also, I am not sure why only YARN? Why not make that more flexible too, so users can pick Mesos or standalone?
I think flexibility is the key to wide adoption, rather than a tightly coupled architecture.

Thanks!

On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:

PS: you need a big data platform to be able to collect all those netflows and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.

Regards!

Kenneth

Sent from my Mi phone

On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:

Hi,

Thanks for starting this thread. Here is my feedback.

I somehow think the architecture is too complicated for wide adoption, since it requires installing the following:

HDFS
Hive
Impala
Kafka
Spark (on YARN)
YARN
Zookeeper

Currently there are way too many dependencies, which discourages a lot of users, because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why both Hive and Impala are required. Why not just use Spark SQL, since it is already a dependency? Or users may want to use a distributed query engine of their own choosing, such as Apache Drill or something else; we should be flexible enough to provide that option.

Also, I see that HDFS is used so that collectors can receive file paths through Kafka and then read those files. How big are these files?
Do we really need HDFS for this? Why not provide more ways to send data, such as sending the data directly through Kafka, or simply leaving it up to the user to specify the file location as an argument to the collector process?

Finally, I learned that generating netflow data requires specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware, then I think it is targeted at a specific group of people.

The real strength of Apache Spot should mainly be just analyzing network traffic through ML.

Thanks!

On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:

Thanks, Nate.

Nate.

-----Original Message-----
From: Nate Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:26 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I was really hoping it came through OK. Oh well :) Here it is in image form: http://imgur.com/a/DUDsD

On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:

The diagram became garbled in the text format. Could you resend it as a PDF?
Thanks,
Nate

-----Original Message-----
From: Nathanael Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:01 PM
To: [email protected]; [email protected]; [email protected]
Subject: [Discuss] - Future plans for Spot-ingest

How would you like to see Spot-ingest change?

A. Continue development on the Python Master/Worker, with focus on performance / error handling / logging
B. Develop a Scala-based ingest to be in line with the code base from ingest and ML to OA (the UI to continue being IPython/JS)
C. Python ingest Worker with Scala-based Spark code for normalization and input into the DB

Including the high-level diagram:

[The ASCII diagram was garbled by quoting; an image is at http://imgur.com/a/DUDsD. Roughly: a Master (A: Python, B: Scala, C: Python), running on a worker node in the Hadoop cluster, dispatches to Workers (A: Python; B and C: Scala Spark Streaming); data flows from the local FS (binary/text log files) into HDFS (as Parquet) and is queried via Hive / Impala.]

Please let me know your thoughts,

- Nathanael
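[Editor's note] Option C above proposes a lightweight Python front end handing raw records to a Scala/Spark back end that normalizes them before they land in the DB. As a toy illustration of that division of labor, here is a sketch with both halves written in Python for brevity; in option C proper the normalization stage would be Scala Spark Streaming code, and the CSV field names and schema below are invented for the example, not Spot's actual netflow schema.

```python
# Toy sketch of option C's two stages. Both are Python here for brevity;
# in option C the normalize stage would run as Scala Spark Streaming code.
import csv
import io

def collect(raw_text):
    """Front-end worker: split raw input into records without interpreting them."""
    return [line for line in raw_text.splitlines() if line.strip()]

def normalize(record):
    """Back-end stage: parse a CSV netflow-like record into a flat dict
    suitable for writing as a Parquet row (field names are made up)."""
    src, dst, sport, dport, nbytes = next(csv.reader(io.StringIO(record)))
    return {"src_ip": src, "dst_ip": dst,
            "src_port": int(sport), "dst_port": int(dport),
            "bytes": int(nbytes)}

raw = "10.0.0.1,10.0.0.2,5353,53,120\n10.0.0.3,10.0.0.4,44123,443,900\n"
rows = [normalize(r) for r in collect(raw)]
```

In the real pipeline the worker would publish records to a Kafka topic and the Spark job would consume them; the point of the split is that the front end stays simple enough for Python contributors while the heavy lifting stays in the JVM.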

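[Editor's note] On the ingest-flexibility point raised earlier in the thread (file paths delivered over Kafka vs. data pushed directly through it), a collector can support both behind one interface. A minimal sketch, assuming a hypothetical JSON message format that is not Spot's actual wire format:

```python
# Illustrative sketch only: a collector that accepts either an inline payload
# or a file path in a Kafka-style message. The JSON envelope and field names
# here are hypothetical, not Spot's actual wire format.
import json
import os
import tempfile

def handle_message(msg_value):
    """Return raw event text, whether the message carries the data inline
    or merely points at a file for the collector to read."""
    msg = json.loads(msg_value)
    if "data" in msg:                 # data pushed directly through Kafka
        return msg["data"]
    with open(msg["path"]) as fh:     # or a path (local FS here; HDFS or S3
        return fh.read()              # could sit behind the same interface)

# Inline payload: no filesystem involved at all.
inline = handle_message(json.dumps({"data": "flow-record"}))

# File-path payload: write a throwaway file and point the collector at it.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("flow-from-file")
    tmp_path = tmp.name
from_file = handle_message(json.dumps({"path": tmp_path}))
os.remove(tmp_path)
```

The design choice under debate is exactly this seam: keeping the path-reading branch behind a pluggable interface is what would let users swap HDFS for S3 (or skip the filesystem entirely) without touching the rest of the pipeline.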