Hi Nate,

Thanks for starting this. I do feel strongly that it's best to keep the workers in a JVM-based language like Scala. I don't have a strong feeling about the master, but for the reasons others pointed out above, it may make sense to keep the master in Python.
So, Option C would be the vote from me as well.

On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

> Option C is to use python on the front end of ingest pipeline and
> spark/scala on the back end.
>
> Option A uses python workers on the backend
>
> Option B uses all scala.
>
> -----Original Message-----
> From: kant kodali [mailto:[email protected]]
> Sent: Friday, April 14, 2017 9:53 AM
> To: [email protected]
> Subject: Re: [Discuss] - Future plans for Spot-ingest
>
> What is option C ? am I missing an email or something?
>
> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> [email protected]> wrote:
>
> > +1 for Python 3.x
> >
> > On 4/14/2017 11:59 AM, Austin Leahy wrote:
> >
> >> I think that C is the strong solution, getting the ingest really
> >> strong is going to lower barriers to adoption. Doing it in Python
> >> will open up the ingest portion of the project to include many more
> >> developers.
> >>
> >> Before it comes up I would like to throw the following on the pile...
> >> Major python projects django/flask, others are dropping 2.x support in
> >> releases scheduled in the next 6 to 8 months. Hadoop projects in
> >> general tend to lag in modern python support, let's please build this
> >> in 3.5 so that we don't have to immediately expect a rebuild in the
> >> pipeline.
> >>
> >> -Vote C
> >>
> >> Thanks Nate
> >>
> >> Austin
> >>
> >> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
> >>
> >>> I really like option C because it gives a lot of flexibility for
> >>> ingest (python vs scala) but still has the robust spark streaming
> >>> backend for performance.
> >>>
> >>> Thanks for putting this together Nate.
> >>>
> >>> Alan
> >>>
> >>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> >>> [email protected]> wrote:
> >>>
> >>>> I agree. We should continue making the existing stack more mature at
> >>>> this point.
> >>>> Maybe if we have enough community support we can add
> >>>> additional datastores.
> >>>>
> >>>> Chokha.
> >>>>
> >>>> On 4/14/17 11:10 AM, [email protected] wrote:
> >>>>
> >>>>> Hi Kant,
> >>>>>
> >>>>> YARN is the standard scheduler in Hadoop. If you're using
> >>>>> Hive+Spark, then sure you'll have YARN.
> >>>>>
> >>>>> Haven't seen any HIVE on Mesos so far. As said, Spot is based on a
> >>>>> quite standard Hadoop stack and I wouldn't switch too many pieces
> >>>>> yet.
> >>>>>
> >>>>> In most Opensource projects you start relying on a well-known
> >>>>> stack and then you begin to support other DB backends once it's
> >>>>> quite mature. Think of the loads of LAMP apps which haven't been
> >>>>> ported away from MySQL yet.
> >>>>>
> >>>>> In any case, you'll need a high performance SQL + Massive Storage
> >>>>> + Machine Learning + Massive Ingestion, and... ATM, that can be
> >>>>> only provided by Hadoop.
> >>>>>
> >>>>> Regards!
> >>>>>
> >>>>> Kenneth
> >>>>>
> >>>>> On 2017-04-14 12:56, kant kodali wrote:
> >>>>>
> >>>>>> Hi Kenneth,
> >>>>>>
> >>>>>> Thanks for the response. I think you made a case for HDFS,
> >>>>>> however users may want to use S3 or some other FS, in which case
> >>>>>> they can use Auxilio (hoping that there are no changes needed
> >>>>>> within Spot, in which case I can agree to that). For example,
> >>>>>> Netflix stores all their data in S3.
> >>>>>>
> >>>>>> The distributed sql query engine, I would say, should be pluggable
> >>>>>> with whatever the user may want to use, and there are a bunch of
> >>>>>> them out there. Sure, Impala is better than Hive, but what if
> >>>>>> users are already using something else like Drill or Presto?
> >>>>>>
> >>>>>> Me personally, I would not assume that users are willing to deploy
> >>>>>> all of that and make their existing stack more complicated; at the
> >>>>>> very least I would say it is an uphill battle. Things have been
> >>>>>> changing rapidly in the Big data space, so whatever we think is
> >>>>>> standard won't be standard anymore, but more importantly there
> >>>>>> shouldn't be any reason why we shouldn't be flexible, right?
> >>>>>>
> >>>>>> Also I am not sure why only YARN? Why not make that also more
> >>>>>> flexible so users can pick Mesos or standalone?
> >>>>>>
> >>>>>> I think flexibility is key for wide adoption, rather than the
> >>>>>> tightly coupled architecture.
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> PS: you need a big data platform to be able to collect all those
> >>>>>>> netflows and logs.
> >>>>>>>
> >>>>>>> Spot isn't intended for SMBs, that's clear, then you need loads
> >>>>>>> of data to get ML working properly, and somewhere to run those
> >>>>>>> algorithms. That is Hadoop.
> >>>>>>>
> >>>>>>> Regards!
> >>>>>>>
> >>>>>>> Kenneth
> >>>>>>>
> >>>>>>> Sent from my Mi phone
> >>>>>>> On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Thanks for starting this thread. Here is my feedback.
> >>>>>>>
> >>>>>>> I somehow think the architecture is too complicated for wide
> >>>>>>> adoption since it requires installing the following.
> >>>>>>>
> >>>>>>> HDFS.
> >>>>>>> HIVE.
> >>>>>>> IMPALA.
> >>>>>>> KAFKA.
> >>>>>>> SPARK (YARN).
> >>>>>>> YARN.
> >>>>>>> Zookeeper.
> >>>>>>>
> >>>>>>> Currently there are way too many dependencies, which discourages
> >>>>>>> a lot of users from using it because they have to go through
> >>>>>>> deployment of all that required software. I think for wide
> >>>>>>> adoption we should minimize the dependencies and have a more
> >>>>>>> pluggable architecture. For example, I am not sure why HIVE &
> >>>>>>> IMPALA are both required? Why not just use Spark SQL, since it's
> >>>>>>> already a dependency? Or say users may want to use their own
> >>>>>>> distributed query engine they like, such as Apache Drill or
> >>>>>>> something else. We should be flexible enough to provide that
> >>>>>>> option.
> >>>>>>>
> >>>>>>> Also, I see that HDFS is used such that collectors can receive
> >>>>>>> file paths through Kafka and be able to read a file. How big
> >>>>>>> are these files? Do we really need HDFS for this? Why not
> >>>>>>> provide more ways to send data, such as sending data directly
> >>>>>>> through Kafka, or say just leaving it up to the user to specify
> >>>>>>> the file location as an argument to the collector process?
> >>>>>>>
> >>>>>>> Finally, I learnt that to generate Net flow data one would
> >>>>>>> require specific hardware. This really means Apache Spot is
> >>>>>>> not meant for everyone. I thought Apache Spot could be used to
> >>>>>>> analyze the network traffic of any machine, but if it requires
> >>>>>>> specific hardware then I think it is targeted at a specific
> >>>>>>> group of people.
> >>>>>>>
> >>>>>>> The real strength of Apache Spot should mainly be just analyzing
> >>>>>>> network traffic through ML.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Thanks, Nate,
> >>>>>>>>
> >>>>>>>> Nate.
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Nate Smith [mailto:[email protected]]
> >>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> >>>>>>>> To: [email protected]
> >>>>>>>> Cc: [email protected]; [email protected]
> >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> >>>>>>>>
> >>>>>>>> I was really hoping it came through ok, Oh well :) Here’s an
> >>>>>>>> image form:
> >>>>>>>> http://imgur.com/a/DUDsD
> >>>>>>>>
> >>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
> >>>>>>>> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> The diagram became garbled in the text format.
> >>>>>>>>> Could you resend it as a pdf?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Nate
> >>>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> >>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> >>>>>>>>> To: [email protected]; [email protected];
> >>>>>>>>> [email protected]
> >>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> >>>>>>>>>
> >>>>>>>>> How would you like to see Spot-ingest change?
> >>>>>>>>>
> >>>>>>>>> A. continue development on the Python Master/Worker with focus
> >>>>>>>>>    on performance / error handling / logging
> >>>>>>>>> B. Develop Scala based ingest to be inline with code base from
> >>>>>>>>>    ingest, ml, to OA (UI to continue being ipython/JS)
> >>>>>>>>> C. Python ingest Worker with Scala based Spark code for
> >>>>>>>>>    normalization and input into DB
> >>>>>>>>>
> >>>>>>>>> Including the high level diagram:
> >>>>>>>>>
> >>>>>>>>> [ASCII diagram garbled in the text format. Boxes shown:
> >>>>>>>>>  Master (A. Python, B. Scala, C. Python), with the note
> >>>>>>>>>  "Running on a worker node in the Hadoop cluster";
> >>>>>>>>>  Worker (A. Python); Spark Streaming (B. Scala, C. Scala);
> >>>>>>>>>  Local FS (Binary/Text Log files); hdfs;
> >>>>>>>>>  Hive / Impala (Parquet).]
> >>>>>>>>>
> >>>>>>>>> Please let me know your thoughts,
> >>>>>>>>>
> >>>>>>>>> - Nathanael
