Without having given it too terribly much thought, that seems like an OK approach.
Michael

On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:

> I think the question is rather whether we can write the data generically to HDFS as parquet without the use of hive/impala.
>
> Today we write parquet data using the hive/mapreduce method. As part of the redesign I'd like to use libraries for this as opposed to a hadoop dependency. I think it would be preferred to use the python master to write the data into the format we want, then do normalization of the data in spark streaming. Any thoughts?
>
> - Nathanael
>
>> On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:
>>
>> I had thought that the plan was to write the data in Parquet in HDFS ultimately.
>>
>> Michael
>>
>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> Thank you so much for hearing my argument, and I definitely understand that you guys have a bunch of things to do. My only concern is that I hope it doesn't take too long to support other backends. For example, @Kenneth gave the example of LAMP stacks that have not moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from python multiprocessing or Spark Streaming are written back to HDFS. If so, can we write them in parquet format, so that users are able to plug in any query engine? Again, I am not pushing you guys to do this right away; I am just seeing if there is a way for me to get started in parallel. If it's not feasible, that's fine. I just wanted to share my 2 cents, and I am glad my argument is heard!
>>>
>>> Thanks much!
>>>
>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
>>>
>>>> Hi Kant,
>>>> Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.
>>>>
>>>> The design can and should be pluggable, but the project has one stack it ships with out of the box, one stack that's the default in the sense that it's the most tested and so on. And, for us, that's our current stack. If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable so Hive could run on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came along, re-used some of that pluggability, and even added some more so that Hive-on-Spark could become a reality. In the same way, I don't think anyone here disagrees that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use case in mind.
>>>>
>>>> As a project, we have many things to do, and I personally think the biggest bang for the buck in making Spot a really solid and the best cyber security solution isn't pluggability but the things we are working on: a better user interface, a common/unified approach to storing and modeling data, etc.
>>>>
>>>> Having said that, we are open; if it's important to you or someone else, we'd be happy to receive and review those patches.
>>>>
>>>> Thanks!
>>>> Mark
>>>>
>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:
>>>>
>>>>> Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.
>>>>>
>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:
>>>>>
>>>>>> Option C is to use python on the front end of the ingest pipeline and spark/scala on the back end.
>>>>>> Option A uses python workers on the backend.
>>>>>>
>>>>>> Option B uses all scala.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: kant kodali [mailto:[email protected]]
>>>>>> Sent: Friday, April 14, 2017 9:53 AM
>>>>>> To: [email protected]
>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>
>>>>>> What is option C? Am I missing an email or something?
>>>>>>
>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:
>>>>>>
>>>>>>> +1 for Python 3.x
>>>>>>>
>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>>>>>>>
>>>>>>>> I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption. Doing it in Python will open up the ingest portion of the project to many more developers.
>>>>>>>>
>>>>>>>> Before it comes up, I would like to throw the following on the pile... Major python projects (django/flask, among others) are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern python support, so let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.
>>>>>>>>
>>>>>>>> -Vote C
>>>>>>>>
>>>>>>>> Thanks Nate
>>>>>>>>
>>>>>>>> Austin
>>>>>>>>
>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM, Alan Ross <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I really like option C because it gives a lot of flexibility for ingest (python vs scala) but still has the robust spark streaming backend for performance.
>>>>>>>>>
>>>>>>>>> Thanks for putting this together Nate.
>>>>>>>>>
>>>>>>>>> Alan
>>>>>>>>>
>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I agree.
>>>>>>>>>> We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.
>>>>>>>>>>
>>>>>>>>>> Chokha.
>>>>>>>>>>
>>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Kant,
>>>>>>>>>>>
>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using Hive+Spark, then sure, you'll have YARN.
>>>>>>>>>>>
>>>>>>>>>>> I haven't seen any Hive on Mesos so far. As I said, Spot is based on a quite standard Hadoop stack and I wouldn't switch too many pieces yet.
>>>>>>>>>>>
>>>>>>>>>>> In most open-source projects you start by relying on a well-known stack, and then you begin to support other DB backends once the project is quite mature. Think of the loads of LAMP apps which haven't been ported away from MySQL yet.
>>>>>>>>>>>
>>>>>>>>>>> In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and at the moment that can only be provided by Hadoop.
>>>>>>>>>>>
>>>>>>>>>>> Regards!
>>>>>>>>>>>
>>>>>>>>>>> Kenneth
>>>>>>>>>>>
>>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Kenneth,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.
>>>>>>>>>>>>
>>>>>>>>>>>> The distributed SQL query engine, I would say, should be pluggable with whatever the user may want to use, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else, like Drill or Presto?
>>>>>>>>>>>>
>>>>>>>>>>>> Personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least I would say it is an uphill battle. Things have been changing rapidly in the Big Data space, so whatever we think is standard won't be standard anymore; more importantly, there shouldn't be any reason why we shouldn't be flexible, right?
>>>>>>>>>>>>
>>>>>>>>>>>> Also, I am not sure why only YARN? Why not make that more flexible too, so users can pick Mesos or standalone?
>>>>>>>>>>>>
>>>>>>>>>>>> I think flexibility is the key for wide adoption, rather than a tightly coupled architecture.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> PS: you need a big data platform to be able to collect all those netflows and logs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kenneth
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from my Mi phone
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 14, 2017, at 4:04 AM, kant kodali <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I somehow think the architecture is too complicated for wide adoption, since it requires installing the following:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HDFS
>>>>>>>>>>>>>> Hive
>>>>>>>>>>>>>> Impala
>>>>>>>>>>>>>> Kafka
>>>>>>>>>>>>>> Spark (on YARN)
>>>>>>>>>>>>>> YARN
>>>>>>>>>>>>>> Zookeeper
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently there are way too many dependencies, which discourages a lot of users, because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why both Hive and Impala are required. Why not just use Spark SQL, since it's already a dependency, or let users bring whatever distributed query engine they like, such as Apache Drill or something else? We should be flexible enough to provide that option.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors can receive file paths through Kafka and then read the files. How big are these files? Do we really need HDFS for this? Why not provide more ways to send data, such as sending data directly through Kafka, or just leaving it up to the user to specify the file location as an argument to the collector process?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Finally, I learnt that to generate netflow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be just analyzing network traffic through ML.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, Nate,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nate.
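The suggestion above of sending data directly through Kafka, rather than only passing HDFS file paths, can be sketched as a small message-envelope convention on the collector topic. Everything below (field names, message shapes, helper functions) is a hypothetical illustration, not Spot's actual wire format:

```python
import base64
import json

# Two message shapes a collector could accept on its Kafka topic:
# today's "here is a file path" message, and an alternative that
# carries a small payload inline so no shared filesystem is needed.

def make_path_message(path):
    """Envelope telling a worker where to fetch a (large) file."""
    return json.dumps({"type": "path", "location": path})

def make_inline_message(raw_bytes):
    """Envelope carrying a small payload directly in the message."""
    return json.dumps({
        "type": "inline",
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    })

def handle_message(msg):
    """Worker-side dispatch: return the raw bytes to normalize."""
    envelope = json.loads(msg)
    if envelope["type"] == "inline":
        return base64.b64decode(envelope["data"])
    if envelope["type"] == "path":
        # In the real pipeline this would be an HDFS (or local FS) read.
        with open(envelope["location"], "rb") as f:
            return f.read()
    raise ValueError("unknown message type: %s" % envelope["type"])
```

Either shape travels over the same topic, so the worker code stays the same whether or not HDFS is in the picture; the trade-off is Kafka's per-message size limits for the inline case.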
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]]
>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>> Cc: [email protected]; [email protected]
>>>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I was really hoping it came through ok. Oh well :) Here's an image version: http://imgur.com/a/DUDsD
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The diagram became garbled in the text format. Could you resend it as a pdf?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Nate
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>>>>>>>>>>>>>>> To: [email protected]; [email protected]; [email protected]
>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker with a focus on performance / error handling / logging.
>>>>>>>>>>>>>>>> B. Develop a Scala-based ingest to be in line with the code base from ingest and ML to OA (the UI to continue being ipython/JS).
>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark code for normalization and input into the DB.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Including the high level diagram:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [ASCII architecture diagram, garbled in transit; an image version was later posted at http://imgur.com/a/DUDsD. As far as it can be recovered, it shows a Master (A. Python / B. Scala / C. Python), running on a worker node in the Hadoop cluster, feeding Workers (A. Python; B. and C. Scala, via Spark Streaming), with binary/text log files on the local FS on the input side and HDFS (Parquet) plus Hive / Impala on the output side.]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please let me know your thoughts,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Nathanael
>>
>> --
>> Michael Ridley <[email protected]>
>> office: (650) 352-1337
>> mobile: (571) 438-2420
>> Senior Solutions Architect
>> Cloudera, Inc.
--
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
