I had thought that the plan was to write the data in Parquet in HDFS ultimately.
Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

> Hi Mark,
>
> Thank you so much for hearing my argument. And I definitely understand that
> you guys have a bunch of things to do. My only concern is that I hope it
> doesn't take too long to support other backends. For example, @Kenneth gave
> the example of LAMP stack apps that have not moved away from MySQL yet, which
> essentially means it's probably a decade? I see that in the current
> architecture the results from Python multiprocessing or Spark
> Streaming are written back to HDFS. If so, can we write them in Parquet
> format, such that users are able to plug in any query engine? But
> again, I am not pushing you guys to do this right away or anything, just
> seeing if there is a way for me to get started in parallel. If that's not
> feasible, it's fine. I just wanted to share my 2 cents, and I am glad my
> argument is heard!
>
> Thanks much!
>
> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
>
> > Hi Kant,
> > Just wanted to make sure you don't feel like we are ignoring your
> > comment :-) I hear you about pluggability.
> >
> > The design can and should be pluggable, but the project has one stack it
> > ships out of the box with, one stack that's the default stack in the sense
> > that it's the most tested and so on. And, for us, that's our current stack.
> > If we were to take Apache Hive as an example, it shipped (and ships) with
> > MapReduce as the default execution engine. At some point, Apache Tez
> > came along and wanted Hive to run on Tez, so they made a bunch of things
> > pluggable to run Hive on Tez (instead of the only option up until then:
> > Hive-on-MR), and then Apache Spark came and re-used some of that
> > pluggability and even added some more so Hive-on-Spark could become a
> > reality.
> > In the same way, I don't think anyone disagrees here that
> > pluggability is a good thing, but it's hard to do pluggability right, and at
> > the right level, unless one has a clear use case in mind.
> >
> > As a project, we have many things to do, and I personally think the biggest
> > bang for the buck for us in making Spot a really solid and the best cyber
> > security solution isn't pluggability but the things we are working on: a
> > better user interface, a common/unified approach to storing and modeling
> > data, etc.
> >
> > Having said that, we are open: if it's important to you or someone else,
> > we'd be happy to receive and review those patches.
> >
> > Thanks!
> > Mark
> >
> > On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:
> >
> > > Thanks Ross! And yes, option C sounds good to me as well; however, I just
> > > think the distributed SQL query engine and the resource manager should be
> > > pluggable.
> > >
> > > On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:
> > >
> > > > Option C is to use Python on the front end of the ingest pipeline and
> > > > Spark/Scala on the back end.
> > > >
> > > > Option A uses Python workers on the back end.
> > > >
> > > > Option B uses all Scala.
> > > >
> > > > -----Original Message-----
> > > > From: kant kodali [mailto:[email protected]]
> > > > Sent: Friday, April 14, 2017 9:53 AM
> > > > To: [email protected]
> > > > Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > >
> > > > What is option C? Am I missing an email or something?
> > > >
> > > > On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:
> > > >
> > > > > +1 for Python 3.x
> > > > >
> > > > > On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > > > >
> > > > >> I think that C is the strong solution; getting the ingest really
> > > > >> strong is going to lower barriers to adoption.
> > > > >> Doing it in Python
> > > > >> will open up the ingest portion of the project to include many more
> > > > >> developers.
> > > > >>
> > > > >> Before it comes up, I would like to throw the following on the pile...
> > > > >> Major Python projects (Django, Flask, and others) are dropping 2.x
> > > > >> support in releases scheduled in the next 6 to 8 months. Hadoop projects in
> > > > >> general tend to lag in modern Python support; let's please build this
> > > > >> in 3.5 so that we don't have to immediately expect a rebuild in the
> > > > >> pipeline.
> > > > >>
> > > > >> -Vote C
> > > > >>
> > > > >> Thanks Nate
> > > > >>
> > > > >> Austin
> > > > >>
> > > > >> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
> > > > >>
> > > > >>> I really like option C because it gives a lot of flexibility for
> > > > >>> ingest (Python vs Scala) but still has the robust Spark Streaming
> > > > >>> backend for performance.
> > > > >>>
> > > > >>> Thanks for putting this together Nate.
> > > > >>>
> > > > >>> Alan
> > > > >>>
> > > > >>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:
> > > > >>>
> > > > >>>> I agree. We should continue making the existing stack more mature at
> > > > >>>> this point. Maybe if we have enough community support we can add
> > > > >>>> additional datastores.
> > > > >>>>
> > > > >>>> Chokha.
> > > > >>>>
> > > > >>>> On 4/14/17 11:10 AM, [email protected] wrote:
> > > > >>>>
> > > > >>>>> Hi Kant,
> > > > >>>>>
> > > > >>>>> YARN is the standard scheduler in Hadoop. If you're using
> > > > >>>>> Hive+Spark, then sure you'll have YARN.
> > > > >>>>>
> > > > >>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on a
> > > > >>>>> quite standard Hadoop stack and I wouldn't switch too many pieces yet.
> > > > >>>>>
> > > > >>>>> In most open source projects you start by relying on a well-known
> > > > >>>>> stack and then you begin to support other DB backends once it's
> > > > >>>>> quite mature. Think of the loads of LAMP apps which haven't been
> > > > >>>>> ported away from MySQL yet.
> > > > >>>>>
> > > > >>>>> In any case, you'll need high-performance SQL + massive storage
> > > > >>>>> + machine learning + massive ingestion, and... ATM, that can
> > > > >>>>> only be provided by Hadoop.
> > > > >>>>>
> > > > >>>>> Regards!
> > > > >>>>>
> > > > >>>>> Kenneth
> > > > >>>>>
> > > > >>>>> On 2017-04-14 12:56, kant kodali wrote:
> > > > >>>>>
> > > > >>>>>> Hi Kenneth,
> > > > >>>>>>
> > > > >>>>>> Thanks for the response. I think you made a case for HDFS;
> > > > >>>>>> however, users may want to use S3 or some other FS, in which case
> > > > >>>>>> they can use Alluxio (hoping that there are no changes needed
> > > > >>>>>> within Spot, in which case I can agree to that). For example,
> > > > >>>>>> Netflix stores all their data in S3.
> > > > >>>>>>
> > > > >>>>>> The distributed SQL query engine, I would say, should be pluggable
> > > > >>>>>> with whatever users may want to use, and there are a bunch of them
> > > > >>>>>> out there. Sure, Impala is better than Hive, but what if users are
> > > > >>>>>> already using something else like Drill or Presto?
> > > > >>>>>>
> > > > >>>>>> Me personally, I would not assume that users are willing to deploy
> > > > >>>>>> all of that and make their existing stack more complicated. At the
> > > > >>>>>> very least I would say it is an uphill battle.
> > > > >>>>>> Things have been changing rapidly in the
> > > > >>>>>> big data space, so whatever we think is standard won't be standard
> > > > >>>>>> anymore, but importantly there shouldn't be any reason why we
> > > > >>>>>> shouldn't be flexible, right?
> > > > >>>>>>
> > > > >>>>>> Also, I am not sure why only YARN? Why not make that more
> > > > >>>>>> flexible too, so users can pick Mesos or standalone?
> > > > >>>>>>
> > > > >>>>>> I think flexibility is key for wide adoption, rather than a
> > > > >>>>>> tightly coupled architecture.
> > > > >>>>>>
> > > > >>>>>> Thanks!
> > > > >>>>>>
> > > > >>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:
> > > > >>>>>>
> > > > >>>>>>> PS: you need a big data platform to be able to collect all those
> > > > >>>>>>> netflows and logs.
> > > > >>>>>>>
> > > > >>>>>>> Spot isn't intended for SMBs, that's clear; you need loads
> > > > >>>>>>> of data to get ML working properly, and somewhere to run those
> > > > >>>>>>> algorithms. That is Hadoop.
> > > > >>>>>>>
> > > > >>>>>>> Regards!
> > > > >>>>>>>
> > > > >>>>>>> Kenneth
> > > > >>>>>>>
> > > > >>>>>>> Sent from my Mi phone
> > > > >>>>>>> On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM wrote:
> > > > >>>>>>>
> > > > >>>>>>> Hi,
> > > > >>>>>>>
> > > > >>>>>>> Thanks for starting this thread. Here is my feedback.
> > > > >>>>>>>
> > > > >>>>>>> I somehow think the architecture is too complicated for wide
> > > > >>>>>>> adoption since it requires installing the following:
> > > > >>>>>>>
> > > > >>>>>>> HDFS.
> > > > >>>>>>> HIVE.
> > > > >>>>>>> IMPALA.
> > > > >>>>>>> KAFKA.
> > > > >>>>>>> SPARK (YARN).
> > > > >>>>>>> YARN.
> > > > >>>>>>> Zookeeper.
> > > > >>>>>>>
> > > > >>>>>>> Currently there are way too many dependencies, which discourages a
> > > > >>>>>>> lot of users from using it because they have to go through
> > > > >>>>>>> deployment of all that required software. I think for wide
> > > > >>>>>>> adoption we should minimize the dependencies and have a more
> > > > >>>>>>> pluggable architecture. For example, I am not sure why Hive and
> > > > >>>>>>> Impala are both required. Why not just use Spark SQL, since
> > > > >>>>>>> it's already a dependency? Or, say, users may want to use their own
> > > > >>>>>>> distributed query engine, such as Apache Drill or
> > > > >>>>>>> something else. We should be flexible enough to provide that
> > > > >>>>>>> option.
> > > > >>>>>>>
> > > > >>>>>>> Also, I see that HDFS is used such that collectors can receive
> > > > >>>>>>> file paths through Kafka and be able to read a file. How big
> > > > >>>>>>> are these files? Do we really need HDFS for this? Why not provide
> > > > >>>>>>> more ways to send data, such as sending data directly through Kafka
> > > > >>>>>>> or, say, just leaving it up to the user to specify the file
> > > > >>>>>>> location as an argument to the collector process?
> > > > >>>>>>>
> > > > >>>>>>> Finally, I learnt that to generate NetFlow data one would
> > > > >>>>>>> require specific hardware. This really means Apache Spot is
> > > > >>>>>>> not meant for everyone.
> > > > >>>>>>> I thought Apache Spot could be used to analyze the network traffic
> > > > >>>>>>> of any machine, but if it requires specific hardware then I think
> > > > >>>>>>> it is targeted at a specific group of people.
> > > > >>>>>>>
> > > > >>>>>>> The real strength of Apache Spot should mainly be just analyzing
> > > > >>>>>>> network traffic through ML.
> > > > >>>>>>>
> > > > >>>>>>> Thanks!
> > > > >>>>>>>
> > > > >>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Thanks, Nate,
> > > > >>>>>>>>
> > > > >>>>>>>> Nate.
> > > > >>>>>>>>
> > > > >>>>>>>> -----Original Message-----
> > > > >>>>>>>> From: Nate Smith [mailto:[email protected]]
> > > > >>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > > > >>>>>>>> To: [email protected]
> > > > >>>>>>>> Cc: [email protected]; [email protected]
> > > > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > > >>>>>>>>
> > > > >>>>>>>> I was really hoping it came through ok. Oh well :) Here’s an
> > > > >>>>>>>> image form:
> > > > >>>>>>>> http://imgur.com/a/DUDsD
> > > > >>>>>>>>
> > > > >>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> The diagram became garbled in the text format.
> > > > >>>>>>>>> Could you resend it as a pdf?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks,
> > > > >>>>>>>>> Nate
> > > > >>>>>>>>>
> > > > >>>>>>>>> -----Original Message-----
> > > > >>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> > > > >>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > > > >>>>>>>>> To: [email protected]; [email protected]; [email protected]
> > > > >>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > > > >>>>>>>>>
> > > > >>>>>>>>> How would you like to see Spot-ingest change?
> > > > >>>>>>>>>
> > > > >>>>>>>>> A. Continue development on the Python Master/Worker with focus
> > > > >>>>>>>>> on performance / error handling / logging
> > > > >>>>>>>>> B. Develop Scala based ingest to be inline with code base from
> > > > >>>>>>>>> ingest, ml, to OA (UI to continue being ipython/JS)
> > > > >>>>>>>>> C. Python ingest Worker with Scala based Spark code for
> > > > >>>>>>>>> normalization and input into DB
> > > > >>>>>>>>>
> > > > >>>>>>>>> Including the high level diagram:
> > > > >>>>>>>>>
> > > > >>>>>>>>> [ASCII diagram garbled in the text format. Recoverable labels:
> > > > >>>>>>>>> Master (A. Python, B. Scala, C. Python), with the note "Running
> > > > >>>>>>>>> on a worker node in the Hadoop cluster"; Worker (A. Python);
> > > > >>>>>>>>> Spark Streaming (B. Scala, C. Scala); Local FS (Binary/Text
> > > > >>>>>>>>> Log files); hdfs; Hive / Impala (Parquet).]
> > > > >>>>>>>>>
> > > > >>>>>>>>> Please let me know your thoughts,
> > > > >>>>>>>>>
> > > > >>>>>>>>> - Nathanael

--
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
