Re: [Discuss] - Future plans for Spot-ingest

Michael Ridley Mon, 17 Apr 2017 11:09:06 -0700

I had thought that the plan was to write the data in Parquet in HDFS
ultimately.


Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

> Hi Mark,
>
> Thank you so much for hearing my argument. I definitely understand that
> you guys have a bunch of things to do. My only concern is that I hope it
> doesn't take too long to support other backends. For example, @Kenneth
> gave the example of LAMP stack apps that have not moved away from MySQL yet,
> which essentially means it's probably a decade? I see that in the current
> architecture the results from Python multiprocessing or Spark
> Streaming are written back to HDFS. If so, can we write them in Parquet
> format, so that users can plug in any query engine? Again, I am not
> pushing you guys to do this right away or anything; I am just seeing if
> there is a way for me to get started in parallel. If that's not feasible,
> that's fine. I just wanted to share my 2 cents, and I am glad my
> argument is heard!
>
> Thanks much!
>
> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
>
> > Hi Kant,
> > Just wanted to make sure you don't feel like we are ignoring your
> > comment:-) I hear you about pluggability.
> >
> > The design can and should be pluggable, but the project has one stack it
> > ships out of the box with, one stack that's the default stack in the sense
> > that it's the most tested and so on. And, for us, that's our current stack.
> > If we were to take Apache Hive as an example, it shipped (and ships) with
> > MapReduce as the default execution engine. At some point, Apache Tez
> > came along and wanted Hive to run on Tez, so they made a bunch of things
> > pluggable to run Hive on Tez (instead of the only option up until then:
> > Hive-on-MR), and then Apache Spark came and re-used some of that
> > pluggability and even added some more so Hive-on-Spark could become a
> > reality. In the same way, I don't think anyone here disagrees that
> > pluggability is a good thing, but it's hard to do pluggability right, and at
> > the right level, unless one has a clear use case in mind.
> >
> > As a project, we have many things to do, and I personally think the biggest
> > bang for the buck for us in making Spot a really solid and the best cyber
> > security solution isn't pluggability but the things we are working on: a
> > better user interface, a common/unified approach to storing and modeling
> > data, etc.
> >
> > Having said that, we are open, if it's important to you or someone else,
> > we'd be happy to receive and review those patches.
> >
> > Thanks!
> > Mark
> >
> > On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:
> >
> > > Thanks Ross! And yes, option C sounds good to me as well; however, I just
> > > think the distributed SQL query engine and the resource manager should be
> > > pluggable.
> > >
> > > On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]>
> > > wrote:
> > >
> > > > > Option C is to use Python on the front end of the ingest pipeline and
> > > > > Spark/Scala on the back end.
> > > > >
> > > > > Option A uses Python workers on the back end.
> > > > >
> > > > > Option B uses all Scala.
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: kant kodali [mailto:[email protected]]
> > > > Sent: Friday, April 14, 2017 9:53 AM
> > > > To: [email protected]
> > > > Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > >
> > > > > What is option C? Am I missing an email or something?
> > > >
> > > > On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> > > > [email protected]> wrote:
> > > >
> > > > > +1 for Python 3.x
> > > > >
> > > > >
> > > > >
> > > > > On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > > > >
> > > > > >> I think that C is the strong solution. Getting the ingest really
> > > > > >> strong is going to lower barriers to adoption. Doing it in Python
> > > > > >> will open up the ingest portion of the project to many more
> > > > > >> developers.
> > > > > >>
> > > > > >> Before it comes up, I would like to throw the following on the pile:
> > > > > >> major Python projects (Django, Flask, and others) are dropping 2.x
> > > > > >> support in releases scheduled in the next 6 to 8 months. Hadoop
> > > > > >> projects in general tend to lag in modern Python support, so let's
> > > > > >> please build this in 3.5 so that we don't have to immediately expect
> > > > > >> a rebuild in the pipeline.
> > > > >>
> > > > >> -Vote C
> > > > >>
> > > > >> Thanks Nate
> > > > >>
> > > > >> Austin
> > > > >>
> > > > > >> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
> > > > >>
> > > > > >>> I really like option C because it gives a lot of flexibility for
> > > > > >>> ingest (Python vs. Scala) but still has the robust Spark Streaming
> > > > > >>> backend for performance.
> > > > >>>
> > > > >>> Thanks for putting this together Nate.
> > > > >>>
> > > > >>> Alan
> > > > >>>
> > > > >>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> > > > >>> [email protected]> wrote:
> > > > >>>
> > > > > >>>> I agree. We should continue making the existing stack more mature
> > > > > >>>> at this point. Maybe if we have enough community support we can add
> > > > > >>>> additional datastores.
> > > > >>>>
> > > > >>>> Chokha.
> > > > >>>>
> > > > >>>>
> > > > >>>> On 4/14/17 11:10 AM, [email protected] wrote:
> > > > >>>>
> > > > >>>>> Hi Kant,
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> YARN is the standard scheduler in Hadoop. If you're using
> > > > >>>>> Hive+Spark, then sure you'll have YARN.
> > > > >>>>>
> > > > > >>>>> I haven't seen any Hive on Mesos so far. As said, Spot is based on
> > > > > >>>>> a quite standard Hadoop stack, and I wouldn't switch too many
> > > > > >>>>> pieces yet.
> > > > > >>>>>
> > > > > >>>>> In most open source projects you start by relying on a well-known
> > > > > >>>>> stack, and then you begin to support other DB backends once it's
> > > > > >>>>> quite mature. Think of the loads of LAMP apps which haven't been
> > > > > >>>>> ported away from MySQL yet.
> > > > > >>>>>
> > > > > >>>>> In any case, you'll need high-performance SQL + massive storage
> > > > > >>>>> + machine learning + massive ingestion, and, ATM, that can only
> > > > > >>>>> be provided by Hadoop.
> > > > >>>>>
> > > > >>>>> Regards!
> > > > >>>>>
> > > > >>>>> Kenneth
> > > > >>>>>
> > > > > >>>>> On 2017-04-14 12:56, kant kodali wrote:
> > > > >>>>>
> > > > >>>>>> Hi Kenneth,
> > > > >>>>>>
> > > > > >>>>>> Thanks for the response. I think you made a case for HDFS;
> > > > > >>>>>> however, users may want to use S3 or some other FS, in which case
> > > > > >>>>>> they can use Alluxio (hoping that there are no changes needed
> > > > > >>>>>> within Spot, in which case I can agree to that). For example,
> > > > > >>>>>> Netflix stores all their data in S3.
> > > > > >>>>>>
> > > > > >>>>>> The distributed SQL query engine, I would say, should be pluggable
> > > > > >>>>>> with whatever the user may want to use, and there are a bunch of
> > > > > >>>>>> them out there. Sure, Impala is better than Hive, but what if users
> > > > > >>>>>> are already using something else like Drill or Presto?
> > > > > >>>>>>
> > > > > >>>>>> Personally, I would not assume that users are willing to deploy
> > > > > >>>>>> all of that and make their existing stack more complicated. At the
> > > > > >>>>>> very least, I would say it is an uphill battle. Things have been
> > > > > >>>>>> changing rapidly in the big data space, so whatever we think is
> > > > > >>>>>> standard won't be standard anymore, but importantly there shouldn't
> > > > > >>>>>> be any reason why we shouldn't be flexible, right?
> > > > > >>>>>>
> > > > > >>>>>> Also, I am not sure why only YARN. Why not make that more
> > > > > >>>>>> flexible too, so users can pick Mesos or standalone?
> > > > > >>>>>>
> > > > > >>>>>> I think flexibility is key for wide adoption, rather than the
> > > > > >>>>>> tightly coupled architecture.
> > > > >>>>>>
> > > > >>>>>> Thanks!
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> > > > >>>>>> <[email protected]>
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > > >>>>>>> PS: you need a big data platform to be able to collect all
> > > > > >>>>>>> those netflows and logs.
> > > > > >>>>>>>
> > > > > >>>>>>> Spot isn't intended for SMBs, that's clear. You need loads
> > > > > >>>>>>> of data to get ML working properly, and somewhere to run those
> > > > > >>>>>>> algorithms. That is Hadoop.
> > > > >>>>>>>
> > > > >>>>>>> Regards!
> > > > >>>>>>>
> > > > >>>>>>> Kenneth
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Sent from my Mi phone
> > > > > >>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>> Hi,
> > > > >>>>>>>
> > > > >>>>>>> Thanks for starting this thread. Here is my feedback.
> > > > >>>>>>>
> > > > > >>>>>>> I think the architecture is too complicated for wide
> > > > > >>>>>>> adoption, since it requires installing all of the following:
> > > > >>>>>>>
> > > > >>>>>>> HDFS.
> > > > >>>>>>> HIVE.
> > > > >>>>>>> IMPALA.
> > > > >>>>>>> KAFKA.
> > > > >>>>>>> SPARK (YARN).
> > > > >>>>>>> YARN.
> > > > >>>>>>> Zookeeper.
> > > > >>>>>>>
> > > > > >>>>>>> Currently there are way too many dependencies, which discourages
> > > > > >>>>>>> a lot of users from using it because they have to go through
> > > > > >>>>>>> deployment of all that required software. I think for wide
> > > > > >>>>>>> adoption we should minimize the dependencies and have a more
> > > > > >>>>>>> pluggable architecture. For example, I am not sure why both Hive
> > > > > >>>>>>> and Impala are required. Why not just use Spark SQL, since it's
> > > > > >>>>>>> already a dependency? Or users may want to use their own
> > > > > >>>>>>> distributed query engine, such as Apache Drill or
> > > > > >>>>>>> something else. We should be flexible enough to provide that
> > > > > >>>>>>> option.
> > > > >>>>>>>
> > > > > >>>>>>> Also, I see that HDFS is used so that collectors can receive
> > > > > >>>>>>> file paths through Kafka and then read the file. How big
> > > > > >>>>>>> are these files? Do we really need HDFS for this? Why not provide
> > > > > >>>>>>> more ways to send data, such as sending data directly through Kafka
> > > > > >>>>>>> or just leaving it up to the user to specify the file location as an
> > > > > >>>>>>> argument to the collector process?
> > > > >>>>>>>
> > > > > >>>>>>> Finally, I learnt that generating NetFlow data requires
> > > > > >>>>>>> specific hardware. This really means Apache Spot is
> > > > > >>>>>>> not meant for everyone. I thought Apache Spot could be used to
> > > > > >>>>>>> analyze the network traffic of any machine, but if it requires
> > > > > >>>>>>> specific hardware then I think it is targeted at a
> > > > > >>>>>>> specific group of people.
> > > > >>>>>>>
> > > > > >>>>>>> The real strength of Apache Spot should mainly be just analyzing
> > > > > >>>>>>> network traffic through ML.
> > > > >>>>>>>
> > > > >>>>>>> Thanks!
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> > > > >>>>>>> [email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>> Thanks, Nate,
> > > > >>>>>>>>
> > > > >>>>>>>> Nate.
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> -----Original Message-----
> > > > >>>>>>>> From: Nate Smith [mailto:[email protected]]
> > > > >>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > > > >>>>>>>> To: [email protected]
> > > > > >>>>>>>> Cc: [email protected]; [email protected]
> > > > > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > > >>>>>>>>
> > > > > >>>>>>>> I was really hoping it came through OK. Oh well :) Here it is
> > > > > >>>>>>>> in image form:
> > > > > >>>>>>>> http://imgur.com/a/DUDsD
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > > >>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> The diagram became garbled in the text format.
> > > > >>>>>>>>> Could you resend it as a pdf?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks,
> > > > >>>>>>>>> Nate
> > > > >>>>>>>>>
> > > > >>>>>>>>> -----Original Message-----
> > > > >>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> > > > >>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > > > > >>>>>>>>> To: [email protected]; [email protected]; [email protected]
> > > > >>>>>>>>
> > > > >>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > > > >>>>>>>>>
> > > > >>>>>>>>> How would you like to see Spot-ingest change?
> > > > >>>>>>>>>
> > > > > >>>>>>>>> A. Continue development on the Python Master/Worker with focus on
> > > > > >>>>>>>>>    performance / error handling / logging
> > > > > >>>>>>>>> B. Develop Scala based ingest to be inline with code base from
> > > > > >>>>>>>>>    ingest, ml, to OA (UI to continue being ipython/JS)
> > > > > >>>>>>>>> C. Python ingest Worker with Scala based Spark code for
> > > > > >>>>>>>>>    normalization and input into DB
> > > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>> Including the high level diagram:
> > > > >>>>>>>>> +-----------------------------
> ------------------------------
> > > > >>>>>>>>>
> > > > >>>>>>>> -------------------------------+
> > > > >>>>>>>>
> > > > >>>>>>>>> | +--------------------------+
> > > > >>>>>>>>>
> > > > >>>>>>>> +-----------------+        |
> > > > >>>>>>>>
> > > > >>>>>>>>> | | Master                   |  A. B. C.
> > >   |
> > > > >>>>>>>>>
> > > > >>>>>>>> Worker          |        |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |    A. Python             +---------------+      A.
> > > > >>>>>>>>>
> > > > >>>>>>>> |   A.
> > > > >>>>>>>
> > > > >>>>>>>> Python     |        |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |    B. Scala              |               |
> > > +------------->
> > > > >>>>>>>>>
> > > > >>>>>>>>           +----+   |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |    C. Python             |               |    |
> > >  |
> > > > >>>>>>>>>
> > > > >>>>>>>>           |    |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> | +---^------+---------------+               |    |
> > > > >>>>>>>>>
> > > > >>>>>>>>   +-----------------+    |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |     |      |                               |    |
> > > > >>>>>>>>>
> > > > >>>>>>>>                |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |     |      |                               |    |
> > > > >>>>>>>>>
> > > > >>>>>>>>                |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |     |     +Note--------------+             |    |
> > > > >>>>>>>>>
> > > > >>>>>>>>   +-----------------+    |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |     |     |Running on a      |             |    |
> > >  |
> > > > >>>>>>>>>
> > > > >>>>>>>> Spark
> > > > >>>>>>>
> > > > >>>>>>>> Streaming |    |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |     |     |worker node in    |             |    |      B.
> > C.
> > > > >>>>>>>>>
> > > > >>>>>>>> | B.
> > > > >>>>>>>
> > > > >>>>>>>> Scala        |    |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |     |     |the Hadoop cluster|             |    |
> > > > >>>>>>>>>
> > > > >>>>>>>> +--------> C.
> > > > >>>>>>>
> > > > >>>>>>>> Scala        +-+  |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |     |     +------------------+             |    |    |
> > >   |
> > > > >>>>>>>>>
> > > > >>>>>>>>           | |  |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |   A.|                                      |    |    |
> > > > >>>>>>>>>
> > > > >>>>>>>> +-----------------+ |  |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |   B.|                                      |    |    |
> > > > >>>>>>>>>
> > > > >>>>>>>>              |  |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> |   C.|                                      |    |    |
> > > > >>>>>>>>>
> > > > >>>>>>>>              |  |   |
> > > > >>>>>>>>
> > > > >>>>>>>>> | +----------------------+          +-v------+----+----+-+
> > > > >>>>>>>>>
> > > > >>>>>>>>   +--------------v--v-+ |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |                      |          |
> > > > >>>>>>>>>
> > > > >>>>>>>> |           |
> > > > >>>>>>>
> > > > >>>>>>>>                   | |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |   Local FS:          |          |    hdfs
> > > > >>>>>>>>>
> > > > >>>>>>>> |           |
> > > > >>>>>>>
> > > > >>>>>>>> Hive / Impala    | |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |  - Binary/Text       |          |
> > > > >>>>>>>>>
> > > > >>>>>>>> |           |
> > > > >>>>>>>
> > > > >>>>>>>>   - Parquet -     | |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |    Log files -       |          |
> > > > >>>>>>>>>
> > > > >>>>>>>> |           |
> > > > >>>>>>>
> > > > >>>>>>>>                   | |
> > > > >>>>>>>>
> > > > >>>>>>>>> | |                      |          |
> > > > >>>>>>>>>
> > > > >>>>>>>> |           |
> > > > >>>>>>>
> > > > >>>>>>>>                   | |
> > > > >>>>>>>>
> > > > >>>>>>>>> | +----------------------+          +--------------------+
> > > > >>>>>>>>>
> > > > >>>>>>>>   +-------------------+ |
> > > > >>>>>>>>
> > > > >>>>>>>>> +-----------------------------
> ------------------------------
> > > > >>>>>>>>>
> > > > >>>>>>>> -------------------------------+
> > > > >>>>>>>>
> > > > >>>>>>>>> Please let me know your thoughts,
> > > > >>>>>>>>>
> > > > >>>>>>>>> - Nathanael
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>
> > > > >
> > > >
> > >
> >
>



-- 
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
