Re: [Discuss] - Future plans for Spot-ingest

Michael Ridley Fri, 14 Apr 2017 10:55:59 -0700

+1 (non-binding) for Option C

Michael


On Fri, Apr 14, 2017 at 1:50 PM, <[email protected]> wrote:

> +1
>
>
> On 2017-04-14 17:59, Austin Leahy wrote:
>
>> I think that C is the strongest option; making the ingest really robust is
>> going to lower barriers to adoption. Doing it in Python will open up the
>> ingest portion of the project to many more developers.
>>
>> Before it comes up, I would like to throw the following on the pile:
>> major Python projects (Django, Flask, and others) are dropping 2.x support
>> in releases scheduled in the next 6 to 8 months. Hadoop projects in general
>> tend to lag in modern Python support, so let's please build this in 3.5 so
>> that we don't have to immediately expect a rebuild in the pipeline.
>>
>> -Vote C
>>
>> Thanks Nate
>>
>> Austin
>>
>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
>>
>>> I really like option C because it gives a lot of flexibility for ingest
>>> (Python vs. Scala) but still has the robust Spark Streaming backend for
>>> performance.
>>>
>>> Thanks for putting this together Nate.
>>>
>>> Alan
>>>
>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:
>>>
>>> > I agree. We should continue making the existing stack more mature at
>>> > this point. Maybe if we have enough community support we can add
>>> > additional datastores.
>>> >
>>> > Chokha.
>>> >
>>> >
>>> > On 4/14/17 11:10 AM, [email protected] wrote:
>>> > > Hi Kant,
>>> > >
>>> > >
>>> > > YARN is the standard scheduler in Hadoop. If you're using Hive+Spark,
>>> > > then sure you'll have YARN.
>>> > >
>>> > > I haven't seen any Hive on Mesos so far. As said, Spot is based on a
>>> > > quite standard Hadoop stack and I wouldn't switch too many pieces yet.
>>> > >
>>> > > In most open source projects you start by relying on a well-known stack
>>> > > and then begin to support other DB backends once it's quite mature.
>>> > > Think of the loads of LAMP apps which haven't been ported away from
>>> > > MySQL yet.
>>> > >
>>> > > In any case, you'll need high-performance SQL + massive storage +
>>> > > machine learning + massive ingestion, and... ATM, that can only be
>>> > > provided by Hadoop.
>>> > >
>>> > > Regards!
>>> > >
>>> > > Kenneth
>>> > >
>>> > > On 2017-04-14 12:56, kant kodali wrote:
>>> > >> Hi Kenneth,
>>> > >>
>>> > >> Thanks for the response. I think you made a case for HDFS; however,
>>> > >> users may want to use S3 or some other FS, in which case they can use
>>> > >> Alluxio (hoping that no changes are needed within Spot, in which case
>>> > >> I can agree to that). For example, Netflix stores all their data in S3.
>>> > >>
>>> > >> I would say the distributed SQL query engine should be pluggable with
>>> > >> whatever the user may want to use, and there are a bunch of them out
>>> > >> there. Sure, Impala is better than Hive, but what if users are already
>>> > >> using something else like Drill or Presto?
>>> > >>
>>> > >> Personally, I would not assume that users are willing to deploy all of
>>> > >> that and make their existing stack more complicated; at the very least
>>> > >> I would say it is an uphill battle. Things have been changing rapidly
>>> > >> in the Big Data space, so whatever we think is standard won't be
>>> > >> standard anymore, but importantly there shouldn't be any reason why we
>>> > >> shouldn't be flexible, right?
>>> > >>
>>> > >> Also, I am not sure why only YARN. Why not make that more flexible too,
>>> > >> so users can pick Mesos or standalone?
>>> > >>
>>> > >> I think flexibility is key for wide adoption, rather than a tightly
>>> > >> coupled architecture.
>>> > >>
>>> > >> Thanks!
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:
>>> > >>
>>> > >>> PS: you need a big data platform to be able to collect all those
>>> > >>> netflows and logs.
>>> > >>>
>>> > >>> Spot isn't intended for SMBs, that's clear. You also need loads of
>>> > >>> data to get ML working properly, and somewhere to run those
>>> > >>> algorithms. That is Hadoop.
>>> > >>>
>>> > >>> Regards!
>>> > >>>
>>> > >>> Kenneth
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> Sent from my Mi phone
>>> > >>> On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM wrote:
>>> > >>>
>>> > >>> Hi,
>>> > >>>
>>> > >>> Thanks for starting this thread. Here is my feedback.
>>> > >>>
>>> > >>> I somehow think the architecture is too complicated for wide adoption,
>>> > >>> since it requires installing the following:
>>> > >>>
>>> > >>> HDFS.
>>> > >>> HIVE.
>>> > >>> IMPALA.
>>> > >>> KAFKA.
>>> > >>> SPARK (YARN).
>>> > >>> YARN.
>>> > >>> Zookeeper.
>>> > >>>
>>> > >>> Currently there are way too many dependencies, which discourages a lot
>>> > >>> of users from using it because they have to go through the deployment
>>> > >>> of all that required software. I think for wide adoption we should
>>> > >>> minimize the dependencies and have a more pluggable architecture. For
>>> > >>> example, I am not sure why both Hive and Impala are required. Why not
>>> > >>> just use Spark SQL, since it's already a dependency? Or say users may
>>> > >>> want to use their own distributed query engine, such as Apache Drill
>>> > >>> or something else. We should be flexible enough to provide that option.
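
The pluggable query engine idea above can be sketched in Python as a small interface plus a registry. This is only an illustration, not Spot code: `QueryEngine`, `SQLiteEngine`, `register_engine`, and `get_engine` are hypothetical names, and the standard library's `sqlite3` stands in for Impala, Hive, Drill, or Presto so that the sketch runs without a cluster.

```python
import sqlite3
from abc import ABC, abstractmethod

# Hypothetical registry mapping an engine name to its implementation class.
_ENGINES = {}


class QueryEngine(ABC):
    """Minimal interface the rest of the pipeline would depend on (assumed)."""

    @abstractmethod
    def execute(self, sql, params=()):
        """Run a SQL statement and return all result rows as a list of tuples."""


def register_engine(name):
    """Class decorator that makes an engine selectable by name."""
    def decorator(cls):
        _ENGINES[name] = cls
        return cls
    return decorator


def get_engine(name, **kwargs):
    """Instantiate the engine registered under `name`."""
    if name not in _ENGINES:
        raise ValueError("unknown query engine: " + name)
    return _ENGINES[name](**kwargs)


@register_engine("sqlite")
class SQLiteEngine(QueryEngine):
    """Stand-in backend; a real deployment would wrap Impala, Drill, etc."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)

    def execute(self, sql, params=()):
        cur = self._conn.execute(sql, params)
        rows = cur.fetchall()
        self._conn.commit()
        return rows


if __name__ == "__main__":
    engine = get_engine("sqlite")
    engine.execute("CREATE TABLE flow (src TEXT, dst TEXT, bytes INTEGER)")
    engine.execute("INSERT INTO flow VALUES (?, ?, ?)", ("10.0.0.1", "10.0.0.2", 512))
    print(engine.execute("SELECT src, bytes FROM flow"))
```

With this shape, the choice of engine becomes a configuration value passed to `get_engine`, and callers never import a specific backend directly.
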
>>> > >>>
>>> > >>> Also, I see that HDFS is used so that collectors can receive file paths
>>> > >>> through Kafka and then read a file. How big are these files? Do we
>>> > >>> really need HDFS for this? Why not provide more ways to send data, such
>>> > >>> as sending data directly through Kafka, or just leaving it up to the
>>> > >>> user to specify the file location as an argument to the collector
>>> > >>> process?
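
A minimal sketch of that last suggestion: a collector that takes the file location as an argument and hands records to a pluggable sink. Everything here is hypothetical: `collect`, `ListSink`, and the `--path` flag are illustrative names, not part of Spot-ingest, and the in-memory sink stands in for a Kafka producer or an HDFS writer.

```python
import argparse
import os
import tempfile


class ListSink:
    """In-memory stand-in for a real sink such as a Kafka producer (assumed)."""

    def __init__(self):
        self.records = []

    def send(self, record):
        self.records.append(record)


def collect(path, sink):
    """Read `path` line by line and forward each non-empty line to `sink`.

    Returns the number of records sent.
    """
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                sink.send(line)
                count += 1
    return count


def main(argv=None):
    parser = argparse.ArgumentParser(description="Toy collector (illustrative only)")
    parser.add_argument("--path", required=True, help="log file to ingest")
    args = parser.parse_args(argv)
    sink = ListSink()
    sent = collect(args.path, sink)
    print("sent {} records".format(sent))
    return sink


if __name__ == "__main__":
    # Demo with a temporary file so the sketch runs without any cluster.
    with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
        tmp.write("flow-1\n\nflow-2\n")
    try:
        main(["--path", tmp.name])
    finally:
        os.remove(tmp.name)
```

Because `collect` only needs an object with a `send` method, swapping the in-memory sink for one that publishes to Kafka would not change the collector itself.
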
>>> > >>>
>>> > >>> Finally, I learnt that generating NetFlow data requires specific
>>> > >>> hardware. This really means Apache Spot is not meant for everyone. I
>>> > >>> thought Apache Spot could be used to analyze the network traffic of any
>>> > >>> machine, but if it requires specific hardware then I think it is
>>> > >>> targeted at a specific group of people.
>>> > >>>
>>> > >>> The real strength of Apache Spot should mainly be just analyzing
>>> > >>> network traffic through ML.
>>> > >>>
>>> > >>> Thanks!
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:
>>> > >>>
>>> > >>> > Thanks, Nate,
>>> > >>> >
>>> > >>> > Nate.
>>> > >>> >
>>> > >>> >
>>> > >>> > -----Original Message-----
>>> > >>> > From: Nate Smith [mailto:[email protected]]
>>> > >>> > Sent: Thursday, April 13, 2017 4:26 PM
>>> > >>> > To: [email protected]
>>> > >>> > Cc: [email protected]; [email protected]
>>> > >>> > Subject: Re: [Discuss] - Future plans for Spot-ingest
>>> > >>> >
>>> > >>> > I was really hoping it came through ok,
>>> > >>> > Oh well :)
>>> > >>> > Here it is in image form:
>>> > >>> > http://imgur.com/a/DUDsD
>>> > >>> >
>>> > >>> >
>>> > >>> > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
>>> > >>> > >
>>> > >>> > > The diagram became garbled in the text format.
>>> > >>> > > Could you resend it as a pdf?
>>> > >>> > >
>>> > >>> > > Thanks,
>>> > >>> > > Nate
>>> > >>> > >
>>> > >>> > > -----Original Message-----
>>> > >>> > > From: Nathanael Smith [mailto:[email protected]]
>>> > >>> > > Sent: Thursday, April 13, 2017 4:01 PM
>>> > >>> > > To: [email protected]; [email protected]; [email protected]
>>> > >>> > > Subject: [Discuss] - Future plans for Spot-ingest
>>> > >>> > >
>>> > >>> > > How would you like to see Spot-ingest change?
>>> > >>> > >
>>> > >>> > > A. Continue development on the Python Master/Worker with a focus on performance / error handling / logging
>>> > >>> > > B. Develop a Scala-based ingest to be in line with the code base from ingest, ml, to OA (UI to continue being ipython/JS)
>>> > >>> > > C. Python ingest Worker with Scala-based Spark code for normalization and input into DB
>>> > >>> > >
>>> > >>> > > Including the high level diagram:
>>> > >>> > > +------------------------------------------------------------------------------------------+
>>> > >>> > > | +--------------------------+                                  +-----------------+        |
>>> > >>> > > | | Master                   |  A. B. C.                        | Worker          |        |
>>> > >>> > > | |    A. Python             +---------------+      A.          |   A. Python     |        |
>>> > >>> > > | |    B. Scala              |               |    +------------->                 +----+   |
>>> > >>> > > | |    C. Python             |               |    |             |                 |    |   |
>>> > >>> > > | +---^------+---------------+               |    |             +-----------------+    |   |
>>> > >>> > > |     |      |                               |    |                                    |   |
>>> > >>> > > |     |      |                               |    |                                    |   |
>>> > >>> > > |     |     +Note--------------+             |    |             +-----------------+    |   |
>>> > >>> > > |     |     |Running on a      |             |    |             | Spark Streaming |    |   |
>>> > >>> > > |     |     |worker node in    |             |    |      B. C.  | B. Scala        |    |   |
>>> > >>> > > |     |     |the Hadoop cluster|             |    |    +--------> C. Scala        +-+  |   |
>>> > >>> > > |     |     +------------------+             |    |    |        |                 | |  |   |
>>> > >>> > > |   A.|                                      |    |    |        +-----------------+ |  |   |
>>> > >>> > > |   B.|                                      |    |    |                            |  |   |
>>> > >>> > > |   C.|                                      |    |    |                            |  |   |
>>> > >>> > > | +----------------------+          +-v------+----+----+-+           +--------------v--v-+ |
>>> > >>> > > | |                      |          |                    |           |                   | |
>>> > >>> > > | |   Local FS:          |          |    hdfs            |           |  Hive / Impala    | |
>>> > >>> > > | |  - Binary/Text       |          |                    |           |   - Parquet -     | |
>>> > >>> > > | |    Log files -       |          |                    |           |                   | |
>>> > >>> > > | |                      |          |                    |           |                   | |
>>> > >>> > > | +----------------------+          +--------------------+           +-------------------+ |
>>> > >>> > > +------------------------------------------------------------------------------------------+
>>> > >>> > >
>>> > >>> > > Please let me know your thoughts,
>>> > >>> > >
>>> > >>> > > - Nathanael
>>> > >>> > >
>>> > >>> > >
>>> > >>> > >
>>> > >>> >
>>> > >>> >
>>> > >>>
>>> > >>>
>>> > >
>>> >
>>> >
>>>
>>>
>


-- 
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
