Re: [Discuss] - Future plans for Spot-ingest

Mark Grover Fri, 14 Apr 2017 10:15:14 -0700

Hi Nate,
Thanks for starting this. I do feel strongly that it's best to keep the
workers in a JVM-based language like Scala.
I don't have a strong feeling about the master, but for the reasons pointed
out above by others, it may make sense to keep the master in Python.


So, Option C would be the vote from me as well.
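
To make the Option C split concrete, here is a minimal sketch of the Python
front end only, assuming it does nothing more than announce newly landed
files to the Scala/Spark back end. Everything in it is hypothetical: the
function names, the message fields, and the in-process queue.Queue standing
in for what would presumably be a Kafka topic. It is not taken from the Spot
code base.

```python
# Hypothetical sketch of the Option C hand-off; none of these names come from Spot.
import json
import queue
from pathlib import Path


def build_notification(path: Path, datatype: str) -> str:
    """Serialize the small message the Python side would publish for one file."""
    return json.dumps(
        {"type": datatype, "path": str(path), "bytes": path.stat().st_size}
    )


def collect(landing_dir: Path, datatype: str, out: "queue.Queue[str]") -> int:
    """Queue one notification per regular file in landing_dir; return the count."""
    count = 0
    for path in sorted(landing_dir.iterdir()):
        if path.is_file():
            # Assumption: a real collector would publish to Kafka here instead.
            out.put(build_notification(path, datatype))
            count += 1
    return count
```

The Scala side would then consume these messages and handle normalization
and the write into Hive/Impala; that part is deliberately left out here.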

On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

> Option C is to use Python on the front end of the ingest pipeline and
> Spark/Scala on the back end.
>
> Option A uses Python workers on the back end.
>
> Option B uses all Scala.
>
>
>
> -----Original Message-----
> From: kant kodali [mailto:[email protected]]
> Sent: Friday, April 14, 2017 9:53 AM
> To: [email protected]
> Subject: Re: [Discuss] - Future plans for Spot-ingest
>
> What is option C? Am I missing an email or something?
>
> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> [email protected]> wrote:
>
> > +1 for Python 3.x
> >
> >
> >
> > On 4/14/2017 11:59 AM, Austin Leahy wrote:
> >
> >> I think that C is the strong solution; getting the ingest really
> >> strong is going to lower barriers to adoption. Doing it in Python
> >> will open up the ingest portion of the project to include many more
> >> developers.
> >>
> >> Before it comes up I would like to throw the following on the pile...
> >> Major Python projects (Django, Flask, and others) are dropping 2.x
> >> support in releases scheduled in the next 6 to 8 months. Hadoop
> >> projects in general tend to lag in modern Python support; let's please
> >> build this in 3.5 so that we don't have to immediately expect a
> >> rebuild in the pipeline.
> >>
> >> -Vote C
> >>
> >> Thanks Nate
> >>
> >> Austin
> >>
> >> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
> >>
> >>> I really like option C because it gives a lot of flexibility for
> >>> ingest (Python vs. Scala) but still has the robust Spark Streaming
> >>> backend for performance.
> >>>
> >>> Thanks for putting this together Nate.
> >>>
> >>> Alan
> >>>
> >>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> >>> [email protected]> wrote:
> >>>
> >>>> I agree. We should continue making the existing stack more mature at
> >>>> this point. Maybe if we have enough community support we can add
> >>>> additional datastores.
> >>>>
> >>>> Chokha.
> >>>>
> >>>>
> >>>> On 4/14/17 11:10 AM, [email protected] wrote:
> >>>>
> >>>>> Hi Kant,
> >>>>>
> >>>>>
> >>>>> YARN is the standard scheduler in Hadoop. If you're using
> >>>>> Hive+Spark, then sure you'll have YARN.
> >>>>>
> >>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based on a
> >>>>> quite standard Hadoop stack and I wouldn't switch too many pieces
> >>>>> yet.
> >>>>>
> >>>>> In most open source projects you start by relying on a well-known
> >>>>> stack and then you begin to support other DB backends once it's
> >>>>> quite mature. Think of the loads of LAMP apps which haven't been
> >>>>> ported away from MySQL yet.
> >>>>>
> >>>>> In any case, you'll need high-performance SQL + massive storage
> >>>>> + machine learning + massive ingestion, and... ATM, that can
> >>>>> only be provided by Hadoop.
> >>>>>
> >>>>> Regards!
> >>>>>
> >>>>> Kenneth
> >>>>>
> >>>>> On 2017-04-14 12:56, kant kodali wrote:
> >>>>>
> >>>>>> Hi Kenneth,
> >>>>>>
> >>>>>> Thanks for the response. I think you made a case for HDFS;
> >>>>>> however, users may want to use S3 or some other FS, in which case
> >>>>>> they can use Auxilio (hoping that there are no changes needed
> >>>>>> within Spot, in which case I can agree to that). For example,
> >>>>>> Netflix stores all their data in S3.
> >>>>>>
> >>>>>> The distributed SQL query engine, I would say, should be pluggable
> >>>>>> with whatever users may want to use, and there are a bunch of them
> >>>>>> out there. Sure, Impala is better than Hive, but what if users are
> >>>>>> already using something else like Drill or Presto?
> >>>>>>
> >>>>>> Personally, I would not assume that users are willing to deploy
> >>>>>> all of that and make their existing stack more complicated; at the
> >>>>>> very least I would say it is an uphill battle. Things have been
> >>>>>> changing rapidly in the big data space, so whatever we think is
> >>>>>> standard won't be standard anymore, but importantly there
> >>>>>> shouldn't be any reason why we shouldn't be flexible, right?
> >>>>>>
> >>>>>> Also, I am not sure why only YARN. Why not make that more flexible
> >>>>>> too, so users can pick Mesos or standalone?
> >>>>>>
> >>>>>> I think flexibility is key for wide adoption, rather than a
> >>>>>> tightly coupled architecture.
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> >>>>>> <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> PS: you need a big data platform to be able to collect all those
> >>>>>>> netflows and logs.
> >>>>>>>
> >>>>>>> Spot isn't intended for SMBs, that's clear; you need loads of
> >>>>>>> data to get ML working properly, and somewhere to run those
> >>>>>>> algorithms. That is Hadoop.
> >>>>>>>
> >>>>>>> Regards!
> >>>>>>>
> >>>>>>> Kenneth
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Sent from my Mi phone
> >>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Thanks for starting this thread. Here is my feedback.
> >>>>>>>
> >>>>>>> I somehow think the architecture is too complicated for wide
> >>>>>>> adoption since it requires installing the following:
> >>>>>>>
> >>>>>>> HDFS.
> >>>>>>> HIVE.
> >>>>>>> IMPALA.
> >>>>>>> KAFKA.
> >>>>>>> SPARK (YARN).
> >>>>>>> YARN.
> >>>>>>> Zookeeper.
> >>>>>>>
> >>>>>>> Currently there are way too many dependencies, which discourages
> >>>>>>> a lot of users from using it because they have to go through the
> >>>>>>> deployment of all that required software. I think for wide
> >>>>>>> adoption we should minimize the dependencies and have a more
> >>>>>>> pluggable architecture. For example, I am not sure why Hive and
> >>>>>>> Impala are both required. Why not just use Spark SQL, since it's
> >>>>>>> already a dependency? Or say users may want to use their own
> >>>>>>> distributed query engine, such as Apache Drill or something
> >>>>>>> else; we should be flexible enough to provide that option.
> >>>>>>>
> >>>>>>> Also, I see that HDFS is used so that collectors can receive
> >>>>>>> file paths through Kafka and then read the files. How big are
> >>>>>>> these files? Do we really need HDFS for this? Why not provide
> >>>>>>> more ways to send data, such as sending data directly through
> >>>>>>> Kafka, or just leaving it up to the user to specify the file
> >>>>>>> location as an argument to the collector process?
> >>>>>>>
> >>>>>>> Finally, I learnt that to generate NetFlow data one would
> >>>>>>> require specific hardware. This really means Apache Spot is
> >>>>>>> not meant for everyone. I thought Apache Spot could be used to
> >>>>>>> analyze the network traffic of any machine, but if it requires
> >>>>>>> specific hardware then I think it is targeted at a specific
> >>>>>>> group of people.
> >>>>>>>
> >>>>>>> The real strength of Apache Spot should mainly be just analyzing
> >>>>>>> network traffic through ML.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>> Thanks, Nate,
> >>>>>>>>
> >>>>>>>> Nate.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Nate Smith [mailto:[email protected]]
> >>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> >>>>>>>> To: [email protected]
> >>>>>>>> Cc: [email protected]; [email protected]
> >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> >>>>>>>>
> >>>>>>>> I was really hoping it came through OK. Oh well :) Here it is
> >>>>>>>> in image form:
> >>>>>>>> http://imgur.com/a/DUDsD
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> The diagram became garbled in the text format.
> >>>>>>>>> Could you resend it as a pdf?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Nate
> >>>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> >>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> >>>>>>>>> To: [email protected]; [email protected]; [email protected]
> >>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> >>>>>>>>>
> >>>>>>>>> How would you like to see Spot-ingest change?
> >>>>>>>>>
> >>>>>>>>> A. Continue development on the Python Master/Worker with focus
> >>>>>>>>>    on performance / error handling / logging
> >>>>>>>>> B. Develop Scala-based ingest to be in line with the code base
> >>>>>>>>>    from ingest, ML, to OA (UI to continue being IPython/JS)
> >>>>>>>>> C. Python ingest Worker with Scala-based Spark code for
> >>>>>>>>>    normalization and input into DB
> >>>>>>>>
> >>>>>>>>> Including the high level diagram:
> >>>>>>>>> [ASCII diagram; box and arrow labels as far as they survive:]
> >>>>>>>>> - Master (A. Python, B. Scala, C. Python), with the note
> >>>>>>>>>   "Running on a worker node in the Hadoop cluster"
> >>>>>>>>> - Local FS: Binary/Text Log files, connected to Master (A. B. C.)
> >>>>>>>>> - hdfs, fed from Master (A. B. C.)
> >>>>>>>>> - Worker (A. Python), fed from hdfs (A.)
> >>>>>>>>> - Spark Streaming (B. Scala, C. Scala), fed from hdfs (B. C.)
> >>>>>>>>> - Hive / Impala (Parquet), fed from Worker and Spark Streaming
> >>>>>>>>
> >>>>>>>>> Please let me know your thoughts,
> >>>>>>>>>
> >>>>>>>>> - Nathanael
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>
> >
>
