I agree. We should continue making the existing stack more mature at this point. Maybe, if we get enough community support, we can add additional datastores later.
Chokha.

On 4/14/17 11:10 AM, [email protected] wrote:
> Hi Kant,
>
> YARN is the standard scheduler in Hadoop. If you're using Hive + Spark,
> then sure, you'll have YARN.
>
> I haven't seen Hive on Mesos so far. As I said, Spot is based on a
> fairly standard Hadoop stack, and I wouldn't switch out too many
> pieces yet.
>
> Most open-source projects start by relying on a well-known stack and
> begin to support other DB backends once the project is quite mature.
> Think of the many LAMP apps that still haven't been ported away from
> MySQL.
>
> In any case, you'll need high-performance SQL + massive storage +
> machine learning + massive ingestion, and at the moment only Hadoop
> provides all of that.
>
> Regards!
>
> Kenneth
>
> On 2017-04-14 12:56, kant kodali wrote:
>> Hi Kenneth,
>>
>> Thanks for the response. I think you made a case for HDFS; however,
>> users may want to use S3 or some other filesystem, in which case they
>> could use Alluxio (hoping that no changes would be needed within
>> Spot, in which case I can agree to that). For example, Netflix stores
>> all its data in S3.
>>
>> The distributed SQL query engine, I would say, should be pluggable
>> with whatever the user wants to use, and there are a bunch of them
>> out there. Sure, Impala is better than Hive, but what if users are
>> already using something else, like Drill or Presto?
>>
>> Personally, I would not assume that users are willing to deploy all
>> of that and make their existing stack more complicated; at the very
>> least I would say it is an uphill battle. Things have been changing
>> rapidly in the big data space, so whatever we think is standard today
>> won't be standard tomorrow. More importantly, there shouldn't be any
>> reason why we can't be flexible, right?
>>
>> Also, I am not sure why only YARN. Why not make that more flexible
>> too, so users can pick Mesos or standalone?
>>
>> I think flexibility is key to wide adoption, rather than a tightly
>> coupled architecture.
>>
>> Thanks!
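The pluggable-query-engine idea raised above can be sketched as a thin abstraction layer that picks the SQL backend from configuration. This is only an illustration of the pattern, not Spot's actual code: `QueryEngine`, `get_engine`, and the stubbed backends are hypothetical names, and real implementations would call impyla, a JDBC driver, or `spark.sql()` instead of returning stub tuples.

```python
from abc import ABC, abstractmethod

class QueryEngine(ABC):
    """Minimal interface a pluggable SQL backend would implement."""

    @abstractmethod
    def execute(self, sql: str) -> list:
        """Run a query and return rows."""

class ImpalaEngine(QueryEngine):
    def execute(self, sql: str) -> list:
        # A real backend would go through impyla/ODBC; stubbed here.
        return [("impala", sql)]

class SparkSQLEngine(QueryEngine):
    def execute(self, sql: str) -> list:
        # A real backend would call spark.sql(sql); stubbed here.
        return [("spark", sql)]

# Registry keyed by a config value, so no engine is hard-coded.
ENGINES = {"impala": ImpalaEngine, "spark": SparkSQLEngine}

def get_engine(name: str) -> QueryEngine:
    """Instantiate the backend named in configuration."""
    return ENGINES[name]()
```

With this shape, supporting Drill or Presto means adding one class and one registry entry; the rest of the pipeline only ever sees `QueryEngine.execute`.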
>>
>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]>
>> wrote:
>>
>>> PS: you need a big data platform to be able to collect all those
>>> netflows and logs.
>>>
>>> Spot isn't intended for SMBs, that's clear: you need loads of data
>>> to get ML working properly, and somewhere to run those algorithms.
>>> That is Hadoop.
>>>
>>> Regards!
>>>
>>> Kenneth
>>>
>>> Sent from my Mi phone
>>>
>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for starting this thread. Here is my feedback.
>>>
>>> I think the architecture is too complicated for wide adoption, since
>>> it requires installing the following:
>>>
>>> - HDFS
>>> - Hive
>>> - Impala
>>> - Kafka
>>> - Spark (on YARN)
>>> - YARN
>>> - ZooKeeper
>>>
>>> There are currently way too many dependencies, which discourages a
>>> lot of users, because they have to deploy all of that required
>>> software. For wide adoption we should minimize the dependencies and
>>> have a more pluggable architecture. For example, I am not sure why
>>> both Hive and Impala are required. Why not just use Spark SQL, since
>>> Spark is already a dependency? Or users may want to use their own
>>> preferred distributed query engine, such as Apache Drill. We should
>>> be flexible enough to provide that option.
>>>
>>> Also, I see that HDFS is used so that collectors can receive file
>>> paths through Kafka and then read the files. How big are these
>>> files? Do we really need HDFS for this? Why not provide more ways to
>>> send data, such as sending data directly through Kafka, or simply
>>> leaving it up to the user to specify the file location as an
>>> argument to the collector process?
>>>
>>> Finally, I learned that generating NetFlow data requires specific
>>> hardware. This really means Apache Spot is not meant for everyone.
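The ingest flexibility asked about above (file paths via Kafka versus data sent directly through Kafka) could be sketched as a collector that dispatches on message type. The JSON message shape below is invented for illustration and is not Spot's actual Kafka contract; a real collector would receive `raw` from a Kafka consumer rather than a function argument.

```python
import json

def handle_message(raw: bytes) -> bytes:
    """Dispatch one ingest message to the right acquisition strategy.

    Two hypothetical message shapes are supported:
      {"type": "path",   "value": "/data/flows.bin"}  -> read that file
      {"type": "inline", "value": "<payload>"}        -> use payload as-is
    """
    msg = json.loads(raw)
    if msg["type"] == "inline":
        # Data travelled through Kafka itself; no shared filesystem needed.
        return msg["value"].encode()
    if msg["type"] == "path":
        # Data is referenced by path; the path could point at HDFS, S3
        # via a mount, or plain local disk.
        with open(msg["value"], "rb") as f:
            return f.read()
    raise ValueError("unknown message type: %s" % msg["type"])
```

Keeping the path case as just "open whatever the path names" is what lets the storage backend stay pluggable: the collector never assumes HDFS specifically.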
>>> I thought Apache Spot could be used to analyze the network traffic
>>> of any machine, but if it requires specific hardware then I think it
>>> is targeted at a specific group of people.
>>>
>>> The real strength of Apache Spot should mainly be analyzing network
>>> traffic through ML.
>>>
>>> Thanks!
>>>
>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>>> [email protected]> wrote:
>>>
>>> > Thanks, Nate.
>>> >
>>> > Nate.
>>> >
>>> > -----Original Message-----
>>> > From: Nate Smith [mailto:[email protected]]
>>> > Sent: Thursday, April 13, 2017 4:26 PM
>>> > To: [email protected]
>>> > Cc: [email protected]; [email protected]
>>> > Subject: Re: [Discuss] - Future plans for Spot-ingest
>>> >
>>> > I was really hoping it came through OK.
>>> > Oh well :)
>>> > Here it is in image form:
>>> > http://imgur.com/a/DUDsD
>>> >
>>> > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
>>> > > [email protected]> wrote:
>>> > >
>>> > > The diagram became garbled in the text format.
>>> > > Could you resend it as a PDF?
>>> > >
>>> > > Thanks,
>>> > > Nate
>>> > >
>>> > > -----Original Message-----
>>> > > From: Nathanael Smith [mailto:[email protected]]
>>> > > Sent: Thursday, April 13, 2017 4:01 PM
>>> > > To: [email protected]; [email protected];
>>> > > [email protected]
>>> > > Subject: [Discuss] - Future plans for Spot-ingest
>>> > >
>>> > > How would you like to see Spot-ingest change?
>>> > >
>>> > > A. Continue development on the Python Master/Worker, with a
>>> > >    focus on performance / error handling / logging.
>>> > > B. Develop a Scala-based ingest to be in line with the code base
>>> > >    from ingest and ML through to OA (the UI to continue being
>>> > >    IPython/JS).
>>> > > C. Python ingest Worker with Scala-based Spark code for
>>> > >    normalization and input into the DB.
>>> > >
>>> > > Including the high-level diagram:
>>> > >
>>> > > +--------------------+   A./B./C.    +----------------------+
>>> > > |       Master       +-------------->|        Worker        |
>>> > > |  A. Python         |               |  A. Python           |
>>> > > |  B. Scala          |               |  B. Scala  (Spark    |
>>> > > |  C. Python         |               |  C. Scala  Streaming)|
>>> > > +----^----------+----+               +-----------+----------+
>>> > >      |          |                                |
>>> > >      |          v                                v
>>> > > +----+-------+  +--------+           +-------------------+
>>> > > | Local FS:  |  |  hdfs  |           |  Hive / Impala    |
>>> > > | binary/    |  |        |           |   - Parquet -     |
>>> > > | text log   |  |        |           |                   |
>>> > > | files      |  |        |           |                   |
>>> > > +------------+  +--------+           +-------------------+
>>> > >
>>> > > Note: the Master runs on a worker node in the Hadoop cluster.
>>> > >
>>> > > Please let me know your thoughts,
>>> > >
>>> > > - Nathanael
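Option A's Python Master/Worker split can be sketched with an in-process queue standing in for Kafka. Everything here is illustrative, not Spot's actual ingest code: in the real pipeline the Master publishes HDFS paths to a Kafka topic and the Worker normalizes the files into Hive/Impala, while this sketch only records what would be written.

```python
import queue
import threading

def master(paths, q):
    """Master: publish newly landed file paths (queue stands in for Kafka)."""
    for p in paths:
        q.put(p)
    q.put(None)  # sentinel: no more work

def worker(q, results):
    """Worker: consume paths and record the normalization step."""
    while True:
        p = q.get()
        if p is None:
            break
        # A real worker would parse/normalize the file and load it into
        # the datastore; here we just note what would happen.
        results.append(("normalized", p))

def run_pipeline(paths):
    """Wire one Master to one Worker and drain the work queue."""
    q = queue.Queue()
    results = []
    t = threading.Thread(target=worker, args=(q, results))
    t.start()
    master(paths, q)
    t.join()
    return results
```

The value of this shape for the discussion above is that the Master/Worker boundary is just a message stream, so the Worker side could be swapped for Scala or Spark Streaming (options B and C) without touching the Master.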
