+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:
I think C is the strongest solution; getting the ingest really solid is
going to lower barriers to adoption. Doing it in Python will open up the
ingest portion of the project to many more developers.

Before it comes up, I would like to throw the following on the pile: major
Python projects (Django, Flask, and others) are dropping 2.x support in
releases scheduled over the next 6 to 8 months. Hadoop projects in general
tend to lag in modern Python support, so let's please build this on 3.5 so
that we don't have to expect an immediate rebuild in the pipeline.
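For what it's worth, a tiny illustrative sketch of why targeting 3.5 up front helps: the syntax below (annotations and keyword-only arguments) is Python 3 only, so an accidental 2.x interpreter fails loudly at parse time instead of silently misbehaving later. The function and field names here are made up, not from the Spot code base.

```python
# Minimal sketch, not Spot code: annotations and keyword-only arguments
# are Python 3 syntax, so running this under 2.x raises a SyntaxError,
# catching a wrong interpreter immediately.
import sys

def parse_flow_line(line: str, *, delimiter: str = ",") -> list:
    """Split one ingest record into stripped fields (illustrative only)."""
    return [field.strip() for field in line.split(delimiter)]

# Explicit guard for anything the parser alone wouldn't catch.
assert sys.version_info >= (3, 5), "spot-ingest assumes Python 3.5+"

fields = parse_flow_line("10.0.0.1, 10.0.0.2, 443")
```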

-Vote C

Thanks Nate

Austin

On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest
(python vs scala) but still has the robust spark streaming backend for
performance.

Thanks for putting this together Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
[email protected]> wrote:

I agree. We should continue making the existing stack more mature at
this point. Maybe if we have enough community support we can add
additional datastores.

Chokha.


On 4/14/17 11:10 AM, [email protected] wrote:
Hi Kant,


YARN is the standard scheduler in Hadoop. If you're using Hive + Spark,
then sure, you'll have YARN.

I haven't seen any Hive on Mesos so far. As I said, Spot is based on a
fairly standard Hadoop stack, and I wouldn't switch out too many pieces yet.

In most open-source projects you start by relying on a well-known stack
and then begin to support other DB backends once the project is fairly
mature. Think of all the LAMP apps that still haven't been ported away
from MySQL.

In any case, you'll need high-performance SQL + massive storage +
machine learning + massive ingestion, and at the moment that can only be
provided by Hadoop.

Regards!

Kenneth

On 2017-04-14 12:56, kant kodali wrote:
Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users
may want to use S3 or some other FS, in which case they could use Alluxio
(hoping that no changes would be needed within Spot, in which case I can
agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with
whatever the user wants, and there are a bunch of them out there. Sure,
Impala is better than Hive, but what if users are already using something
else, like Drill or Presto?

Personally, I would not assume that users are willing to deploy all of
that and make their existing stack more complicated; at the very least I
would say it is an uphill battle. Things have been changing rapidly in the
big data space, so whatever we think is standard won't be standard for
long; more importantly, there shouldn't be any reason why we can't be
flexible, right?

Also, I am not sure why only YARN? Why not make that more flexible too, so
users can pick Mesos or standalone?

I think flexibility is the key to wide adoption, rather than a tightly
coupled architecture.

Thanks!

On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]>
wrote:

PS: you need a big data platform to be able to collect all those netflows
and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get
ML working properly, and somewhere to run those algorithms. That is
Hadoop.

Regards!

Kenneth



Sent from my Mi phone
On Apr 14, 2017, at 4:04 AM, kant kodali <[email protected]> wrote:

Hi,

Thanks for starting this thread. Here is my feedback.

I somehow think the architecture is too complicated for wide adoption,
since it requires installing the following:

HDFS
Hive
Impala
Kafka
Spark (on YARN)
YARN
ZooKeeper

Currently there are way too many dependencies, which discourages a lot of
users because they have to work through deploying all of that required
software. I think for wide adoption we should minimize the dependencies
and have a more pluggable architecture. For example, I am not sure why
both Hive and Impala are required. Why not just use Spark SQL, since Spark
is already a dependency? Or users may want to bring their own distributed
query engine, such as Apache Drill or something else. We should be
flexible enough to provide that option.
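To make the pluggability idea concrete, here is a hypothetical sketch of what such an abstraction could look like: the analytics layer codes against a small interface, and an Impala, Presto, Drill, or Spark SQL adapter sits behind it. Nothing here is existing Spot API; the class and method names are made up, and SQLite stands in for a real backend only so the sketch runs self-contained.

```python
# Hypothetical pluggable query-engine interface; names are illustrative,
# not part of any existing Spot code.
import sqlite3
from abc import ABC, abstractmethod

class QueryEngine(ABC):
    """Minimal contract the analytics layer would code against."""

    @abstractmethod
    def execute(self, sql: str) -> list:
        """Run a SQL statement and return result rows as tuples."""

class SQLiteEngine(QueryEngine):
    """Stand-in backend so the sketch is runnable; a real deployment
    would swap in an Impala/Presto/Drill adapter behind the same
    interface."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")

    def execute(self, sql: str) -> list:
        return self._conn.execute(sql).fetchall()

# The caller never names a concrete engine, only the interface.
engine: QueryEngine = SQLiteEngine()
engine.execute("CREATE TABLE flows (src TEXT, dst TEXT, port INTEGER)")
engine.execute("INSERT INTO flows VALUES ('10.0.0.1', '10.0.0.2', 443)")
rows = engine.execute("SELECT src, port FROM flows")
```

Swapping backends then means writing one adapter class, not touching the analytics code.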

Also, I see that HDFS is used so that collectors can receive file paths
through Kafka and then read the files. How big are these files? Do we
really need HDFS for this? Why not provide more ways to send data, such as
sending the data itself directly through Kafka, or just leaving it up to
the user to specify the file location as an argument to the collector
process?
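As a rough illustration of the "send the data itself through Kafka" alternative, each record could be serialized to a compact compressed envelope and published as an ordinary Kafka message, with no HDFS path indirection. This is a sketch under assumptions: the field names and topic name are invented, and the actual Kafka client call is only indicated in a comment so the snippet stays self-contained.

```python
# Hypothetical direct-ingest envelope: serialize one flow record to
# gzip-compressed JSON bytes small enough to publish as a single Kafka
# message. Field names are illustrative, not from Spot.
import gzip
import json

def encode_event(record: dict) -> bytes:
    """Producer side: record -> compressed JSON payload."""
    return gzip.compress(json.dumps(record, sort_keys=True).encode("utf-8"))

def decode_event(payload: bytes) -> dict:
    """Collector side: inverse of encode_event."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))

event = {"src": "10.0.0.1", "dst": "10.0.0.2", "port": 443}
payload = encode_event(event)
# With a client such as kafka-python, the payload would then be sent, e.g.
#   KafkaProducer(bootstrap_servers=...).send("spot-ingest-raw", payload)
# (broker setup omitted here so the sketch stays runnable on its own).
roundtrip = decode_event(payload)
```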

Finally, I learned that generating NetFlow data requires specific
hardware. This really means Apache Spot is not meant for everyone. I
thought Apache Spot could be used to analyze the network traffic of any
machine, but if it requires specific hardware, then it is targeted at a
specific group of people.

The real strength of Apache Spot should mainly be analyzing network
traffic through ML.

Thanks!

On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
[email protected]> wrote:

Thanks, Nate.

Nate


-----Original Message-----
From: Nate Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:26 PM
To: [email protected]
Cc: [email protected];
[email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I was really hoping it came through ok,
Oh well :)
Here’s an image form:
http://imgur.com/a/DUDsD


On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
[email protected]> wrote:
The diagram became garbled in the text format.
Could you resend it as a pdf?

Thanks,
Nate

-----Original Message-----
From: Nathanael Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:01 PM
To: [email protected];
[email protected];
[email protected]
Subject: [Discuss] - Future plans for Spot-ingest

How would you like to see Spot-ingest change?

A. Continue development on the Python Master/Worker with a focus on
performance / error handling / logging
B. Develop a Scala-based ingest to be in line with the code base from
ingest and ml through OA (UI to continue being iPython/JS)
C. Python ingest Worker with Scala-based Spark code for normalization and
input into the DB

Including the high-level diagram:
[High-level diagram, garbled by plain-text line wrapping; an image version
was posted later in the thread at http://imgur.com/a/DUDsD. In summary: a
Master (A: Python, B: Scala, C: Python), running on a worker node in the
Hadoop cluster, dispatches to Workers (A: Python; B and C: Scala Spark
Streaming). Data flows from the local FS (binary/text log files) into HDFS
and then into Hive / Impala as Parquet.]
Please let me know your thoughts,

- Nathanael
