"AVRO for ingestion, Parquet for storage." This makes sense to me.
On Wed, Apr 19, 2017 at 1:45 PM, Michael Ridley <[email protected]> wrote:

One point I wanted to add the other day is that we probably do need to write to Avro for the streaming ingest, since Parquet doesn't do so great with streaming ingest. But then we can convert from Avro to Parquet using whatever tool (SparkSQL, Hive, whatever) for query. Whether the Avro is persisted is an open question in my mind.

Michael

On Wed, Apr 19, 2017 at 4:22 PM, <[email protected]> wrote:

Replying to myself, AVRO for ingestion, Parquet for storage.

Regards!

Kenneth

On 2017-04-19 22:05, Austin Leahy wrote:

I think Kafka is probably a red herring. It's an industry go-to in the application world because of redundancy, but the type and volumes of network telemetry that we are talking about here will bog Kafka down unless you dedicate really serious hardware to just the Kafka implementation. It's essentially the next level of the problem that the team was already running into when RabbitMQ was queueing in data.

On Wed, Apr 19, 2017 at 12:33 PM Mark Grover <[email protected]> wrote:

On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <[email protected]> wrote:

Mark,

just digesting the below.

Backing up in my thought process, I was thinking that the ingest master (first point of entry into the system) would want to put the data into a standard serializable format. I was thinking that libraries (such as pyarrow in this case) could help by writing the data in Parquet format early in the process. You are probably correct that at this point in time it might not be worth the time and can be kept in the backlog.
That being said, I still think the master should produce data in a standard format; what, in your opinion (and I open this up of course to others), would be the most logical format? The most basic would be to just keep it as a .csv.

The master will likely write data to a staging directory in HDFS where the Spark Streaming job will pick it up for normalization/writing to Parquet in the correct block sizes and partitions.

Hi Nate,
Avro is usually preferred for such a standard format - because it asserts a schema (types, etc.), which CSV doesn't, and it allows for schema evolution, which, depending on the type of evolution, CSV may or may not support. And that's something I have seen being done very commonly.

Now, if the data were in Kafka before it gets to the master, one could argue that the master could just send metadata to the workers (topic name, partition number, offset start and end) and the workers could read from Kafka directly. I do understand that'd be a much different architecture than the current one, but if you think it's a good idea too, we could document that, say in a JIRA, and (de-)prioritize it (and in line with the rest of the discussion on this thread, it's not the top-most priority). Thoughts?

- Nathanael
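[Inline note: the metadata-only hand-off Mark describes above could look roughly like this on the worker side. This is just a sketch using the kafka-python client; the broker address, topic name and offsets are made up for illustration.]

    # Sketch: a worker consumes exactly the offset range the master assigned it.
    # Broker, topic, partition and offsets below are illustrative only.
    from kafka import KafkaConsumer, TopicPartition

    def read_assignment(bootstrap, topic, partition, start_offset, end_offset):
        consumer = KafkaConsumer(bootstrap_servers=bootstrap,
                                 enable_auto_commit=False,
                                 consumer_timeout_ms=10000)
        tp = TopicPartition(topic, partition)
        consumer.assign([tp])
        consumer.seek(tp, start_offset)
        records = []
        for msg in consumer:               # stops at end_offset (exclusive) or on timeout
            if msg.offset >= end_offset:
                break
            records.append(msg.value)
        consumer.close()
        return records

    # e.g. chunk = read_assignment("broker:9092", "spot-flow", 3, 1000, 2000)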
On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:

Thanks all for your opinion.

I think it's good to consider two things:
1. What do (we think) users care about?
2. What's the cost of changing things?

About #1, I think users care more about what format the data is written in than how the data is written. I'd argue that whether that uses Hive, MR, or a custom Parquet writer is not as important to them as long as we maintain data/format compatibility.
About #2, having worked on several projects, I find that it's rather difficult to keep up with Parquet. Even in Spark, there are a few different ways to write to Parquet - there's a regular mode, and a legacy mode <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44> which continues to cause confusion <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet itself is pretty dependent on Hadoop <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93> and just integrating it with systems with a lot of developers (like Spark <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>) is still a lot of work.

I personally think we should leverage higher-level tools like Hive or Spark to write data in widespread formats (Parquet being a very good example), but I personally wouldn't encourage us to manage the writers ourselves.

Thoughts?
Mark

On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote:

Without having given it too terribly much thought, that seems like an OK approach.

Michael

On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:

I think the question is whether we can write the data generically to HDFS as Parquet without the use of Hive/Impala.

Today we write Parquet data using the Hive/MapReduce method.
As part of the redesign I'd like to use libraries for this as opposed to a Hadoop dependency.
I think it would be preferred to use the Python master to write the data into the format we want, then do normalization of the data in Spark Streaming.
Any thoughts?

- Nathanael
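[Inline note: a rough sketch of what "use the Python master to write the data" could look like with pyarrow, i.e. no Hive/MapReduce in the write path. The field names and staging path are invented for illustration, not Spot's actual schema.]

    # Sketch: write a batch of normalized records to a Parquet file directly
    # from the Python master using pyarrow. Schema and path are illustrative.
    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_batch(records, path):
        # records: list of dicts, e.g. [{"tstart": ..., "srcip": ..., "dstip": ...}]
        columns = {key: [r[key] for r in records] for key in records[0]}
        table = pa.Table.from_pydict(columns)
        pq.write_table(table, path, compression="snappy")

    # write_batch(batch, "/staging/flow/part-00001.parquet")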
On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:

I had thought that the plan was to write the data in Parquet in HDFS ultimately.

Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

Hi Mark,

Thank you so much for hearing my argument. And I definitely understand that you guys have a bunch of things to do. My only concern is that I hope it doesn't take too long to support other backends. For example, @Kenneth had given the example of the LAMP stack not having moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from the Python multiprocessing or Spark Streaming are written back to HDFS; if so, can we write them in Parquet format, such that users are able to plug in any query engine? But again, I am not pushing you guys to do this right away or anything, just seeing if there is a way for me to get started in parallel, and if it's not feasible, that's fine. I just wanted to share my 2 cents and I am glad my argument is heard!

Thanks much!

On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:

Hi Kant,
Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.

The design can and should be pluggable, but the project has one stack it ships out of the box with, one stack that's the default in the sense that it's the most tested and so on. And, for us, that's our current stack.
If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable to run Hive on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came and re-used some of that pluggability and even added some more so Hive-on-Spark could become a reality. In the same way, I don't think anyone disagrees here that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use-case in mind.

As a project, we have many things to do, and I personally think the biggest bang for the buck for us in making Spot a really solid and the best cyber security solution isn't pluggability but the things we are working on - a better user interface, a common/unified approach to storing and modeling data, etc.

Having said that, we are open; if it's important to you or someone else, we'd be happy to receive and review those patches.

Thanks!
Mark

On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:

Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.
On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

Option C is to use Python on the front end of the ingest pipeline and Spark/Scala on the back end.

Option A uses Python workers on the backend.

Option B uses all Scala.

-----Original Message-----
From: kant kodali [mailto:[email protected]]
Sent: Friday, April 14, 2017 9:53 AM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:

+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:

I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption. Doing it in Python will open up the ingest portion of the project to include many more developers.

Before it comes up, I would like to throw the following on the pile... Major Python projects (Django/Flask, among others) are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern Python support, so let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.

-Vote C

Thanks Nate

Austin

On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest (Python vs Scala) but still has the robust Spark Streaming backend for performance.

Thanks for putting this together Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:

I agree. We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.

Chokha.

On 4/14/17 11:10 AM, [email protected] wrote:

Hi Kant,

YARN is the standard scheduler in Hadoop. If you're using Hive+Spark, then sure you'll have YARN.
Haven't seen any Hive on Mesos so far. As said, Spot is based on a quite standard Hadoop stack and I wouldn't switch too many pieces yet.

In most open-source projects you start relying on a well-known stack and then you begin to support other DB backends once it's quite mature. Think of the loads of LAMP apps which haven't been ported away from MySQL yet.

In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and... ATM, that can only be provided by Hadoop.

Regards!

Kenneth

On 2017-04-14 12:56, kant kodali wrote:

Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with whatever the user may want to use, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else like Drill or Presto?

Personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least I would say it is an uphill battle. Things have been changing rapidly in the Big Data space, so whatever we think is standard won't be standard anymore, but more importantly there shouldn't be any reason why we shouldn't be flexible, right?

Also, I am not sure why only YARN? Why not make that also more flexible so users can pick Mesos or standalone? I think flexibility is key for wide adoption rather than a tightly coupled architecture.

Thanks!
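[Inline note: on the S3-vs-HDFS point, with Spark the storage layer is largely a matter of the path URI, assuming the right connector (e.g. hadoop-aws for s3a) and credentials are on the cluster. A tiny illustrative sketch; the bucket and paths are made up.]

    # Sketch: the same Parquet read against HDFS or S3, differing only in the URI.
    # The S3 case needs the s3a connector (hadoop-aws) and credentials configured.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fs-pluggability").getOrCreate()

    flows_hdfs = spark.read.parquet("hdfs://namenode:8020/spot/flow/")
    flows_s3   = spark.read.parquet("s3a://example-bucket/spot/flow/")  # illustrative bucket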
On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:

PS: you need a big data platform to be able to collect all those netflows and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.

Regards!

Kenneth

Sent from my Mi phone

On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:

Hi,

Thanks for starting this thread. Here is my feedback.

I somehow think the architecture is too complicated for wide adoption since it requires installing the following:

HDFS
Hive
Impala
Kafka
Spark (YARN)
YARN
Zookeeper

Currently there are way too many dependencies, which discourages a lot of users from using it because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why Hive & Impala are both required? Why not just use Spark SQL, since it's already a dependency, or say users may want to use their own distributed query engine, such as Apache Drill or something else. We should be flexible enough to provide that option.

Also, I see that HDFS is used such that collectors can receive file paths through Kafka and be able to read a file. How big are these files? Do we really need HDFS for this? Why not provide more ways to send data, such as sending data directly through Kafka, or say just leaving it up to the user to specify the file location as an argument to the collector process.

Finally, I learnt that to generate NetFlow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.

The real strength of Apache Spot should mainly be just analyzing network traffic through ML.

Thanks!
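[Inline note: the two hand-off styles mentioned above, side by side, sketched with the kafka-python producer. Topic names and file paths are made up for illustration.]

    # Sketch: shipping only an HDFS file path through Kafka (current style) vs.
    # shipping the records themselves. Topics and paths are illustrative.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="broker:9092")

    # Current style: the collector publishes the path of a file it landed in HDFS.
    producer.send("spot-flow-paths", b"/user/spot/stage/flow/nfcapd.20170419.csv")

    # Alternative: publish the raw records directly, no shared filesystem needed.
    with open("nfcapd.20170419.csv", "rb") as f:
        for line in f:
            producer.send("spot-flow-records", line.rstrip())

    producer.flush()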
On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:

Thanks, Nate,

Nate.

-----Original Message-----
From: Nate Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:26 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I was really hoping it came through OK. Oh well :) Here it is in image form:
http://imgur.com/a/DUDsD

On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:

The diagram became garbled in the text format.
Could you resend it as a pdf?

Thanks,
Nate

-----Original Message-----
From: Nathanael Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:01 PM
To: [email protected]; [email protected]; [email protected]
Subject: [Discuss] - Future plans for Spot-ingest

How would you like to see Spot-ingest change?
A. Continue development on the Python Master/Worker with focus on performance / error handling / logging
B. Develop Scala-based ingest to be in line with the code base from ingest, ml, to OA (UI to continue being IPython/JS)
C. Python ingest Worker with Scala-based Spark code for normalization and input into the DB

Including the high level diagram:

[The ASCII diagram was garbled in the plain-text archive; a rendered copy is at http://imgur.com/a/DUDsD. At a high level it shows a Master (Python for options A and C, Scala for B) handing off to Workers and/or a Scala Spark Streaming job running on worker nodes in the Hadoop cluster (depending on option A, B or C), with binary/text log files on the local FS on one side and Parquet in HDFS queried via Hive/Impala on the other.]

Please let me know your thoughts,

- Nathanael

--
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
