Re: [Discuss] - Future plans for Spot-ingest

Michael Ridley Thu, 20 Apr 2017 08:16:12 -0700

When we say ingest from Kafka, what does that mean?  I understand we can
read from Kafka to ingest into the cluster, but how will the data get to
Kafka and what data are we talking about?  My understanding is that right
now the primary data sources would be Netflow and Syslog, neither of which
writes to Kafka natively so we would need something like StreamSets in the
middle.  Certainly StreamSets UDP source -> Kafka would work.


Michael
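
A minimal sketch of the "UDP source -> Kafka" hop described above, assuming
a small relay process takes the place of StreamSets. The names relay_udp,
publish, and demo are hypothetical, and the publish callable stands in for
whatever Kafka producer client would actually be used; no real Kafka API is
called here.

```python
import socket
from typing import Callable, List


def relay_udp(sock: socket.socket,
              publish: Callable[[bytes], None],
              max_datagrams: int,
              bufsize: int = 65535) -> int:
    """Read up to max_datagrams datagrams from sock and pass each to publish."""
    forwarded = 0
    while forwarded < max_datagrams:
        payload, _addr = sock.recvfrom(bufsize)
        # Assumption: in a real deployment, publish would wrap a Kafka
        # producer's send call for a chosen topic. Here it is any callable.
        publish(payload)
        forwarded += 1
    return forwarded


def demo() -> List[bytes]:
    """Send one syslog-style datagram over loopback and relay it into a list."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.settimeout(5.0)          # do not block forever if nothing arrives
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    received: List[bytes] = []
    try:
        sender.sendto(b"<34>Apr 20 08:16:12 host app: test message",
                      listener.getsockname())
        relay_udp(listener, received.append, max_datagrams=1)
    finally:
        sender.close()
        listener.close()
    return received
```

Keeping the publisher as a plain callable means the same loop could feed
Kafka, Flume, or a test list, which is the pluggability question the thread
keeps returning to.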

On Wed, Apr 19, 2017 at 7:05 PM, kant kodali <[email protected]> wrote:

> Sure, I guess Kafka has something called Kafka Connect, but it may not be as
> mature as Flume, since I only heard about it recently.
>
> On Wed, Apr 19, 2017 at 3:39 PM, Austin Leahy <[email protected]>
> wrote:
>
> > The advantage of Flume, or a Flume/Kafka hybrid, is that the team doesn't
> > have to build sinks for any new source types added to the project; we just
> > create configs pointing to the landing pad.
> > On Wed, Apr 19, 2017 at 3:31 PM kant kodali <[email protected]> wrote:
> >
> > > What kind of benchmarks are we looking for? Just throughput, since I am
> > > assuming this is for ingestion? I haven't seen anything faster than Kafka,
> > > and that is because of its simplicity: after all, the publisher appends
> > > messages to a file (the so-called partition in Kafka) and clients just do
> > > sequential reads from a file, so it's a matter of disk throughput. The
> > > benchmark numbers I have for Kafka are at the very least 75K messages/sec,
> > > where each message is 1KB, on m4.xlarge, which by default has EBS storage
> > > (EBS is a network-attached SSD disk). The network-attached disk has a max
> > > throughput of 125MB/s (m4.xlarge has 1 Gigabit), but if we were to deploy
> > > it on ephemeral storage (local SSD) and on a 10 Gigabit network we would
> > > easily get 5-10X more.
> > >
> > > No idea about flume.
> > >
> > > Finally, I'm not trying to pitch for Kafka; however, it is the fastest I
> > > have seen, but if someone has better numbers for Flume then we should use
> > > that. Also, I would suspect there are benchmarks for Kafka vs Flume
> > > available online already, or we can try it with our own datasets.
> > >
> > > Thanks!
> > >
> > > On Wed, Apr 19, 2017 at 3:09 PM, Austin Leahy <
> [email protected]>
> > > wrote:
> > >
> > > > I am happy to create and test a Flume source... #intelteam would need
> > > > to create the benchmark by deploying it and pointing a data source at
> > > > it... since I don't have a good enough volume of source data handy.
> > > > On Wed, Apr 19, 2017 at 3:04 PM Ross, Alan D <[email protected]>
> > > > wrote:
> > > >
> > > > > We discussed this in our staff meeting a bit today.  I would like to
> > > > > see some benchmarking of different approaches (Kafka, Flume, etc.) to
> > > > > see what the numbers look like. Is anyone in the community willing to
> > > > > volunteer on this work?
> > > > >
> > > > > -----Original Message-----
> > > > > From: Austin Leahy [mailto:[email protected]]
> > > > > Sent: Wednesday, April 19, 2017 1:05 PM
> > > > > To: [email protected]
> > > > > Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > > >
> > > > > I think Kafka is probably a red herring. It's an industry go-to in the
> > > > > application world because of redundancy, but the type and volumes of
> > > > > network telemetry that we are talking about here will bog Kafka down
> > > > > unless you dedicate really serious hardware to just the Kafka
> > > > > implementation. It's essentially the next level of the problem that
> > > > > the team was already running into when RabbitMQ was queueing in data.
> > > > >
> > > > > On Wed, Apr 19, 2017 at 12:33 PM Mark Grover <[email protected]>
> > wrote:
> > > > >
> > > > > > On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Mark,
> > > > > > >
> > > > > > > just digesting the below.
> > > > > > >
> > > > > > > > Backing up in my thought process, I was thinking that the ingest
> > > > > > > > master (first point of entry into the system) would want to put
> > > > > > > > the data into a standard serializable format. I was thinking that
> > > > > > > > libraries (such as pyarrow in this case) could help by writing
> > > > > > > > the data in Parquet format early in the process. You are probably
> > > > > > > > correct that at this point in time it might not be worth the time
> > > > > > > > and can be kept in the backlog.
> > > > > > > > That being said, I still think the master should produce data in
> > > > > > > > a standard format. What, in your opinion (and I open this up of
> > > > > > > > course to others), would be the most logical format?
> > > > > > > > The most basic would be to just keep it as a .csv.
> > > > > > > >
> > > > > > > > The master will likely write data to a staging directory in HDFS
> > > > > > > > where the Spark Streaming job will pick it up for
> > > > > > > > normalization/writing to Parquet in the correct block sizes and
> > > > > > > > partitions.
> > > > > > >
> > > > > >
> > > > > > Hi Nate,
> > > > > > > Avro is usually preferred for such a standard format because it
> > > > > > > asserts a schema (types, etc.), which CSV doesn't, and it allows
> > > > > > > for schema evolution, which, depending on the type of evolution,
> > > > > > > CSV may or may not support.
> > > > > > > And that's something I have seen being done very commonly.
> > > > > >
> > > > > > > Now, if the data were in Kafka before it gets to the master, one
> > > > > > > could argue that the master could just send metadata to the workers
> > > > > > > (topic name, partition number, offset start and end) and the
> > > > > > > workers could read from Kafka directly. I do understand that'd be a
> > > > > > > much different architecture than the current one, but if you think
> > > > > > > it's a good idea too, we could document that, say in a JIRA, and
> > > > > > > (de-)prioritize it (and in line with the rest of the discussion on
> > > > > > > this thread, it's not the top-most priority).
> > > > > >
> > > > > > - Nathanael
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]>
> > > wrote:
> > > > > > > >
> > > > > > > > > > Thanks all for your opinions.
> > > > > > > >
> > > > > > > > I think it's good to consider two things:
> > > > > > > > 1. What do (we think) users care about?
> > > > > > > > 2. What's the cost of changing things?
> > > > > > > >
> > > > > > > > > > About #1, I think users care more about what format data is
> > > > > > > > > > written in than how the data is written. I'd argue whether
> > > > > > > > > > that uses Hive, MR, or a custom Parquet writer is not as
> > > > > > > > > > important to them as long as we maintain data/format
> > > > > > > > > > compatibility.
> > > > > > > > > > About #2, having worked on several projects, I find that it's
> > > > > > > > > > rather difficult to keep up with Parquet. Even in Spark,
> > > > > > > > > > there are a few different ways to write to Parquet - there's
> > > > > > > > > > a regular mode, and a legacy mode
> > > > > > > > > > <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
> > > > > > > > > > which continues to cause confusion
> > > > > > > > > > <https://issues.apache.org/jira/browse/SPARK-20297> till
> > > > > > > > > > date. Parquet itself is pretty dependent on Hadoop
> > > > > > > > > > <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
> > > > > > > > > > and just integrating it with systems with a lot of developers
> > > > > > > > > > (like Spark
> > > > > > > > > > <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
> > > > > > > > > > is still a lot of work.
> > > > > > > >
> > > > > > > > > > I personally think we should leverage higher level tools like
> > > > > > > > > > Hive or Spark to write data in widespread formats (Parquet
> > > > > > > > > > being a very good example), but I personally wouldn't
> > > > > > > > > > encourage us to manage the writers ourselves.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > > Mark
> > > > > > > >
> > > > > > > > On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley
> > > > > > > > <[email protected]
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Without having given it too terribly much thought, that
> seems
> > > > > > > >> like an
> > > > > > OK
> > > > > > > >> approach.
> > > > > > > >>
> > > > > > > >> Michael
> > > > > > > >>
> > > > > > > >> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <
> > > > > > [email protected]>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > > > >>> I think the question is rather whether we can write the data
> > > > > > > > > >>> generically to HDFS as Parquet without the use of Hive/Impala?
> > > > > > > >>>
> > > > > > > > > >>> Today we write Parquet data using the Hive/MapReduce method.
> > > > > > > > > >>> As part of the redesign I'd like to use libraries for this as
> > > > > > > > > >>> opposed to a Hadoop dependency.
> > > > > > > > > >>> I think it would be preferred to use the Python master to
> > > > > > > > > >>> write the data into the format we want, then do normalization
> > > > > > > > > >>> of the data in Spark Streaming.
> > > > > > > > > >>> Any thoughts?
> > > > > > > >>>
> > > > > > > >>> - Nathanael
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley
> > > > > > > >>>> <[email protected]>
> > > > > > > >>> wrote:
> > > > > > > >>>>
> > > > > > > >>>> I had thought that the plan was to write the data in
> Parquet
> > > in
> > > > > > > >>>> HDFS ultimately.
> > > > > > > >>>>
> > > > > > > >>>> Michael
> > > > > > > >>>>
> > > > > > > >>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali
> > > > > > > >>>> <[email protected]>
> > > > > > > >>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Hi Mark,
> > > > > > > >>>>>
> > > > > > > > > >>>>> Thank you so much for hearing my argument. And I definitely
> > > > > > > > > >>>>> understand that you guys have a bunch of things to do. My
> > > > > > > > > >>>>> only concern is that I hope it doesn't take too long to
> > > > > > > > > >>>>> support other backends. For example, @Kenneth had given the
> > > > > > > > > >>>>> example of LAMP stacks that had not moved away from MySQL
> > > > > > > > > >>>>> yet, which essentially means it's probably a decade? I see
> > > > > > > > > >>>>> that in the current architecture the results from Python
> > > > > > > > > >>>>> multiprocessing or Spark Streaming are written back to HDFS.
> > > > > > > > > >>>>> If so, can we write them in Parquet format, such that users
> > > > > > > > > >>>>> should be able to plug in any query engine? But again, I am
> > > > > > > > > >>>>> not pushing you guys to do this right away or anything, just
> > > > > > > > > >>>>> seeing if there is a way for me to get started in parallel,
> > > > > > > > > >>>>> and if that's not feasible, it's fine. I just wanted to share
> > > > > > > > > >>>>> my 2 cents, and I am glad my argument is heard!
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks much!
> > > > > > > >>>>>
> > > > > > > >>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> Hi Kant,
> > > > > > > >>>>>> Just wanted to make sure you don't feel like we are
> > ignoring
> > > > > > > >>>>>> your
> > > > > > > >>>>>> comment:-) I hear you about pluggability.
> > > > > > > >>>>>>
> > > > > > > > > >>>>>> The design can and should be pluggable, but the project has
> > > > > > > > > >>>>>> one stack it ships out of the box with, one stack that's the
> > > > > > > > > >>>>>> default stack in the sense that it's the most tested and so
> > > > > > > > > >>>>>> on. And, for us, that's our current stack.
> > > > > > > > > >>>>>> If we were to take Apache Hive as an example, it shipped
> > > > > > > > > >>>>>> (and ships) with MapReduce as the default configuration
> > > > > > > > > >>>>>> engine. At some point, Apache Tez came along and wanted Hive
> > > > > > > > > >>>>>> to run on Tez, so they made a bunch of things pluggable to
> > > > > > > > > >>>>>> run Hive on Tez (instead of the only option up until then:
> > > > > > > > > >>>>>> Hive-on-MR), and then Apache Spark came and re-used some of
> > > > > > > > > >>>>>> that pluggability and even added some more so Hive-on-Spark
> > > > > > > > > >>>>>> could become a reality. In the same way, I don't think
> > > > > > > > > >>>>>> anyone disagrees here that pluggability is a good thing, but
> > > > > > > > > >>>>>> it's hard to do pluggability right, and at the right level,
> > > > > > > > > >>>>>> unless one has a clear use-case in mind.
> > > > > > > >>>>>>
> > > > > > > > > >>>>>> As a project, we have many things to do and I personally
> > > > > > > > > >>>>>> think the biggest bang for the buck for us in making Spot a
> > > > > > > > > >>>>>> really solid and the best cyber security solution isn't
> > > > > > > > > >>>>>> pluggability but the things we are working on - a better
> > > > > > > > > >>>>>> user interface, a common/unified approach to storing and
> > > > > > > > > >>>>>> modeling data, etc.
> > > > > > > >>>>>>
> > > > > > > > > >>>>>> Having said that, we are open; if it's important to you or
> > > > > > > > > >>>>>> someone else, we'd be happy to receive and review those
> > > > > > > > > >>>>>> patches.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Thanks!
> > > > > > > >>>>>> Mark
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali
> > > > > > > >>>>>> <[email protected]
> > > > > > >
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>>
> > > > > > > > > >>>>>>> Thanks Ross! And yes, option C sounds good to me as well;
> > > > > > > > > >>>>>>> however, I just think the distributed SQL query engine and
> > > > > > > > > >>>>>>> the resource manager should be pluggable.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <
> > > > > > > >> [email protected]>
> > > > > > > >>>>>>> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> Option C is to use python on the front end of ingest
> > > > > > > >>>>>>>> pipeline
> > > > > > and
> > > > > > > >>>>>>>> spark/scala on the back end.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Option A uses python workers on the backend
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Option B uses all scala.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> -----Original Message-----
> > > > > > > >>>>>>>> From: kant kodali [mailto:[email protected]]
> > > > > > > >>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
> > > > > > > >>>>>>>> To: [email protected]
> > > > > > > >>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> What is option C ? am I missing an email or something?
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai
> <
> > > > > > > >>>>>>>> [email protected]> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>> +1 for Python 3.x
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>>> I think that C is the strong solution; getting the
> > > > > > > > > >>>>>>>>>> ingest really strong is going to lower barriers to
> > > > > > > > > >>>>>>>>>> adoption. Doing it in Python will open up the ingest
> > > > > > > > > >>>>>>>>>> portion of the project to include many more developers.
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Before it comes up, I would like to throw the following
> > > > > > > > > >>>>>>>>>> on the pile... Major Python projects (Django/Flask,
> > > > > > > > > >>>>>>>>>> others) are dropping 2.x support in releases scheduled
> > > > > > > > > >>>>>>>>>> in the next 6 to 8 months. Hadoop projects in general
> > > > > > > > > >>>>>>>>>> tend to lag in modern Python support; let's please build
> > > > > > > > > >>>>>>>>>> this in 3.5 so that we don't have to immediately expect
> > > > > > > > > >>>>>>>>>> a rebuild in the pipeline.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> -Vote C
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Thanks Nate
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Austin
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross
> > > > > > > >>>>>>>>>> <[email protected]>
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>>> I really like option C because it gives a lot of
> > > > > > > > > >>>>>>>>>>> flexibility for ingest (Python vs Scala) but still has
> > > > > > > > > >>>>>>>>>>> the robust Spark Streaming backend for performance.
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Thanks for putting this together Nate.
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> Alan
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha
> > Palayamkottai <
> > > > > > > >>>>>>>>>>> [email protected]> wrote:
> > > > > > > >>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>> I agree. We should continue making the existing stack
> > > > > > > > > >>>>>>>>>>>> more mature at this point. Maybe if we have enough
> > > > > > > > > >>>>>>>>>>>> community support we can add additional datastores.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Chokha.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Hi Kant,
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're
> > > > > > > > > >>>>>>>>>>>>> using Hive+Spark, then sure you'll have YARN.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> Haven't seen any Hive on Mesos so far. As said, Spot
> > > > > > > > > >>>>>>>>>>>>> is based on a quite standard Hadoop stack and I
> > > > > > > > > >>>>>>>>>>>>> wouldn't switch too many pieces yet.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> In most open source projects you start by relying on
> > > > > > > > > >>>>>>>>>>>>> a well-known stack and then you begin to support
> > > > > > > > > >>>>>>>>>>>>> other DB backends once it's quite mature. Think of
> > > > > > > > > >>>>>>>>>>>>> the loads of LAMP apps which haven't been ported away
> > > > > > > > > >>>>>>>>>>>>> from MySQL yet.
> > > > > > > > > >>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>> In any case, you'll need high-performance SQL +
> > > > > > > > > >>>>>>>>>>>>> Massive Storage + Machine Learning + Massive
> > > > > > > > > >>>>>>>>>>>>> Ingestion, and... ATM, that can only be provided by
> > > > > > > > > >>>>>>>>>>>>> Hadoop.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Regards!
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Kenneth
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> A 2017-04-14 12:56, kant kodali escrigué:
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Hi Kenneth,
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> Thanks for the response.  I think you made a case
> > > > > > > > > >>>>>>>>>>>>>> for HDFS; however, users may want to use S3 or some
> > > > > > > > > >>>>>>>>>>>>>> other FS, in which case they can use Alluxio (hoping
> > > > > > > > > >>>>>>>>>>>>>> that there are no changes needed within Spot, in
> > > > > > > > > >>>>>>>>>>>>>> which case I can agree to that). For example,
> > > > > > > > > >>>>>>>>>>>>>> Netflix stores all their data in S3.
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> The distributed SQL query engine, I would say,
> > > > > > > > > >>>>>>>>>>>>>> should be pluggable with whatever the user may want
> > > > > > > > > >>>>>>>>>>>>>> to use, and there are a bunch of them out there.
> > > > > > > > > >>>>>>>>>>>>>> Sure, Impala is better than Hive, but what if users
> > > > > > > > > >>>>>>>>>>>>>> are already using something else like Drill or
> > > > > > > > > >>>>>>>>>>>>>> Presto?
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> Me personally, I would not assume that users are
> > > > > > > > > >>>>>>>>>>>>>> willing to deploy all of that and make their
> > > > > > > > > >>>>>>>>>>>>>> existing stack more complicated; at the very least I
> > > > > > > > > >>>>>>>>>>>>>> would say it is an uphill battle. Things have been
> > > > > > > > > >>>>>>>>>>>>>> changing rapidly in the Big Data space, so whatever
> > > > > > > > > >>>>>>>>>>>>>> we think is standard won't be standard anymore, but
> > > > > > > > > >>>>>>>>>>>>>> importantly there shouldn't be any reason why we
> > > > > > > > > >>>>>>>>>>>>>> shouldn't be flexible, right?
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> Also, I am not sure why only YARN? Why not make that
> > > > > > > > > >>>>>>>>>>>>>> more flexible too, so users can pick Mesos or
> > > > > > > > > >>>>>>>>>>>>>> standalone?
> > > > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>> I think flexibility is key for wide adoption, rather
> > > > > > > > > >>>>>>>>>>>>>> than a tightly coupled architecture.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Thanks!
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> > > > > > > >>>>>>>>>>>>>> <[email protected]>
> > > > > > > >>>>>>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> PS: you need a big data platform to be able to
> > > > > > > > > >>>>>>>>>>>>>>> collect all those netflows and logs.
> > > > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; then
> > > > > > > > > >>>>>>>>>>>>>>> you need loads of data to get ML working properly,
> > > > > > > > > >>>>>>>>>>>>>>> and somewhere to run those algorithms. That is
> > > > > > > > > >>>>>>>>>>>>>>> Hadoop.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Regards!
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Kenneth
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Sent from my Mi phone On kant kodali
> > > > > > > >>>>>>>>>>>>>>> <[email protected]>, Apr 14, 2017 4:04
> > > > > > AM
> > > > > > > >>>>>> wrote:
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Hi,
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Thanks for starting this thread. Here is my
> > > feedback.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> I somehow think the architecture is too complicated
> > > > > > > > > >>>>>>>>>>>>>>> for wide adoption since it requires installing the
> > > > > > > > > >>>>>>>>>>>>>>> following.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> HDFS.
> > > > > > > >>>>>>>>>>>>>>> HIVE.
> > > > > > > >>>>>>>>>>>>>>> IMPALA.
> > > > > > > >>>>>>>>>>>>>>> KAFKA.
> > > > > > > >>>>>>>>>>>>>>> SPARK (YARN).
> > > > > > > >>>>>>>>>>>>>>> YARN.
> > > > > > > >>>>>>>>>>>>>>> Zookeeper.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> Currently there are way too many dependencies,
> > > > > > > > > >>>>>>>>>>>>>>> which discourages a lot of users from using it
> > > > > > > > > >>>>>>>>>>>>>>> because they have to go through deployment of all
> > > > > > > > > >>>>>>>>>>>>>>> that required software. I think for wide adoption
> > > > > > > > > >>>>>>>>>>>>>>> we should minimize the dependencies and have a more
> > > > > > > > > >>>>>>>>>>>>>>> pluggable architecture. For example, I am not sure
> > > > > > > > > >>>>>>>>>>>>>>> why Hive and Impala are both required. Why not just
> > > > > > > > > >>>>>>>>>>>>>>> use Spark SQL, since it's already a dependency? Or,
> > > > > > > > > >>>>>>>>>>>>>>> say, users may want to use their own distributed
> > > > > > > > > >>>>>>>>>>>>>>> query engine that they like, such as Apache Drill
> > > > > > > > > >>>>>>>>>>>>>>> or something else. We should be flexible enough to
> > > > > > > > > >>>>>>>>>>>>>>> provide that option.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> Also, I see that HDFS is used such that collectors
> > > > > > > > > >>>>>>>>>>>>>>> can receive file paths through Kafka and be able to
> > > > > > > > > >>>>>>>>>>>>>>> read a file. How big are these files? Do we really
> > > > > > > > > >>>>>>>>>>>>>>> need HDFS for this? Why not provide more ways to
> > > > > > > > > >>>>>>>>>>>>>>> send data, such as sending data directly through
> > > > > > > > > >>>>>>>>>>>>>>> Kafka or, say, just leaving it up to the user to
> > > > > > > > > >>>>>>>>>>>>>>> specify the file location as an argument to the
> > > > > > > > > >>>>>>>>>>>>>>> collector process?
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> Finally, I learnt that to generate NetFlow data one
> > > > > > > > > >>>>>>>>>>>>>>> would require specific hardware. This really means
> > > > > > > > > >>>>>>>>>>>>>>> Apache Spot is not meant for everyone.
> > > > > > > > > >>>>>>>>>>>>>>> I thought Apache Spot could be used to analyze the
> > > > > > > > > >>>>>>>>>>>>>>> network traffic of any machine, but if it requires
> > > > > > > > > >>>>>>>>>>>>>>> specific hardware then I think it is targeted at a
> > > > > > > > > >>>>>>>>>>>>>>> specific group of people.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > > > >>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be
> > > > > > > > > >>>>>>>>>>>>>>> just analyzing network traffic through ML.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Thanks!
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind,
> > Nathan
> > > L
> > > > > > > >>>>>>>>>>>>>>> < [email protected]> wrote:
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> Thanks, Nate,
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Nate.
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> -----Original Message-----
> > > > > > > >>>>>>>>>>>>>>>> From: Nate Smith [mailto:
> [email protected]]
> > > > > > > >>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > > > > > > >>>>>>>>>>>>>>>> To: [email protected]
> > > > > > > >>>>>>>>>>>>>>>> Cc: [email protected];
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>> [email protected]
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for
> > Spot-ingest
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> I was really hoping it came through ok, Oh
> well
> > :)
> > > > > > Here’s
> > > > > > > >> an
> > > > > > > >>>>>>>>>>>>>>>> image form:
> > > > > > > >>>>>>>>>>>>>>>> http://imgur.com/a/DUDsD
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan
> > L <
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> [email protected]> wrote:
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> The diagram became garbled in the text
> format.
> > > > > > > >>>>>>>>>>>>>>>>> Could you resend it as a pdf?
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>>>>>>>> Nate
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> -----Original Message-----
> > > > > > > > >>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> > > > > > > > >>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > > > > > > > >>>>>>>>>>>>>>>>> To: [email protected]; [email protected]; [email protected]
> > > > > > > > >>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > > >>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker, with a focus on
> > > > > > > > >>>>>>>>>>>>>>>>>    performance / error handling / logging
> > > > > > > > >>>>>>>>>>>>>>>>> B. Develop a Scala-based ingest to be in line with the code base from
> > > > > > > > >>>>>>>>>>>>>>>>>    ingest, ML, to OA (UI to continue being IPython/JS)
> > > > > > > > >>>>>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark code for normalization
> > > > > > > > >>>>>>>>>>>>>>>>>    and input into the DB
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Including the high level diagram:
> > > > > > > >>>>>>>>>>>>>>>>> +-----------------------------
> > > > > > > >>>>> ------------------------------
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> -------------------------------+
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | +--------------------------+
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +-----------------+        |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | | Master                   |  A. B. C.
> > > > > > > >>>>>>> |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Worker          |        |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |    A. Python
>  +---------------+
> > > > > A.
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> |   A.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Python     |        |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |    B. Scala              |
>  |
> > > > > > > >>>>>>> +------------->
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>         +----+   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |    C. Python             |
>  |
> > >   |
> > > > > > > >>>>>>> |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>         |    |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | +---^------+---------------+
>  |
> > >   |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +-----------------+    |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |     |      |
>  |
> > >   |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>              |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |     |      |
>  |
> > >   |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>              |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |     |     +Note--------------+
>  |
> > >   |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +-----------------+    |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |     |     |Running on a      |
>  |
> > >   |
> > > > > > > >>>>>>> |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Spark
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Streaming |    |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |     |     |worker node in    |
>  |
> > >   |
> > > > > > > >> B.
> > > > > > > >>>>>> C.
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> | B.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Scala        |    |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |     |     |the Hadoop cluster|
>  |
> > >   |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +--------> C.
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Scala        +-+  |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |     |     +------------------+
>  |
> > >   |
> > > > > > |
> > > > > > > >>>>>>> |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>         | |  |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |   A.|
> |
> > >   |
> > > > > > |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +-----------------+ |  |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |   B.|
> |
> > >   |
> > > > > > |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>            |  |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> |   C.|
> |
> > >   |
> > > > > > |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>            |  |   |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | +----------------------+
> > > > > > > +-v------+----+----+-+
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +--------------v--v-+ |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |                      |          |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> |           |
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>                 | |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |   Local FS:          |          |    hdfs
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> |           |
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> Hive / Impala    | |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |  - Binary/Text       |          |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> |           |
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> - Parquet -     | |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |    Log files -       |          |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> |           |
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>                 | |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | |                      |          |
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> |           |
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>                 | |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> | +----------------------+
> > > > > > > +--------------------+
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> +-------------------+ |
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> +-----------------------------
> > > > > > > >>>>> ------------------------------
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>> -------------------------------+
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> Please let me know your thoughts,
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>> - Nathanael
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> --
> > > > > > > >>>> Michael Ridley <[email protected]>
> > > > > > > >>>> office: (650) 352-1337
> > > > > > > >>>> mobile: (571) 438-2420
> > > > > > > >>>> Senior Solutions Architect
> > > > > > > >>>> Cloudera, Inc.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Michael Ridley <[email protected]>
> > > > > > > >> office: (650) 352-1337
> > > > > > > >> mobile: (571) 438-2420
> > > > > > > >> Senior Solutions Architect
> > > > > > > >> Cloudera, Inc.
> > > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
