Re: [Discuss] - Future plans for Spot-ingest

Mark Grover Mon, 17 Apr 2017 13:12:39 -0700

Thanks all your opinion.

I think it's good to consider two things:
1. What do (we think) users care about?
2. What's the cost of changing things?


About #1, I think users care more about what format data is written than
how the data is written. I'd argue whether that uses Hive, MR, or a custom
Parquet writer is not as important to them as long as we maintain
data/format compatibility.
About #2, having worked on several projects, I find that it's rather
difficult to keep up with Parquet. Even in Spark, there are a few different
ways to write to Parquet - there's a regular mode, and a legacy mode
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
which
continues to cause confusion
<https://issues.apache.org/jira/browse/SPARK-20297> till date. Parquet
itself is pretty dependent on Hadoop
<https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
and,
just integrating it with systems with a lot of developers (like Spark
<https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
is still a lot of work.

I personally think we should leverage higher level tools like Hive, or
Spark to write data in widespread formats (Parquet, being a very good
example) but I personally wouldn't encourage us to manage the writers
ourselves.

Thoughts?
Mark

On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]>
wrote:

> Without having given it too terribly much thought, that seems like an OK
> approach.
>
> Michael
>
> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]>
> wrote:
>
> > I think the question is rather we can write the data generically to HDFS
> > as parquet without the use of hive/impala?
> >
> > Today we write parquet data using the hive/mapreduce method.
> > As part of the redesign i’d like to use libraries for this as opposed to
> a
> > hadoop dependency.
> > I think it would be preferred to use the python master to write the data
> > into the format we want, then do normalization of the data in spark
> > streaming.
> > Any thoughts?
> >
> > - Nathanael
> >
> >
> >
> > > On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]>
> > wrote:
> > >
> > > I had thought that the plan was to write the data in Parquet in HDFS
> > > ultimately.
> > >
> > > Michael
> > >
> > > On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]>
> > wrote:
> > >
> > >> Hi Mark,
> > >>
> > >> Thank you so much for hearing my argument. And I definetly understand
> > that
> > >> you guys have bunch of things to do. My only concern is that I hope it
> > >> doesn't take too long too support other backends. For example @Kenneth
> > had
> > >> given an example of LAMP stack had not moved away from mysql yet which
> > >> essentially means its probably a decade ? I see that in the current
> > >> architecture the results from with python multiprocessing or Spark
> > >> Streaming are written back to HDFS and  If so, can we write them in
> > parquet
> > >> format ? such that users should be able to plug in any query engine
> but
> > >> again I am not pushing you guys to do this right away or anything just
> > >> seeing if there a way for me to get started in parallel and if not
> > >> feasible, its fine I just wanted to share my 2 cents and I am glad my
> > >> argument is heard!
> > >>
> > >> Thanks much!
> > >>
> > >> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
> > >>
> > >>> Hi Kant,
> > >>> Just wanted to make sure you don't feel like we are ignoring your
> > >>> comment:-) I hear you about pluggability.
> > >>>
> > >>> The design can and should be pluggable but the project has one stack
> it
> > >>> ships out of the box with, one stack that's the default stack in the
> > >> sense
> > >>> that it's the most tested and so on. And, for us, that's our current
> > >> stack.
> > >>> If we were to take Apache Hive as an example, it shipped (and ships)
> > with
> > >>> MapReduce as the default configuration engine. At some point, Apache
> > Tez
> > >>> came along and wanted Hive to run on Tez, so they made a bunch of
> > things
> > >>> pluggable to run Hive on Tez (instead of the only option up-until
> then:
> > >>> Hive-on-MR) and then Apache Spark came and re-used some of that
> > >>> pluggability and even added some more so Hive-on-Spark could become a
> > >>> reality. In the same way, I don't think anyone disagrees here that
> > >>> pluggabilty is a good thing but it's hard to do pluggability right,
> and
> > >> at
> > >>> the right level, unless on has a clear use-case in mind.
> > >>>
> > >>> As a project, we have many things to do and I personally think the
> > >> biggest
> > >>> bang for the buck for us in making Spot a really solid and the best
> > cyber
> > >>> security solution isn't pluggability but the things we are working on
> > - a
> > >>> better user interface, a common/unified approach to storing and
> > modeling
> > >>> data, etc.
> > >>>
> > >>> Having said that, we are open, if it's important to you or someone
> > else,
> > >>> we'd be happy to receive and review those patches.
> > >>>
> > >>> Thanks!
> > >>> Mark
> > >>>
> > >>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]>
> > >> wrote:
> > >>>
> > >>>> Thanks Ross! and yes option C sounds good to me as well however I
> just
> > >>>> think Distributed Sql query engine  and the resource manager should
> be
> > >>>> pluggable.
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <
> [email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Option C is to use python on the front end of ingest pipeline and
> > >>>>> spark/scala on the back end.
> > >>>>>
> > >>>>> Option A uses python workers on the backend
> > >>>>>
> > >>>>> Option B uses all scala.
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> -----Original Message-----
> > >>>>> From: kant kodali [mailto:[email protected]]
> > >>>>> Sent: Friday, April 14, 2017 9:53 AM
> > >>>>> To: [email protected]
> > >>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>
> > >>>>> What is option C ? am I missing an email or something?
> > >>>>>
> > >>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
> > >>>>> [email protected]> wrote:
> > >>>>>
> > >>>>>> +1 for Python 3.x
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
> > >>>>>>
> > >>>>>>> I think that C is the strong solution, getting the ingest really
> > >>>>>>> strong is going to lower barriers to adoption. Doing it in Python
> > >>>>>>> will open up the ingest portion of the project to include many
> > >> more
> > >>>>> developers.
> > >>>>>>>
> > >>>>>>> Before it comes up I would like to throw the following on the
> > >>> pile...
> > >>>>>>> Major
> > >>>>>>> python projects django/flash, others are dropping 2.x support in
> > >>>>>>> releases scheduled in the next 6 to 8 months. Hadoop projects in
> > >>>>>>> general tend to lag in modern python support, lets please build
> > >> this
> > >>>>>>> in 3.5 so that we don't have to immediately expect a rebuild in
> > >> the
> > >>>>>>> pipeline.
> > >>>>>>>
> > >>>>>>> -Vote C
> > >>>>>>>
> > >>>>>>> Thanks Nate
> > >>>>>>>
> > >>>>>>> Austin
> > >>>>>>>
> > >>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]>
> > >> wrote:
> > >>>>>>>
> > >>>>>>> I really like option C because it gives a lot of flexibility for
> > >>>>>>> ingest
> > >>>>>>>> (python vs scala) but still has the robust spark streaming
> > >> backend
> > >>>>>>>> for performance.
> > >>>>>>>>
> > >>>>>>>> Thanks for putting this together Nate.
> > >>>>>>>>
> > >>>>>>>> Alan
> > >>>>>>>>
> > >>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
> > >>>>>>>> [email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> I agree. We should continue making the existing stack more
> mature
> > >>> at
> > >>>>>>>>> this point. Maybe if we have enough community support we can
> add
> > >>>>>>>>> additional datastores.
> > >>>>>>>>>
> > >>>>>>>>> Chokha.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi Kant,
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using
> > >>>>>>>>>> Hive+Spark, then sure you'll have YARN.
> > >>>>>>>>>>
> > >>>>>>>>>> Haven't seen any HIVE on Mesos so far. As said, Spot is based
> > >> on
> > >>> a
> > >>>>>>>>>> quite standard Hadoop stack and I wouldn't switch too many
> > >> pieces
> > >>>>> yet.
> > >>>>>>>>>>
> > >>>>>>>>>> In most Opensource projects you start relying on a well-known
> > >>>>>>>>>> stack and then you begin to support other DB backends once
> it's
> > >>>>>>>>>> quite mature. Think in the loads of LAMP apps which haven't
> > >> been
> > >>>>>>>>>> ported away from MySQL yet.
> > >>>>>>>>>>
> > >>>>>>>>>> In any case, you'll need a high performance SQL + Massive
> > >> Storage
> > >>>>>>>>>> + Machine Learning + Massive Ingestion, and... ATM, that can
> be
> > >>>>>>>>>> only provided by Hadoop.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards!
> > >>>>>>>>>>
> > >>>>>>>>>> Kenneth
> > >>>>>>>>>>
> > >>>>>>>>>> A 2017-04-14 12:56, kant kodali escrigué:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Kenneth,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for the response.  I think you made a case for HDFS
> > >>>>>>>>>>> however users may want to use S3 or some other FS in which
> > >> case
> > >>>>>>>>>>> they can use Auxilio (hoping that there are no changes needed
> > >>>>>>>>>>> within Spot in which case I
> > >>>>>>>>>>>
> > >>>>>>>>>> can
> > >>>>>>>>
> > >>>>>>>>> agree to that). for example, Netflix stores all there data into
> > >> S3
> > >>>>>>>>>>>
> > >>>>>>>>>>> The distributed sql query engine I would say should be
> > >> pluggable
> > >>>>>>>>>>> with whatever user may want to use and there a bunch of them
> > >> out
> > >>>>> there.
> > >>>>>>>>>>>
> > >>>>>>>>>> sure
> > >>>>>>>>
> > >>>>>>>>> Impala is better than hive but what if users are already using
> > >>>>>>>>>>>
> > >>>>>>>>>> something
> > >>>>>>>>
> > >>>>>>>>> else like Drill or Presto?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Me personally, would not assume that users are willing to
> > >> deploy
> > >>>>>>>>>>> all
> > >>>>>>>>>>>
> > >>>>>>>>>> of
> > >>>>>>>>
> > >>>>>>>>> that and make their existing stack more complicated at very
> > >> least
> > >>> I
> > >>>>>>>>>>> would
> > >>>>>>>>>>> say it is a uphill battle. Things have been changing rapidly
> > >> in
> > >>>>>>>>>>> Big
> > >>>>>>>>>>>
> > >>>>>>>>>> data
> > >>>>>>>>
> > >>>>>>>>> space so whatever we think is standard won't be standard
> anymore
> > >>>>>>>>> but
> > >>>>>>>>>>> importantly there shouldn't be any reason why we shouldn't be
> > >>>>>>>>>>> flexible right.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Also I am not sure why only YARN? why not make that also more
> > >>>>>>>>>>> flexible so users can pick Mesos or standalone.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think Flexibility is a key for a wide adoption rather than
> > >> the
> > >>>>>>>>>>>
> > >>>>>>>>>> tightly
> > >>>>>>>>
> > >>>>>>>>> coupled architecture.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks!
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
> > >>>>>>>>>>> <[email protected]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> PS: you need a big data platform to be able to collect all
> > >> those
> > >>>>>>>>>>>> netflows
> > >>>>>>>>>>>> and logs.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Spot isn't intended for SMBs, that's clear, then you need
> > >> loads
> > >>>>>>>>>>>> of data to get ML working properly, and somewhere to run
> > >> those
> > >>>>>>>>>>>> algorithms. That
> > >>>>>>>>>>>>
> > >>>>>>>>>>> is
> > >>>>>>>>
> > >>>>>>>>> Hadoop.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Regards!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Kenneth
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Sent from my Mi phone
> > >>>>>>>>>>>> On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM
> > >>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I somehow think the architecture is too complicated for wide
> > >>>>>>>>>>>> adoption since it requires to install the following.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> HDFS.
> > >>>>>>>>>>>> HIVE.
> > >>>>>>>>>>>> IMPALA.
> > >>>>>>>>>>>> KAFKA.
> > >>>>>>>>>>>> SPARK (YARN).
> > >>>>>>>>>>>> YARN.
> > >>>>>>>>>>>> Zookeeper.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Currently there are way too many dependencies that
> > >> discourages
> > >>>>>>>>>>>> lot of users from using it because they have to go through
> > >>>>>>>>>>>> deployment of all that required software. I think for wide
> > >>>>>>>>>>>> option we should minimize the dependencies and have more
> > >>>>>>>>>>>> pluggable architecture. for example I am
> > >>>>>>>>>>>>
> > >>>>>>>>>>> not
> > >>>>>>>>
> > >>>>>>>>> sure why HIVE & IMPALA both are required? why not just use
> Spark
> > >>>>>>>>> SQL
> > >>>>>>>>>>>> since
> > >>>>>>>>>>>> its already dependency or say users may want to use their
> own
> > >>>>>>>>>>>> distributed query engine they like such as Apache Drill or
> > >>>>>>>>>>>> something else. we should be flexible enough to provide that
> > >>>>>>>>>>>> option
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Also, I see that HDFS is used such that collectors can
> > >> receive
> > >>>>>>>>>>>> file path's through Kafka and be able to read a file. How
> big
> > >>>>>>>>>>>> are these files ?
> > >>>>>>>>>>>> Do we
> > >>>>>>>>>>>> really need HDFS for this? Why not provide more ways to send
> > >>>>>>>>>>>> data such as sending data directly through Kafka or say just
> > >>>>>>>>>>>> leaving up to the user to specify the file location as an
> > >>>>>>>>>>>> argument to collector process
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Finally, I learnt that to generate Net flow data one would
> > >>>>>>>>>>>> require a specific hardware. This really means Apache Spot
> is
> > >>>>>>>>>>>> not meant for everyone.
> > >>>>>>>>>>>> I thought Apache Spot can be used to analyze the network
> > >>> traffic
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>
> > >>>>>>>>>>> any
> > >>>>>>>>
> > >>>>>>>>> machine but if it requires a specific hard then I think it is
> > >>>>>>>>>>>> targeted for
> > >>>>>>>>>>>> specific group of people.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The real strength of Apache Spot should mainly be just
> > >>> analyzing
> > >>>>>>>>>>>> network traffic through ML.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
> > >>>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks, Nate,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Nate.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]]
> > >>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
> > >>>>>>>>>>>>> To: [email protected]
> > >>>>>>>>>>>>> Cc: [email protected];
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>> [email protected]
> > >>>>>>>>
> > >>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I was really hoping it came through ok, Oh well :) Here’s
> an
> > >>>>>>>>>>>>> image form:
> > >>>>>>>>>>>>> http://imgur.com/a/DUDsD
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The diagram became garbled in the text format.
> > >>>>>>>>>>>>>> Could you resend it as a pdf?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>> Nate
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
> > >>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
> > >>>>>>>>>>>>>> To: [email protected];
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> [email protected];
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> [email protected]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> How would you like to see Spot-ingest change?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> A. continue development on the Python Master/Worker with
> > >>> focus
> > >>>>>>>>>>>>>> on
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> performance / error handling / logging B. Develop Scala
> > >> based
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>> ingest to
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> inline with code base from ingest, ml, to OA (UI to
> continue
> > >>>>>>>>>>>>> being
> > >>>>>>>>>>>>> ipython/JS) C. Python ingest Worker with Scala based Spark
> > >>> code
> > >>>>>>>>>>>>> for normalization and input into DB
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Including the high level diagram:
> > >>>>>>>>>>>>>> +-----------------------------
> > >> ------------------------------
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------+
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | +--------------------------+
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> +-----------------+        |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | | Master                   |  A. B. C.
> > >>>>  |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> Worker          |        |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |    A. Python             +---------------+      A.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> |   A.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Python     |        |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |    B. Scala              |               |
> > >>>> +------------->
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>          +----+   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |    C. Python             |               |    |
> > >>>> |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>          |    |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | +---^------+---------------+               |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>  +-----------------+    |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |     |      |                               |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>               |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |     |      |                               |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>               |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |     |     +Note--------------+             |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>  +-----------------+    |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |     |     |Running on a      |             |    |
> > >>>> |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> Spark
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Streaming |    |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |     |     |worker node in    |             |    |
> B.
> > >>> C.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> | B.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Scala        |    |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |     |     |the Hadoop cluster|             |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> +--------> C.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Scala        +-+  |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |     |     +------------------+             |    |    |
> > >>>>  |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>          | |  |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |   A.|                                      |    |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> +-----------------+ |  |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |   B.|                                      |    |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>             |  |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> |   C.|                                      |    |    |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>             |  |   |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | +----------------------+          +-v------+----+----+-+
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>  +--------------v--v-+ |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |                      |          |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> |           |
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>                  | |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |   Local FS:          |          |    hdfs
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> |           |
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hive / Impala    | |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |  - Binary/Text       |          |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> |           |
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>  - Parquet -     | |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |    Log files -       |          |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> |           |
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>                  | |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | |                      |          |
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> |           |
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>                  | |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> | +----------------------+          +--------------------+
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>  +-------------------+ |
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> +-----------------------------
> > >> ------------------------------
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------+
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Please let me know your thoughts,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - Nathanael
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> > >
> > >
> > > --
> > > Michael Ridley <[email protected]>
> > > office: (650) 352-1337
> > > mobile: (571) 438-2420
> > > Senior Solutions Architect
> > > Cloudera, Inc.
> >
> >
>
>
> --
> Michael Ridley <[email protected]>
> office: (650) 352-1337
> mobile: (571) 438-2420
> Senior Solutions Architect
> Cloudera, Inc.
>

Re: [Discuss] - Future plans for Spot-ingest

Reply via email to