> On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:
>
> Thanks, all, for your opinions.
>
> I think it's good to consider two things:
> 1. What do (we think) users care about?
> 2. What's the cost of changing things?
>
> About #1, I think users care more about the format the data is written in
> than about how it gets written. I'd argue that whether we use Hive, MR, or a
> custom Parquet writer is not as important to them, as long as we maintain
> data/format compatibility.
> About #2, having worked on several projects, I find that it's rather
> difficult to keep up with Parquet. Even in Spark, there are a few different
> ways to write to Parquet - there's a regular mode, and a legacy mode
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
> which continues to cause confusion
> <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet
> itself is pretty dependent on Hadoop
> <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
> and just integrating it with systems that have a lot of developers (like Spark
> <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
> is still a lot of work.
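For reference, the "legacy mode" referenced above is controlled by a single Spark SQL property (a config sketch, not part of this thread's proposal):

```properties
# Spark SQL property behind the legacy mode in ParquetWriteSupport
# (see SPARK-20297). Default is false; setting it to true makes Spark
# write Parquet in the older, Spark 1.x / Hive-compatible layout.
spark.sql.parquet.writeLegacyFormat  true
```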
>
> I personally think we should leverage higher-level tools like Hive or
> Spark to write data in widespread formats (Parquet being a very good
> example), but I wouldn't encourage us to manage the writers ourselves.
>
> Thoughts?
> Mark
>
> On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]>
> wrote:
>
>> Without having given it too terribly much thought, that seems like an OK
>> approach.
>>
>> Michael
>>
>> On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]>
>> wrote:
>>
>>> I think the question is rather whether we can write the data generically
>>> to HDFS as Parquet without the use of Hive/Impala.
>>>
>>> Today we write Parquet data using the Hive/MapReduce method.
>>> As part of the redesign I'd like to use libraries for this, as opposed
>>> to a Hadoop dependency.
>>> I think it would be preferable to use the Python master to write the
>>> data into the format we want, then do normalization of the data in
>>> Spark Streaming.
>>> Any thoughts?
>>>
>>> - Nathanael
>>>
>>>
>>>
>>>> On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]>
>>> wrote:
>>>>
>>>> I had thought that the plan was to write the data in Parquet in HDFS
>>>> ultimately.
>>>>
>>>> Michael
>>>>
>>>> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]>
>>> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Thank you so much for hearing my argument. And I definitely understand
>>>>> that you guys have a bunch of things to do. My only concern is that I
>>>>> hope it doesn't take too long to support other backends. For example,
>>>>> @Kenneth gave the example of LAMP stacks that still haven't moved away
>>>>> from MySQL, which essentially means it's probably a decade? I see that
>>>>> in the current architecture the results from Python multiprocessing or
>>>>> Spark Streaming are written back to HDFS. If so, can we write them in
>>>>> Parquet format, so that users are able to plug in any query engine?
>>>>> But again, I am not pushing you guys to do this right away or anything;
>>>>> I'm just seeing if there is a way for me to get started in parallel.
>>>>> If it's not feasible, that's fine; I just wanted to share my 2 cents,
>>>>> and I am glad my argument is heard!
>>>>>
>>>>> Thanks much!
>>>>>
>>>>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
>>>>>
>>>>>> Hi Kant,
>>>>>> Just wanted to make sure you don't feel like we are ignoring your
>>>>>> comment :-) I hear you about pluggability.
>>>>>>
>>>>>> The design can and should be pluggable, but the project has one stack
>>>>>> it ships with out of the box: one default stack, in the sense that
>>>>>> it's the most tested and so on. And, for us, that's our current stack.
>>>>>> If we were to take Apache Hive as an example, it shipped (and ships)
>>>>>> with MapReduce as the default execution engine. At some point, Apache
>>>>>> Tez came along and wanted Hive to run on Tez, so they made a bunch of
>>>>>> things pluggable to run Hive on Tez (instead of the only option until
>>>>>> then: Hive-on-MR), and then Apache Spark came and re-used some of that
>>>>>> pluggability and even added some more so Hive-on-Spark could become a
>>>>>> reality. In the same way, I don't think anyone here disagrees that
>>>>>> pluggability is a good thing, but it's hard to do pluggability right,
>>>>>> and at the right level, unless one has a clear use case in mind.
>>>>>>
>>>>>> As a project, we have many things to do, and I personally think the
>>>>>> biggest bang for the buck in making Spot a really solid and the best
>>>>>> cyber security solution isn't pluggability but the things we are
>>>>>> working on: a better user interface, a common/unified approach to
>>>>>> storing and modeling data, etc.
>>>>>>
>>>>>> Having said that, we are open: if it's important to you or someone
>>>>>> else, we'd be happy to receive and review those patches.
>>>>>>
>>>>>> Thanks!
>>>>>> Mark
>>>>>>
>>>>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Ross! And yes, option C sounds good to me as well; however,
>>>>>>> I just think the distributed SQL query engine and the resource
>>>>>>> manager should be pluggable.
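To make the pluggability suggestion concrete, here is a minimal sketch of what a pluggable query-engine layer could look like: code against a small interface and register concrete backends by name. This is illustrative Python, not Spot's actual API; all class and function names are invented.

```python
from abc import ABC, abstractmethod

class QueryEngine(ABC):
    """Minimal contract every query backend must satisfy."""

    @abstractmethod
    def execute(self, sql: str) -> list:
        """Run a query on the backing engine and return rows."""

# Simple name -> class registry so backends can be swapped by config.
_ENGINES = {}

def register_engine(name, engine_cls):
    _ENGINES[name] = engine_cls

def get_engine(name) -> QueryEngine:
    return _ENGINES[name]()   # KeyError signals an unsupported backend

class ImpalaEngine(QueryEngine):
    """Stand-in for the default, shipped backend."""
    def execute(self, sql):
        return ["impala: " + sql]   # placeholder for a real client call

class DrillEngine(QueryEngine):
    """Stand-in for a user-contributed alternative backend."""
    def execute(self, sql):
        return ["drill: " + sql]

register_engine("impala", ImpalaEngine)
register_engine("drill", DrillEngine)
```

The rest of the code base only ever calls `get_engine(config_value).execute(...)`, so adding Drill or Presto support is a new class plus one `register_engine` call, with no changes elsewhere.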
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Option C is to use Python on the front end of the ingest pipeline
>>>>>>>> and Spark/Scala on the back end.
>>>>>>>>
>>>>>>>> Option A uses Python workers on the back end.
>>>>>>>>
>>>>>>>> Option B uses all Scala.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: kant kodali [mailto:[email protected]]
>>>>>>>> Sent: Friday, April 14, 2017 9:53 AM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>>>
>>>>>>>> What is option C ? am I missing an email or something?
>>>>>>>>
>>>>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> +1 for Python 3.x
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>>>>>>>>>
>>>>>>>>>> I think that C is the strong solution; getting the ingest really
>>>>>>>>>> strong is going to lower barriers to adoption. Doing it in Python
>>>>>>>>>> will open up the ingest portion of the project to include many
>>>>>>>>>> more developers.
>>>>>>>>>>
>>>>>>>>>> Before it comes up, I would like to throw the following on the
>>>>>>>>>> pile... Major Python projects (Django/Flask, and others) are
>>>>>>>>>> dropping 2.x support in releases scheduled in the next 6 to 8
>>>>>>>>>> months. Hadoop projects in general tend to lag in modern Python
>>>>>>>>>> support, so let's please build this in 3.5 so that we don't have
>>>>>>>>>> to immediately expect a rebuild in the pipeline.
>>>>>>>>>>
>>>>>>>>>> -Vote C
>>>>>>>>>>
>>>>>>>>>> Thanks Nate
>>>>>>>>>>
>>>>>>>>>> Austin
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]>
>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I really like option C because it gives a lot of flexibility for
>>>>>>>>>> ingest
>>>>>>>>>>> (python vs scala) but still has the robust spark streaming
>>>>> backend
>>>>>>>>>>> for performance.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for putting this together Nate.
>>>>>>>>>>>
>>>>>>>>>>> Alan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree. We should continue making the existing stack more
>>>>>>>>>>>> mature at this point. Maybe if we have enough community support
>>>>>>>>>>>> we can add additional datastores.
>>>>>>>>>>>>
>>>>>>>>>>>> Chokha.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Kant,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using
>>>>>>>>>>>>> Hive+Spark, then sure, you'll have YARN.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I haven't seen any Hive on Mesos so far. As said, Spot is based
>>>>>>>>>>>>> on a quite standard Hadoop stack, and I wouldn't switch too
>>>>>>>>>>>>> many pieces yet.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In most open-source projects you start by relying on a
>>>>>>>>>>>>> well-known stack, and then you begin to support other DB
>>>>>>>>>>>>> backends once the project is quite mature. Think of the loads
>>>>>>>>>>>>> of LAMP apps which haven't been ported away from MySQL yet.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In any case, you'll need high-performance SQL + massive storage
>>>>>>>>>>>>> + machine learning + massive ingestion, and... at the moment,
>>>>>>>>>>>>> that can only be provided by Hadoop.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kenneth
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Kenneth,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the response. I think you made a case for HDFS;
>>>>>>>>>>>>>> however, users may want to use S3 or some other FS, in which
>>>>>>>>>>>>>> case they can use Alluxio (hoping that there are no changes
>>>>>>>>>>>>>> needed within Spot, in which case I can agree to that). For
>>>>>>>>>>>>>> example, Netflix stores all their data in S3.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The distributed SQL query engine, I would say, should be
>>>>>>>>>>>>>> pluggable with whatever the user may want to use, and there
>>>>>>>>>>>>>> are a bunch of them out there. Sure, Impala is better than
>>>>>>>>>>>>>> Hive, but what if users are already using something else,
>>>>>>>>>>>>>> like Drill or Presto?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I personally would not assume that users are willing to deploy
>>>>>>>>>>>>>> all of that and make their existing stack more complicated; at
>>>>>>>>>>>>>> the very least, I would say it is an uphill battle. Things
>>>>>>>>>>>>>> have been changing rapidly in the Big Data space, so whatever
>>>>>>>>>>>>>> we think is standard won't be standard anymore; more
>>>>>>>>>>>>>> importantly, there shouldn't be any reason why we shouldn't be
>>>>>>>>>>>>>> flexible, right?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, I am not sure why only YARN? Why not make that more
>>>>>>>>>>>>>> flexible too, so users can pick Mesos or standalone?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think flexibility is key for wide adoption, rather than a
>>>>>>>>>>>>>> tightly coupled architecture.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
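On the resource-manager point, Spark itself already treats this as a deploy-time choice via the `--master` flag, so a pluggable scheduler is largely a matter of not hard-coding YARN. A CLI sketch of the standard master URLs (hosts and ports below are placeholders):

```
spark-submit --master yarn ...                # Hadoop YARN
spark-submit --master mesos://host:5050 ...   # Apache Mesos
spark-submit --master spark://host:7077 ...   # Spark standalone
spark-submit --master 'local[*]' ...          # single machine, for testing
```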
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
>>>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> PS: you need a big data platform to be able to collect all
>>>>>>>>>>>>>>> those netflows and logs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; you need loads
>>>>>>>>>>>>>>> of data to get ML working properly, and somewhere to run
>>>>>>>>>>>>>>> those algorithms. That is Hadoop.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kenneth
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sent from my Mi phone
>>>>>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I somehow think the architecture is too complicated for wide
>>>>>>>>>>>>>>> adoption, since it requires installing the following:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HDFS.
>>>>>>>>>>>>>>> HIVE.
>>>>>>>>>>>>>>> IMPALA.
>>>>>>>>>>>>>>> KAFKA.
>>>>>>>>>>>>>>> SPARK (YARN).
>>>>>>>>>>>>>>> YARN.
>>>>>>>>>>>>>>> Zookeeper.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently there are way too many dependencies, which
>>>>>>>>>>>>>>> discourages a lot of users from using it, because they have
>>>>>>>>>>>>>>> to go through deployment of all that required software. I
>>>>>>>>>>>>>>> think for wide adoption we should minimize the dependencies
>>>>>>>>>>>>>>> and have a more pluggable architecture. For example, I am
>>>>>>>>>>>>>>> not sure why both Hive and Impala are required. Why not just
>>>>>>>>>>>>>>> use Spark SQL, since it's already a dependency? Or users may
>>>>>>>>>>>>>>> want to use their own distributed query engine, such as
>>>>>>>>>>>>>>> Apache Drill or something else; we should be flexible enough
>>>>>>>>>>>>>>> to provide that option.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, I see that HDFS is used so that collectors can receive
>>>>>>>>>>>>>>> file paths through Kafka and be able to read a file. How big
>>>>>>>>>>>>>>> are these files? Do we really need HDFS for this? Why not
>>>>>>>>>>>>>>> provide more ways to send data, such as sending data
>>>>>>>>>>>>>>> directly through Kafka, or just leaving it up to the user to
>>>>>>>>>>>>>>> specify the file location as an argument to the collector
>>>>>>>>>>>>>>> process?
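As a hedged sketch of that suggestion, a collector could take its input either as an explicit file-path argument or as paths arriving on a stream (stdin stands in for a Kafka topic here). Function and flag names are illustrative, not Spot's real CLI.

```python
import argparse
import sys

def iter_input_files(argv=None):
    """Yield file paths to ingest, from --file or from stdin lines."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", help="ingest a single file and exit")
    args = parser.parse_args(argv)
    if args.file:
        # User named the file directly on the command line.
        yield args.file
    else:
        # Otherwise consume a stream of paths, one per line
        # (e.g. piped from a Kafka console consumer).
        for line in sys.stdin:
            path = line.strip()
            if path:
                yield path
```

Either way, the downstream parsing code sees a plain iterator of paths and never needs to know whether HDFS, Kafka, or an argument supplied them.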
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Finally, I learned that to generate NetFlow data one would
>>>>>>>>>>>>>>> require specific hardware. This really means Apache Spot is
>>>>>>>>>>>>>>> not meant for everyone. I thought Apache Spot could be used
>>>>>>>>>>>>>>> to analyze the network traffic of any machine, but if it
>>>>>>>>>>>>>>> requires specific hardware then I think it is targeted at a
>>>>>>>>>>>>>>> specific group of people.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The real strength of Apache Spot should mainly be just
>>>>>>>>>>>>>>> analyzing network traffic through ML.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks, Nate.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nate.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]]
>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>> Cc: [email protected]; [email protected]
>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was really hoping it came through OK. Oh well :) Here it
>>>>>>>>>>>>>>>> is in image form: http://imgur.com/a/DUDsD
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The diagram became garbled in the text format.
>>>>>>>>>>>>>>>>> Could you resend it as a pdf?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Nate
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
>>>>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>>>>>>>>>>>>>>>> To: [email protected]; [email protected]; [email protected]
>>>>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker, with
>>>>>>>>>>>>>>>>> focus on performance / error handling / logging.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> B. Develop a Scala-based ingest, to be in line with the
>>>>>>>>>>>>>>>>> code base from ingest and ML to OA (the UI to continue
>>>>>>>>>>>>>>>>> being ipython/JS).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> C. Python ingest Worker, with Scala-based Spark code for
>>>>>>>>>>>>>>>>> normalization and input into the DB.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Including the high level diagram:
>>>>>>>>>>>>>>>>> [ASCII diagram, garbled in transit; a readable version is
>>>>>>>>>>>>>>>>> at http://imgur.com/a/DUDsD. It shows a Master (A. Python /
>>>>>>>>>>>>>>>>> B. Scala / C. Python) running on a worker node in the
>>>>>>>>>>>>>>>>> Hadoop cluster, feeding Workers (A. Python / B., C. Scala,
>>>>>>>>>>>>>>>>> as Spark Streaming), which write binary/text log files to
>>>>>>>>>>>>>>>>> the local FS and Parquet to HDFS for Hive / Impala.]
>>>>>>>>>>>>>>>>> Please let me know your thoughts,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Nathanael
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Michael Ridley <[email protected]>
>>>> office: (650) 352-1337
>>>> mobile: (571) 438-2420
>>>> Senior Solutions Architect
>>>> Cloudera, Inc.
>>>
>>>
>>
>>
>> --
>> Michael Ridley <[email protected]>
>> office: (650) 352-1337
>> mobile: (571) 438-2420
>> Senior Solutions Architect
>> Cloudera, Inc.
>>