I think the question is rather whether we can write the data generically to
HDFS as Parquet without the use of Hive/Impala.

Today we write Parquet data using the Hive/MapReduce method.
As part of the redesign I'd like to use libraries for this as opposed to a
Hadoop dependency.
I think it would be preferable to use the Python master to write the data into
the format we want, then do normalization of the data in Spark Streaming.
Any thoughts?

- Nathanael



> On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:
> 
> I had thought that the plan was to write the data in Parquet in HDFS
> ultimately.
> 
> Michael
> 
> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:
> 
>> Hi Mark,
>> 
>> Thank you so much for hearing my argument. And I definitely understand that
>> you guys have a bunch of things to do. My only concern is that I hope it
>> doesn't take too long to support other backends. For example, @Kenneth gave
>> the example of the LAMP stack, which hasn't moved away from MySQL yet,
>> which essentially means it's probably a decade? I see that in the current
>> architecture the results from Python multiprocessing or Spark Streaming
>> are written back to HDFS. If so, can we write them in Parquet format, so
>> that users are able to plug in any query engine? Again, I am not pushing
>> you guys to do this right away or anything; I'm just seeing if there is a
>> way for me to get started in parallel. If it's not feasible, that's fine.
>> I just wanted to share my 2 cents, and I am glad my argument is heard!
>> 
>> Thanks much!
>> 
>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
>> 
>>> Hi Kant,
>>> Just wanted to make sure you don't feel like we are ignoring your
>>> comment :-) I hear you about pluggability.
>>> 
>>> The design can and should be pluggable, but the project has one stack it
>>> ships with out of the box, one default stack in the sense that it's the
>>> most tested and so on. And, for us, that's our current stack.
>>> If we take Apache Hive as an example, it shipped (and ships) with
>>> MapReduce as the default execution engine. At some point, Apache Tez
>>> came along and wanted Hive to run on Tez, so they made a bunch of things
>>> pluggable to run Hive on Tez (instead of the only option up until then,
>>> Hive-on-MR), and then Apache Spark came and reused some of that
>>> pluggability and even added some more so Hive-on-Spark could become a
>>> reality. In the same way, I don't think anyone here disagrees that
>>> pluggability is a good thing, but it's hard to do pluggability right, and
>>> at the right level, unless one has a clear use case in mind.
>>> 
>>> As a project, we have many things to do, and I personally think the
>>> biggest bang for the buck in making Spot a really solid, best-in-class
>>> cyber security solution isn't pluggability but the things we are working
>>> on: a better user interface, a common/unified approach to storing and
>>> modeling data, etc.
>>> 
>>> Having said that, we are open; if it's important to you or someone else,
>>> we'd be happy to receive and review those patches.
>>> 
>>> Thanks!
>>> Mark
>>> 
>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:
>>> 
>>>> Thanks, Ross! And yes, option C sounds good to me as well; however, I
>>>> just think the distributed SQL query engine and the resource manager
>>>> should be pluggable.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]>
>>>> wrote:
>>>> 
>>>>> Option C is to use Python on the front end of the ingest pipeline and
>>>>> Spark/Scala on the back end.
>>>>> 
>>>>> Option A uses Python workers on the back end.
>>>>> 
>>>>> Option B uses all Scala.
>>>>> 
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: kant kodali [mailto:[email protected]]
>>>>> Sent: Friday, April 14, 2017 9:53 AM
>>>>> To: [email protected]
>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>> 
>>>>> What is option C? Am I missing an email or something?
>>>>> 
>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> +1 for Python 3.x
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>>>>>> 
>>>>>>> I think that C is the strong solution; getting the ingest really
>>>>>>> strong is going to lower barriers to adoption. Doing it in Python
>>>>>>> will open up the ingest portion of the project to include many more
>>>>>>> developers.
>>>>>>> 
>>>>>>> Before it comes up, I would like to throw the following on the pile:
>>>>>>> major Python projects (Django/Flask, among others) are dropping 2.x
>>>>>>> support in releases scheduled in the next 6 to 8 months. Hadoop
>>>>>>> projects in general tend to lag in modern Python support, so let's
>>>>>>> please build this in 3.5 so that we don't have to immediately expect
>>>>>>> a rebuild in the pipeline.
>>>>>>> 
>>>>>>> -Vote C
>>>>>>> 
>>>>>>> Thanks Nate
>>>>>>> 
>>>>>>> Austin
>>>>>>> 
>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM, Alan Ross <[email protected]> wrote:
>>>>>>> 
>>>>>>>> I really like option C because it gives a lot of flexibility for
>>>>>>>> ingest (Python vs. Scala) but still has the robust Spark Streaming
>>>>>>>> back end for performance.
>>>>>>>> 
>>>>>>>> Thanks for putting this together Nate.
>>>>>>>> 
>>>>>>>> Alan
>>>>>>>> 
>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <
>>>>>>>> [email protected]> wrote:
>>>>>>>> 
>>>>>>>>> I agree. We should continue making the existing stack more mature
>>>>>>>>> at this point. Maybe if we have enough community support we can add
>>>>>>>>> additional datastores.
>>>>>>>>> 
>>>>>>>>> Chokha.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Kant,
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using
>>>>>>>>>> Hive+Spark, then sure, you'll have YARN.
>>>>>>>>>> 
>>>>>>>>>> I haven't seen any Hive on Mesos so far. As said, Spot is based on
>>>>>>>>>> a quite standard Hadoop stack, and I wouldn't switch too many
>>>>>>>>>> pieces yet.
>>>>>>>>>> 
>>>>>>>>>> In most open-source projects you start by relying on a well-known
>>>>>>>>>> stack and then begin to support other DB backends once the project
>>>>>>>>>> is quite mature. Think of the loads of LAMP apps which haven't
>>>>>>>>>> been ported away from MySQL yet.
>>>>>>>>>> 
>>>>>>>>>> In any case, you'll need high-performance SQL + massive storage +
>>>>>>>>>> machine learning + massive ingestion, and... at the moment, that
>>>>>>>>>> can only be provided by Hadoop.
>>>>>>>>>> 
>>>>>>>>>> Regards!
>>>>>>>>>> 
>>>>>>>>>> Kenneth
>>>>>>>>>> 
>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Kenneth,
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for the response. I think you made a case for HDFS;
>>>>>>>>>>> however, users may want to use S3 or some other FS, in which case
>>>>>>>>>>> they can use Alluxio (hoping that there are no changes needed
>>>>>>>>>>> within Spot, in which case I can agree to that). For example,
>>>>>>>>>>> Netflix stores all their data in S3.
>>>>>>>>>>> 
>>>>>>>>>>> The distributed SQL query engine, I would say, should be
>>>>>>>>>>> pluggable with whatever the user may want to use, and there are a
>>>>>>>>>>> bunch of them out there. Sure, Impala is better than Hive, but
>>>>>>>>>>> what if users are already using something else like Drill or
>>>>>>>>>>> Presto?
>>>>>>>>>>> 
>>>>>>>>>>> Personally, I would not assume that users are willing to deploy
>>>>>>>>>>> all of that and make their existing stack more complicated; at
>>>>>>>>>>> the very least I would say it is an uphill battle. Things have
>>>>>>>>>>> been changing rapidly in the big data space, so whatever we think
>>>>>>>>>>> is standard won't be standard anymore; but more importantly,
>>>>>>>>>>> there shouldn't be any reason why we shouldn't be flexible,
>>>>>>>>>>> right?
>>>>>>>>>>> 
>>>>>>>>>>> Also, I am not sure why only YARN? Why not make that more
>>>>>>>>>>> flexible too, so users can pick Mesos or standalone?
>>>>>>>>>>> 
>>>>>>>>>>> I think flexibility is the key to wide adoption, rather than a
>>>>>>>>>>> tightly coupled architecture.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks!
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza
>>>>>>>>>>> <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> PS: you need a big data platform to be able to collect all those
>>>>>>>>>>>> netflows and logs.
>>>>>>>>>>>> 
>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; you need loads of
>>>>>>>>>>>> data to get ML working properly, and somewhere to run those
>>>>>>>>>>>> algorithms. That is Hadoop.
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards!
>>>>>>>>>>>> 
>>>>>>>>>>>> Kenneth
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my Mi phone
>>>>>>>>>>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
>>>>>>>>>>>> 
>>>>>>>>>>>> I somehow think the architecture is too complicated for wide
>>>>>>>>>>>> adoption, since it requires installing the following:
>>>>>>>>>>>> 
>>>>>>>>>>>> HDFS
>>>>>>>>>>>> Hive
>>>>>>>>>>>> Impala
>>>>>>>>>>>> Kafka
>>>>>>>>>>>> Spark (on YARN)
>>>>>>>>>>>> YARN
>>>>>>>>>>>> ZooKeeper
>>>>>>>>>>>> 
>>>>>>>>>>>> Currently there are way too many dependencies, which discourages
>>>>>>>>>>>> a lot of users because they have to go through deployment of all
>>>>>>>>>>>> that required software. I think for wide adoption we should
>>>>>>>>>>>> minimize the dependencies and have a more pluggable
>>>>>>>>>>>> architecture. For example, I am not sure why both Hive and
>>>>>>>>>>>> Impala are required. Why not just use Spark SQL, since it's
>>>>>>>>>>>> already a dependency? Or users may want to use their own
>>>>>>>>>>>> distributed query engine, such as Apache Drill or something
>>>>>>>>>>>> else; we should be flexible enough to provide that option.
>>>>>>>>>>>> 
>>>>>>>>>>>> Also, I see that HDFS is used so that collectors can receive
>>>>>>>>>>>> file paths through Kafka and read the files. How big are these
>>>>>>>>>>>> files? Do we really need HDFS for this? Why not provide more
>>>>>>>>>>>> ways to send data, such as sending it directly through Kafka, or
>>>>>>>>>>>> just leaving it up to the user to specify the file location as
>>>>>>>>>>>> an argument to the collector process?
>>>>>>>>>>>> 
>>>>>>>>>>>> Finally, I learned that to generate NetFlow data one would
>>>>>>>>>>>> require specific hardware. This really means Apache Spot is not
>>>>>>>>>>>> meant for everyone. I thought Apache Spot could be used to
>>>>>>>>>>>> analyze the network traffic of any machine, but if it requires
>>>>>>>>>>>> specific hardware then I think it is targeted at a specific
>>>>>>>>>>>> group of people.
>>>>>>>>>>>> 
>>>>>>>>>>>> The real strength of Apache Spot should mainly be in analyzing
>>>>>>>>>>>> network traffic through ML.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks!
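The file-path-over-Kafka pattern discussed above can be sketched minimally as follows, with the Kafka consumption simulated so the example stands alone (the message shape, field names, and helper are hypothetical, not Spot's actual protocol):

```python
import json
import pathlib
import tempfile

def handle_path_message(raw):
    """Collector step: a message carries a file path; open and parse the
    file it points to. (In Spot the path would be on HDFS, and `raw` would
    come from a real Kafka consumer rather than being built locally.)"""
    path = pathlib.Path(json.loads(raw.decode())["path"])
    return [json.loads(line) for line in path.read_text().splitlines()]

# Simulate one message instead of consuming from a broker.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write('{"src_ip": "10.0.0.1"}\n{"src_ip": "10.0.0.2"}\n')
msg = json.dumps({"path": f.name}).encode()

records = handle_path_message(msg)
print(len(records))  # 2
```

Sending the data itself through Kafka, as suggested, would replace the path lookup with direct payload parsing.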
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks, Nate,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Nate.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]]
>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>> Cc: [email protected]; [email protected]
>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I was really hoping it came through OK. Oh well :) Here it is
>>>>>>>>>>>>> in image form: http://imgur.com/a/DUDsD
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The diagram became garbled in the text format.
>>>>>>>>>>>>>> Could you resend it as a pdf?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Nate
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>>>>>>>>>>>>> To: [email protected]; [email protected]; [email protected]
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> A. Continue development on the Python master/worker, with a
>>>>>>>>>>>>>> focus on performance / error handling / logging.
>>>>>>>>>>>>>> B. Develop Scala-based ingest to be in line with the code base
>>>>>>>>>>>>>> from ingest and ML through OA (the UI to continue being
>>>>>>>>>>>>>> IPython/JS).
>>>>>>>>>>>>>> C. Python ingest workers with Scala-based Spark code for
>>>>>>>>>>>>>> normalization and input into the DB.
>>>>>>>>>>>>>> Including the high level diagram:
>>>>>>>>>>>>>> +-----------------------------
>> ------------------------------
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------+
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | +--------------------------+
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> +-----------------+        |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | | Master                   |  A. B. C.
>>>>  |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> Worker          |        |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |    A. Python             +---------------+      A.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> |   A.
>>>>>>>>>>>> 
>>>>>>>>>>>>> Python     |        |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |    B. Scala              |               |
>>>> +------------->
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>          +----+   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |    C. Python             |               |    |
>>>> |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>          |    |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | +---^------+---------------+               |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>  +-----------------+    |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |     |      |                               |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>               |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |     |      |                               |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>               |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |     |     +Note--------------+             |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>  +-----------------+    |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |     |     |Running on a      |             |    |
>>>> |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> Spark
>>>>>>>>>>>> 
>>>>>>>>>>>>> Streaming |    |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |     |     |worker node in    |             |    |      B.
>>> C.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> | B.
>>>>>>>>>>>> 
>>>>>>>>>>>>> Scala        |    |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |     |     |the Hadoop cluster|             |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> +--------> C.
>>>>>>>>>>>> 
>>>>>>>>>>>>> Scala        +-+  |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |     |     +------------------+             |    |    |
>>>>  |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>          | |  |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |   A.|                                      |    |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> +-----------------+ |  |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |   B.|                                      |    |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>             |  |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> |   C.|                                      |    |    |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>             |  |   |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | +----------------------+          +-v------+----+----+-+
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>  +--------------v--v-+ |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |                      |          |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> |           |
>>>>>>>>>>>> 
>>>>>>>>>>>>>                  | |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |   Local FS:          |          |    hdfs
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> |           |
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hive / Impala    | |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |  - Binary/Text       |          |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> |           |
>>>>>>>>>>>> 
>>>>>>>>>>>>>  - Parquet -     | |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |    Log files -       |          |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> |           |
>>>>>>>>>>>> 
>>>>>>>>>>>>>                  | |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | |                      |          |
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> |           |
>>>>>>>>>>>> 
>>>>>>>>>>>>>                  | |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> | +----------------------+          +--------------------+
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>  +-------------------+ |
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> +-----------------------------
>> ------------------------------
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------+
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Please let me know your thoughts,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> - Nathanael
> 
> 
> 
> -- 
> Michael Ridley <[email protected]>
> office: (650) 352-1337
> mobile: (571) 438-2420
> Senior Solutions Architect
> Cloudera, Inc.
