I think the question is rather: can we write the data generically to HDFS as Parquet without the use of Hive/Impala?
Today we write Parquet data using the Hive/MapReduce method. As part of the redesign I'd like to use libraries for this, as opposed to a Hadoop dependency. I think it would be preferable to use the Python master to write the data into the format we want, then do normalization of the data in Spark Streaming. Any thoughts?

- Nathanael

> On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:
>
> I had thought that the plan was ultimately to write the data in Parquet in HDFS.
>
> Michael
>
> On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:
>
>> Hi Mark,
>>
>> Thank you so much for hearing my argument. And I definitely understand that you guys have a bunch of things to do. My only concern is that I hope it doesn't take too long to support other backends. For example, @Kenneth gave the example of the LAMP stack not having moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from Python multiprocessing or Spark Streaming are written back to HDFS. If so, can we write them in Parquet format, such that users can plug in any query engine? Again, I am not pushing you guys to do this right away or anything; I'm just seeing if there's a way for me to get started in parallel. If it's not feasible, that's fine. I just wanted to share my two cents, and I am glad my argument is heard!
>>
>> Thanks much!
>>
>> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:
>>
>>> Hi Kant,
>>> Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.
>>>
>>> The design can and should be pluggable, but the project has one stack it ships with out of the box, one stack that's the default in the sense that it's the most tested, and so on. And, for us, that's our current stack.
>>> If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable to run Hive on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came and re-used some of that pluggability and even added some more so Hive-on-Spark could become a reality. In the same way, I don't think anyone here disagrees that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use case in mind.
>>>
>>> As a project, we have many things to do, and I personally think the biggest bang for the buck for us in making Spot a really solid, best-in-class cyber security solution isn't pluggability but the things we are working on: a better user interface, a common/unified approach to storing and modeling data, etc.
>>>
>>> Having said that, we are open; if it's important to you or someone else, we'd be happy to receive and review those patches.
>>>
>>> Thanks!
>>> Mark
>>>
>>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:
>>>
>>>> Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.
>>>>
>>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:
>>>>
>>>>> Option C is to use Python on the front end of the ingest pipeline and Spark/Scala on the back end.
>>>>>
>>>>> Option A uses Python workers on the back end.
>>>>>
>>>>> Option B uses all Scala.
>>>>>
>>>>> -----Original Message-----
>>>>> From: kant kodali [mailto:[email protected]]
>>>>> Sent: Friday, April 14, 2017 9:53 AM
>>>>> To: [email protected]
>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>
>>>>> What is option C?
>>>>> Am I missing an email or something?
>>>>>
>>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:
>>>>>
>>>>>> +1 for Python 3.x
>>>>>>
>>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote:
>>>>>>
>>>>>>> I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption. Doing it in Python will open up the ingest portion of the project to many more developers.
>>>>>>>
>>>>>>> Before it comes up, I would like to throw the following on the pile... Major Python projects (Django/Flask and others) are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern Python support, so let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.
>>>>>>>
>>>>>>> -Vote C
>>>>>>>
>>>>>>> Thanks Nate
>>>>>>>
>>>>>>> Austin
>>>>>>>
>>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:
>>>>>>>
>>>>>>>> I really like option C because it gives a lot of flexibility for ingest (Python vs. Scala) but still has the robust Spark Streaming backend for performance.
>>>>>>>>
>>>>>>>> Thanks for putting this together, Nate.
>>>>>>>>
>>>>>>>> Alan
>>>>>>>>
>>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I agree. We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.
>>>>>>>>>
>>>>>>>>> Chokha.
>>>>>>>>>
>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote:
>>>>>>>>>
>>>>>>>>>> Hi Kant,
>>>>>>>>>>
>>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using Hive+Spark, then sure, you'll have YARN.
>>>>>>>>>> I haven't seen any Hive on Mesos so far. As I said, Spot is based on a quite standard Hadoop stack, and I wouldn't switch too many pieces yet.
>>>>>>>>>>
>>>>>>>>>> In most open-source projects you start by relying on a well-known stack, and then you begin to support other DB backends once it's quite mature. Think of the loads of LAMP apps which haven't been ported away from MySQL yet.
>>>>>>>>>>
>>>>>>>>>> In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and... at the moment, that can only be provided by Hadoop.
>>>>>>>>>>
>>>>>>>>>> Regards!
>>>>>>>>>>
>>>>>>>>>> Kenneth
>>>>>>>>>>
>>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Kenneth,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.
>>>>>>>>>>>
>>>>>>>>>>> The distributed SQL query engine, I would say, should be pluggable with whatever the user may want to use, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else like Drill or Presto?
>>>>>>>>>>>
>>>>>>>>>>> Personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated. At the very least, I would say it is an uphill battle.
>>>>>>>>>>> Things have been changing rapidly in the big data space, so whatever we think is standard won't be standard anymore; but more importantly, there shouldn't be any reason why we shouldn't be flexible, right?
>>>>>>>>>>>
>>>>>>>>>>> Also, I am not sure why only YARN? Why not make that more flexible too, so users can pick Mesos or standalone?
>>>>>>>>>>>
>>>>>>>>>>> I think flexibility is key for wide adoption, rather than a tightly coupled architecture.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> PS: you need a big data platform to be able to collect all those netflows and logs.
>>>>>>>>>>>>
>>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards!
>>>>>>>>>>>>
>>>>>>>>>>>> Kenneth
>>>>>>>>>>>>
>>>>>>>>>>>> Sent from my Mi phone
>>>>>>>>>>>> On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for starting this thread. Here is my feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> I somehow think the architecture is too complicated for wide adoption, since it requires installing the following:
>>>>>>>>>>>>
>>>>>>>>>>>> HDFS
>>>>>>>>>>>> Hive
>>>>>>>>>>>> Impala
>>>>>>>>>>>> Kafka
>>>>>>>>>>>> Spark (YARN)
>>>>>>>>>>>> YARN
>>>>>>>>>>>> ZooKeeper
>>>>>>>>>>>> Currently there are way too many dependencies, which discourages a lot of users from using it, because they have to go through deploying all of that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why both Hive and Impala are required. Why not just use Spark SQL, since it's already a dependency? Or users may want to use their own distributed query engine, such as Apache Drill or something else; we should be flexible enough to provide that option.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, I see that HDFS is used so that collectors can receive file paths through Kafka and be able to read a file. How big are these files? Do we really need HDFS for this? Why not provide more ways to send data, such as sending data directly through Kafka, or just leaving it up to the user to specify the file location as an argument to the collector process?
>>>>>>>>>>>>
>>>>>>>>>>>> Finally, I learned that to generate NetFlow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.
>>>>>>>>>>>>
>>>>>>>>>>>> The real strength of Apache Spot should mainly be just analyzing network traffic through ML.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
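[The "send data directly through Kafka" alternative raised above could look roughly like the sketch below. This is a minimal illustration, not Spot code: it assumes the kafka-python client and a JSON message convention, and the topic name, server address, and field names are made up for the example.]

```python
import json

def handle_message(value: bytes) -> dict:
    """Decode one ingest message from Kafka.

    Two illustrative conventions: a message carrying a file path the
    collector should go read (today's HDFS-pointer style), or the event
    data itself riding inline in Kafka (the proposed alternative).
    """
    msg = json.loads(value)
    if "path" in msg:
        return {"kind": "file", "path": msg["path"]}
    return {"kind": "inline", "event": msg}

def run_collector(topic="spot-ingest", servers="localhost:9092"):
    """Consumer loop; needs a running broker and `pip install kafka-python`."""
    from kafka import KafkaConsumer  # imported lazily; deployment detail
    consumer = KafkaConsumer(topic, bootstrap_servers=servers)
    for record in consumer:
        work = handle_message(record.value)
        print(work["kind"], work.get("path", ""))
```

[Either convention keeps the collector decoupled from HDFS: the broker, not the filesystem, becomes the contract between producer and collector.]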
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Nate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nate.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]]
>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM
>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>> Cc: [email protected]; [email protected]
>>>>>>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was really hoping it came through OK. Oh well :) Here's an image form: http://imgur.com/a/DUDsD
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The diagram became garbled in the text format. Could you resend it as a PDF?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Nate
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]]
>>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM
>>>>>>>>>>>>>> To: [email protected]; [email protected]; [email protected]
>>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> How would you like to see Spot-ingest change?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A. Continue development on the Python Master/Worker with a focus on performance / error handling / logging.
>>>>>>>>>>>>>> B. Develop a Scala-based ingest to be in line with the code base from ingest and ML to OA (the UI to continue being IPython/JS).
>>>>>>>>>>>>>> C. Python ingest Worker with Scala-based Spark code for normalization and input into the DB.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Including the high-level diagram:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [ASCII diagram garbled in transit; an image version is at http://imgur.com/a/DUDsD. Its recoverable structure: a Master (A. Python / B. Scala / C. Python), running on a worker node in the Hadoop cluster, feeds a Worker (A. Python) or Spark Streaming (B. / C. Scala); the sinks are the local FS (binary/text log files), HDFS (Parquet), and Hive / Impala.]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please let me know your thoughts,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Nathanael
>
> --
> Michael Ridley <[email protected]>
> office: (650) 352-1337
> mobile: (571) 438-2420
> Senior Solutions Architect
> Cloudera, Inc.
