If you want to code a quick POC I will run it on our data. This sounds great, Austin.

- nathanael

On Apr 20, 2017, at 2:41 PM, Austin Leahy <[email protected]> wrote:

So this is basically why the Flume suggestion has come up. Flume natively acts as a syslog listener and will write files to basically anything (HDFS, Hive, HBase, S3).

On Thu, Apr 20, 2017 at 8:15 AM, Michael Ridley <[email protected]> wrote:

When we say ingest from Kafka, what does that mean? I understand we can read from Kafka to ingest into the cluster, but how will the data get to Kafka, and what data are we talking about? My understanding is that right now the primary data sources would be Netflow and syslog, neither of which writes to Kafka natively, so we would need something like StreamSets in the middle. Certainly a StreamSets UDP source -> Kafka would work.

Michael

On Wed, Apr 19, 2017 at 7:05 PM, kant kodali <[email protected]> wrote:

Sure. I guess Kafka has something called Kafka Connect, but it may not be as mature as Flume, since I only heard about it recently.

On Wed, Apr 19, 2017 at 3:39 PM, Austin Leahy <[email protected]> wrote:

The advantage of Flume, or a Flume/Kafka hybrid, is that the team doesn't have to build sinks for any new source types added to the project - just create configs pointing to the landing pad.

On Wed, Apr 19, 2017 at 3:31 PM, kant kodali <[email protected]> wrote:

What kind of benchmarks are we looking for? Just throughput? I am assuming this is for ingestion. I haven't seen anything faster than Kafka, and that is because of its simplicity: the publisher appends messages to a file (the so-called partition in Kafka) and clients just do sequential reads from that file, so it's a matter of disk throughput. The benchmark numbers I have for Kafka are at the very least 75K messages/sec, where each message is 1KB, on m4.xlarge, which by default has EBS storage (EBS is network-attached SSD disk). The network-attached disk has a max throughput of 125MB/s (m4.xlarge has 1 Gigabit), but if we were to deploy on ephemeral storage (local SSD) and on a 10 Gigabit network we would easily get 5-10X more.

No idea about Flume.

Finally, I'm not trying to pitch for Kafka; it is simply the fastest I have seen, but if someone has better numbers for Flume then we should use that. Also, I suspect there are benchmarks for Kafka vs Flume available online already, or we can try it with our own datasets.

Thanks!

On Wed, Apr 19, 2017 at 3:09 PM, Austin Leahy <[email protected]> wrote:

I am happy to create and test a Flume source... #intelteam would need to create the benchmark by deploying it and pointing a data source at it, since I don't have a good enough volume of source data handy.

On Wed, Apr 19, 2017 at 3:04 PM, Ross, Alan D <[email protected]> wrote:

We discussed this in our staff meeting a bit today. I would like to see some benchmarking of the different approaches (Kafka, Flume, etc.) to see what the numbers look like. Is anyone in the community willing to volunteer on this work?

-----Original Message-----
From: Austin Leahy [mailto:[email protected]]
Sent: Wednesday, April 19, 2017 1:05 PM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I think Kafka is probably a red herring. It's an industry go-to in the application world because of redundancy, but the type and volumes of network telemetry that we are talking about here will bog Kafka down unless you dedicate really serious hardware to just the Kafka implementation.
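[Editor's note: the throughput figures quoted earlier in this thread can be sanity-checked with quick arithmetic. This is a rough back-of-the-envelope using only the numbers given above (75K msgs/sec, 1 KB messages, 125 MB/s EBS cap), not an independent benchmark:]

```python
# Back-of-the-envelope check of the Kafka numbers quoted in this thread:
# 75K messages/sec at 1 KB each, against an EBS cap of 125 MB/s.
msgs_per_sec = 75_000
msg_size_bytes = 1024
ebs_cap_mb_per_sec = 125

observed_mb_per_sec = msgs_per_sec * msg_size_bytes / 1_000_000
headroom = ebs_cap_mb_per_sec / observed_mb_per_sec

print(f"observed: {observed_mb_per_sec:.1f} MB/s")   # ~76.8 MB/s
print(f"headroom vs EBS cap: {headroom:.2f}x")       # ~1.63x
```

The quoted rate is already within a factor of two of the EBS ceiling, which is consistent with the claim that local SSD plus a 10 Gigabit network would yield 5-10X more.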
It's essentially the next level of the problem that the team was already running into when RabbitMQ was queueing in data.

On Wed, Apr 19, 2017 at 12:33 PM, Mark Grover <[email protected]> wrote:

On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <[email protected]> wrote:

> Mark,
>
> Just digesting the below.
>
> Backing up in my thought process, I was thinking that the ingest master (first point of entry into the system) would want to put the data into a standard serializable format. I was thinking that libraries (such as pyarrow in this case) could help by writing the data in Parquet format early in the process. You are probably correct that at this point in time it might not be worth the time and can be kept in the backlog.
> That being said, I still think the master should produce data in a standard format. What in your opinion (and I open this up of course to others) would be the most logical format? The most basic would be to just keep it as a .csv.
>
> The master will likely write data to a staging directory in HDFS where the Spark streaming job will pick it up for normalization/writing to Parquet in the correct block sizes and partitions.

Hi Nate,
Avro is usually preferred for such a standard format, because it asserts a schema (types, etc.), which CSV doesn't, and it allows for schema evolution, which, depending on the type of evolution, CSV may or may not support. And that's something I have seen being done very commonly.
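[Editor's note: to make the schema-assertion point above concrete, here is what a minimal Avro-style schema for a flow record might look like. The field names are hypothetical, not Spot's actual schema, and only the standard library is used to show the schema document itself; real serialization would go through the avro or fastavro packages:]

```python
import json

# Hypothetical Avro schema for a flow-like record. Unlike CSV, Avro pins
# down field names and types, and schema evolution works by adding new
# fields with defaults, so old data stays readable under the new schema.
flow_schema = {
    "type": "record",
    "name": "FlowRecord",
    "fields": [
        {"name": "ts", "type": "long"},
        {"name": "src_ip", "type": "string"},
        {"name": "dst_ip", "type": "string"},
        {"name": "bytes", "type": "long"},
        # Added in a later schema version without breaking old records,
        # thanks to the default value:
        {"name": "protocol", "type": "string", "default": "unknown"},
    ],
}

schema_json = json.dumps(flow_schema)
print(json.loads(schema_json)["fields"][-1]["default"])  # unknown
```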
Now, if the data were in Kafka before it gets to the master, one could argue that the master could just send metadata to the workers (topic name, partition number, offset start and end) and the workers could read from Kafka directly. I do understand that'd be a much different architecture than the current one, but if you think it's a good idea too, we could document that, say in a JIRA, and (de-)prioritize it (and in line with the rest of the discussion on this thread, it's not the top-most priority).
Thoughts?

> - Nathanael

On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:

Thanks, all, for your opinions.

I think it's good to consider two things:
1. What do (we think) users care about?
2. What's the cost of changing things?

About #1, I think users care more about what format data is written in than how the data is written. I'd argue whether that uses Hive, MR, or a custom Parquet writer is not as important to them as long as we maintain data/format compatibility.
About #2, having worked on several projects, I find that it's rather difficult to keep up with Parquet. Even in Spark, there are a few different ways to write to Parquet - there's a regular mode, and a legacy mode <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44> which continues to cause confusion <https://issues.apache.org/jira/browse/SPARK-20297> to this date. Parquet itself is pretty dependent on Hadoop <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>, and just integrating it with systems with a lot of developers (like Spark <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>) is still a lot of work.

I personally think we should leverage higher-level tools like Hive or Spark to write data in widespread formats (Parquet being a very good example), but I personally wouldn't encourage us to manage the writers ourselves.

Thoughts?
Mark

On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote:

Without having given it too terribly much thought, that seems like an OK approach.

Michael

On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:

I think the question is whether we can write the data generically to HDFS as Parquet without the use of Hive/Impala.

Today we write Parquet data using the Hive/MapReduce method. As part of the redesign I'd like to use libraries for this as opposed to a Hadoop dependency. I think it would be preferred to use the Python master to write the data into the format we want, then do normalization of the data in Spark streaming.
Any thoughts?

- Nathanael

On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:

I had thought that the plan was to write the data in Parquet in HDFS ultimately.

Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

Hi Mark,

Thank you so much for hearing my argument. And I definitely understand that you guys have a bunch of things to do. My only concern is that I hope it doesn't take too long to support other backends. For example, @Kenneth had given the example of the LAMP stack not having moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from the Python multiprocessing or Spark Streaming are written back to HDFS. If so, can we write them in Parquet format, such that users would be able to plug in any query engine? Again, I am not pushing you guys to do this right away or anything, just seeing if there is a way for me to get started in parallel; if it's not feasible, that's fine. I just wanted to share my 2 cents, and I am glad my argument is heard!

Thanks much!
On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:

Hi Kant,
Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.

The design can and should be pluggable, but the project has one stack it ships out of the box with - one stack that's the default in the sense that it's the most tested and so on. And, for us, that's our current stack. If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable to run Hive on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came and re-used some of that pluggability and even added some more so Hive-on-Spark could become a reality. In the same way, I don't think anyone here disagrees that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use-case in mind.

As a project, we have many things to do, and I personally think the biggest bang for the buck for us in making Spot a really solid, best-in-class cyber security solution isn't pluggability but the things we are working on - a better user interface, a common/unified approach to storing and modeling data, etc.

Having said that, we are open: if it's important to you or someone else, we'd be happy to receive and review those patches.

Thanks!
Mark

On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:

Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.

On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

Option C is to use Python on the front end of the ingest pipeline and Spark/Scala on the back end.

Option A uses Python workers on the backend.

Option B uses all Scala.
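[Editor's note: to make the recurring "normalization" step concrete, here is a hypothetical sketch of what the back end of these options does with a raw record. Under option C this logic would live in the Scala/Spark streaming job; Python is used here only to keep the sketch short, and the field names are illustrative, not Spot's actual schema:]

```python
# Hypothetical normalization step: turn a raw comma-separated flow line
# into a typed record the downstream store can partition and query.
def normalize(raw_line: str) -> dict:
    ts, src_ip, dst_ip, nbytes = raw_line.strip().split(",")
    return {
        "ts": int(ts),
        "src_ip": src_ip,
        "dst_ip": dst_ip,
        "bytes": int(nbytes),
    }

record = normalize("1492473600,10.0.0.1,10.0.0.9,1024")
print(record["bytes"])  # 1024
```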
-----Original Message-----
From: kant kodali [mailto:[email protected]]
Sent: Friday, April 14, 2017 9:53 AM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:

+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:

I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption. Doing it in Python will open up the ingest portion of the project to include many more developers.

Before it comes up, I would like to throw the following on the pile... Major Python projects (Django, Flask, and others) are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern Python support, so let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.

-Vote C

Thanks Nate

Austin

On Fri, Apr 14, 2017 at 8:52 AM, Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest (Python vs Scala) but still has the robust Spark streaming backend for performance.

Thanks for putting this together, Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:

I agree. We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.

Chokha.

On 4/14/17 11:10 AM, [email protected] wrote:

Hi Kant,

YARN is the standard scheduler in Hadoop. If you're using Hive+Spark, then sure, you'll have YARN.

Haven't seen any Hive on Mesos so far.
As said, Spot is based on a quite standard Hadoop stack and I wouldn't switch too many pieces yet.

In most open-source projects you start relying on a well-known stack and then you begin to support other DB backends once it's quite mature. Think of the loads of LAMP apps which haven't been ported away from MySQL yet.

In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and, at the moment, that can only be provided by Hadoop.

Regards!

Kenneth

On 2017-04-14 at 12:56, kant kodali wrote:

Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with whatever users may want to use, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else like Drill or Presto?

Me personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least I would say it is an uphill battle. Things have been changing rapidly in the Big Data space, so whatever we think is standard won't be standard anymore, but more importantly there shouldn't be any reason why we shouldn't be flexible, right?

Also, I am not sure why only YARN? Why not make that also more flexible so users can pick Mesos or standalone?

I think flexibility is key for wide adoption, rather than a tightly coupled architecture.

Thanks!

On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:

PS: You need a big data platform to be able to collect all those netflows and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.

Regards!

Kenneth

Sent from my Mi phone

On Apr 14, 2017, 4:04 AM, kant kodali <[email protected]> wrote:

Hi,

Thanks for starting this thread. Here is my feedback.
I somehow think the architecture is too complicated for wide adoption, since it requires installing the following:

HDFS
Hive
Impala
Kafka
Spark (on YARN)
YARN
ZooKeeper

Currently there are way too many dependencies, which discourages a lot of users from using it, because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why both Hive and Impala are required. Why not just use Spark SQL, since it's already a dependency? Or users may want to use their own distributed query engine, such as Apache Drill or something else; we should be flexible enough to provide that option.

Also, I see that HDFS is used such that collectors can receive file paths through Kafka and then read the files. How big are these files? Do we really need HDFS for this? Why not provide more ways to send data, such as sending data directly through Kafka, or just leaving it up to the user to specify the file location as an argument to the collector process?

Finally, I learned that to generate Netflow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.

The real strength of Apache Spot should mainly be just analyzing network traffic through ML.

Thanks!
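[Editor's note: the file-pointer pattern being questioned above can be sketched without any Hadoop pieces. A stdlib queue stands in for the Kafka topic, and all names here are illustrative, not Spot's actual collector code:]

```python
# Minimal stand-in for the "collectors receive file paths through Kafka"
# pattern: the queue carries small path messages instead of the data itself,
# and the consumer opens the file wherever it landed.
import os
import queue
import tempfile

topic = queue.Queue()  # plays the role of the Kafka topic

# Producer side: a capture process lands a file and publishes its path.
landing_dir = tempfile.mkdtemp()
flow_file = os.path.join(landing_dir, "capture-0001.csv")
with open(flow_file, "w") as f:
    f.write("10.0.0.1,10.0.0.9,1024\n")
topic.put(flow_file)

# Consumer side: the collector receives only the path and reads the file.
path = topic.get()
with open(path) as f:
    records = f.read().splitlines()

print(len(records))  # 1
```

The point of the pattern is that the broker only moves small pointers, so the heavy data never transits the queue; the trade-off is that producer and consumer must share a filesystem, which is what HDFS provides here.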
On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:

Thanks, Nate.

Nate

-----Original Message-----
From: Nate Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:26 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I was really hoping it came through OK. Oh well :) Here it is in image form: http://imgur.com/a/DUDsD

On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:

The diagram became garbled in the text format. Could you resend it as a PDF?

Thanks,
Nate

-----Original Message-----
From: Nathanael Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:01 PM
To: [email protected]; [email protected]; [email protected]
Subject: [Discuss] - Future plans for Spot-ingest

How would you like to see Spot-ingest change?

A. Continue development on the Python Master/Worker with a focus on performance / error handling / logging
B. Develop a Scala-based ingest to be in line with the code base from ingest and ML to OA (UI to continue being IPython/JS)
C. Python ingest worker with Scala-based Spark code for normalization and input into the DB

Including the high-level diagram. [The ASCII diagram was garbled beyond repair in transit; an image version is at http://imgur.com/a/DUDsD. In summary: a Master (A: Python, B: Scala, C: Python) feeds Workers (A: Python, B/C: Scala) running as Spark Streaming jobs on worker nodes in the Hadoop cluster; the workers read binary/text log files from the local FS and write Parquet to HDFS for Hive/Impala.]
+-------------------+ | >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> +----------------------------- >>>>>>>>>>>>>> ------------------------------ >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -------------------------------+ >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Please let me know your thoughts, >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> - Nathanael >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Michael Ridley <[email protected]> >>>>>>>>>>>>> office: (650) 352-1337 >>>>>>>>>>>>> mobile: (571) 438-2420 >>>>>>>>>>>>> Senior Solutions Architect >>>>>>>>>>>>> Cloudera, Inc. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Michael Ridley <[email protected]> >>>>>>>>>>> office: (650) 352-1337 >>>>>>>>>>> mobile: (571) 438-2420 >>>>>>>>>>> Senior Solutions Architect >>>>>>>>>>> Cloudera, Inc. >>>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >> >> >> -- >> Michael Ridley <[email protected]> >> office: (650) 352-1337 >> mobile: (571) 438-2420 >> Senior Solutions Architect >> Cloudera, Inc. >>
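Whichever option wins, the Worker's job is the same: normalize each raw record into the flat columns that land in Hive/Impala-backed Parquet. A minimal Python sketch of that step, using a syslog line as the example (the field names and regex here are hypothetical illustrations, not Spot's actual schema or parser):

```python
import re

# Hypothetical normalizer: turns a raw RFC 3164-style syslog line into
# the flat column layout a Worker might hand off for insertion into Hive.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d+)>"                  # priority value
    r"(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s"  # timestamp, e.g. "Apr 20 14:41:00"
    r"(?P<host>\S+)\s"                  # originating hostname
    r"(?P<msg>.*)$"                     # free-form message body
)

def normalize(line: str) -> dict:
    """Parse one syslog line into a flat row, or raise on garbage input."""
    m = SYSLOG_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable syslog line: {line!r}")
    pri = int(m.group("pri"))
    return {
        "facility": pri // 8,   # RFC 3164: PRI = facility * 8 + severity
        "severity": pri % 8,
        "timestamp": m.group("ts"),
        "host": m.group("host"),
        "message": m.group("msg"),
    }

row = normalize("<34>Apr 20 14:41:00 gw01 sshd[1234]: Failed password for root")
```

Under option C this function (or its Scala equivalent inside the Spark job) is where the Python/Scala boundary would sit: the Python Worker moves files, the Spark code applies the normalization per record before writing Parquet.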
