@Michael Ridley There are a few ways to do this:

1. There is a FileSource connector in Kafka Connect:
http://docs.confluent.io/3.1.0/connect/connect-filestream/filestream_connector.html#filesource-connector

2. Any file-listener API in any language, combined with a Kafka producer, would also work:
https://github.com/seb-m/pyinotify/wiki
https://github.com/seb-m/pyinotify/wiki/List-of-Examples

On Thu, Apr 20, 2017 at 7:49 PM, Smith, Nathanael P <[email protected]> wrote:

If you want to code a quick POC I will run it on our data. This sounds great, Austin.

- nathanael

On Apr 20, 2017, at 2:41 PM, Austin Leahy <[email protected]> wrote:

So this is basically why the Flume suggestion has come up. Flume natively acts as a syslog listener and will write files to basically anything (HDFS, Hive, HBase, S3).

On Thu, Apr 20, 2017 at 8:15 AM, Michael Ridley <[email protected]> wrote:

When we say ingest from Kafka, what does that mean? I understand we can read from Kafka to ingest into the cluster, but how will the data get to Kafka, and what data are we talking about? My understanding is that right now the primary data sources would be Netflow and syslog, neither of which writes to Kafka natively, so we would need something like StreamSets in the middle. Certainly a StreamSets UDP source -> Kafka would work.

Michael

On Wed, Apr 19, 2017 at 7:05 PM, kant kodali <[email protected]> wrote:

Sure. I guess Kafka has something called Kafka Connect, but it may not be as mature as Flume, since I only heard about it recently.

On Wed, Apr 19, 2017 at 3:39 PM, Austin Leahy <[email protected]> wrote:

The advantage of Flume, or a Flume/Kafka hybrid, is that the team doesn't have to build sinks for any new source types added to the project; you just create configs pointing to the landing pad.

On Wed, Apr 19, 2017 at 3:31 PM, kant kodali <[email protected]> wrote:

What kind of benchmarks are we looking for?
Just throughput? Since I am assuming this is for ingestion. I haven't seen anything faster than Kafka, and that is because of its simplicity: the publisher appends messages to a file (the so-called partition in Kafka) and clients just do sequential reads from that file, so it's a matter of disk throughput. The benchmark numbers I have for Kafka are at the very least 75K messages/sec, where each message is 1 KB, on an m4.xlarge, which by default has EBS storage (EBS is a network-attached SSD disk). The network-attached disk has a max throughput of 125 MB/s (m4.xlarge has 1 Gigabit), but if we were to deploy it on ephemeral storage (local SSD) and on a 10 Gigabit network we would easily get 5-10X more.

No idea about Flume.

Finally, I am not trying to pitch for Kafka; it is simply the fastest I have seen, but if someone has better numbers for Flume then we should use that. Also, I suspect benchmarks for Kafka vs. Flume are available online already, or we can try it with our own datasets.

Thanks!

On Wed, Apr 19, 2017 at 3:09 PM, Austin Leahy <[email protected]> wrote:

I am happy to create and test a Flume source... #intelteam would need to create the benchmark by deploying it and pointing a data source at it... since I don't have a good enough volume of source data handy.

On Wed, Apr 19, 2017 at 3:04 PM, Ross, Alan D <[email protected]> wrote:

We discussed this in our staff meeting a bit today. I would like to see some benchmarking of different approaches (Kafka, Flume, etc.) to see what the numbers look like. Is anyone in the community willing to volunteer for this work?
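As a sanity check on the figures above: 75K messages/sec at 1 KB each is roughly 77 MB/s, comfortably under the quoted ~125 MB/s EBS ceiling, which is consistent with the claim that ingest is bounded by disk/network throughput. A quick back-of-the-envelope in Python (all numbers taken from the message above):

```python
# Sanity-check the Kafka throughput figures quoted in the thread.
msgs_per_sec = 75_000        # "at very least 75K messages/sec"
msg_size_bytes = 1024        # 1 KB per message

throughput_mb_s = msgs_per_sec * msg_size_bytes / 1e6  # decimal MB/s
ebs_cap_mb_s = 125.0         # quoted EBS cap on m4.xlarge (~1 Gigabit)

print(f"observed: {throughput_mb_s:.1f} MB/s of a {ebs_cap_mb_s:.0f} MB/s cap")
# => 76.8 MB/s, i.e. the workload sits under the disk/network ceiling.

# On local SSD with a 10 Gigabit network the ceiling is ~10x higher,
# which is where the "easily get 5-10X more" estimate comes from.
headroom = ebs_cap_mb_s * 10 / throughput_mb_s
```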
-----Original Message-----
From: Austin Leahy [mailto:[email protected]]
Sent: Wednesday, April 19, 2017 1:05 PM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I think Kafka is probably a red herring. It's an industry go-to in the application world because of redundancy, but the type and volume of network telemetry that we are talking about here will bog Kafka down unless you dedicate really serious hardware to just the Kafka implementation. It's essentially the next level of the problem that the team was already running into when RabbitMQ was queueing in data.

On Wed, Apr 19, 2017 at 12:33 PM, Mark Grover <[email protected]> wrote:

On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <[email protected]> wrote:

Mark,

Just digesting the below.

Backing up in my thought process, I was thinking that the ingest master (the first point of entry into the system) would want to put the data into a standard serializable format. I was thinking that libraries (such as pyarrow in this case) could help by writing the data in Parquet format early in the process. You are probably correct that at this point in time it might not be worth the time, and it can be kept in the backlog.
That being said, I still think the master should produce data in a standard format. What, in your opinion (and I open this up of course to others), would be the most logical format? The most basic would be to just keep it as a .csv.
The master will likely write data to a staging directory in HDFS where the Spark Streaming job will pick it up for normalization/writing to Parquet in the correct block sizes and partitions.

Hi Nate,
Avro is usually preferred for such a standard format, because it asserts a schema (types, etc.), which CSV doesn't, and it allows for schema evolution, which, depending on the type of evolution, CSV may or may not support. And that's something I have seen being done very commonly.

Now, if the data were in Kafka before it gets to the master, one could argue that the master could just send metadata to the workers (topic name, partition number, offset start and end) and the workers could read from Kafka directly. I do understand that'd be a much different architecture than the current one, but if you think it's a good idea too, we could document that, say in a JIRA, and (de-)prioritize it (and in line with the rest of the discussion on this thread, it's not the top-most priority).
Thoughts?

- Nathanael

On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:

Thanks, all, for your opinions.

I think it's good to consider two things:
1. What do (we think) users care about?
2. What's the cost of changing things?

About #1, I think users care more about what format the data is written in than how the data is written.
I'd argue whether that uses Hive, MR, or a custom Parquet writer is not as important to them, as long as we maintain data/format compatibility.
About #2, having worked on several projects, I find that it's rather difficult to keep up with Parquet. Even in Spark, there are a few different ways to write to Parquet - there's a regular mode and a legacy mode
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44>
which continues to cause confusion
<https://issues.apache.org/jira/browse/SPARK-20297>
to date. Parquet itself is pretty dependent on Hadoop
<https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93>
and just integrating it with systems with a lot of developers (like Spark
<https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>)
is still a lot of work.

I personally think we should leverage higher-level tools like Hive or Spark to write data in widespread formats (Parquet being a very good example), but I personally wouldn't encourage us to manage the writers ourselves.

Thoughts?
Mark

On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote:

Without having given it too terribly much thought, that seems like an OK approach.
Michael

On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:

I think the question is whether we can write the data generically to HDFS as Parquet without the use of Hive/Impala.

Today we write Parquet data using the Hive/MapReduce method.
As part of the redesign I'd like to use libraries for this as opposed to a Hadoop dependency.
I think it would be preferred to use the Python master to write the data into the format we want, then do normalization of the data in Spark Streaming.
Any thoughts?

- Nathanael

On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:

I had thought that the plan was to write the data in Parquet in HDFS ultimately.

Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

Hi Mark,

Thank you so much for hearing my argument. And I definitely understand that you guys have a bunch of things to do. My only concern is that I hope it doesn't take too long to support other backends.
For example, @Kenneth had given the example of the LAMP stack not having moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from the Python multiprocessing or Spark Streaming paths are written back to HDFS. If so, can we write them in Parquet format, such that users are able to plug in any query engine? Again, I am not pushing you guys to do this right away or anything, just seeing if there is a way for me to get started in parallel; if it's not feasible, that's fine. I just wanted to share my 2 cents, and I am glad my argument is heard!

Thanks much!

On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:

Hi Kant,
Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.

The design can and should be pluggable, but the project has one stack it ships out of the box with, one stack that's the default in the sense that it's the most tested and so on. And, for us, that's our current stack.
If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable to run Hive on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came and re-used some of that pluggability and even added some more so Hive-on-Spark could become a reality. In the same way, I don't think anyone disagrees here that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use case in mind.

As a project, we have many things to do, and I personally think the biggest bang for the buck for us in making Spot a really solid and the best cyber security solution isn't pluggability but the things we are working on - a better user interface, a common/unified approach to storing and modeling data, etc.

Having said that, we are open; if it's important to you or someone else, we'd be happy to receive and review those patches.

Thanks!
Mark

On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:

Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.

On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

Option C is to use Python on the front end of the ingest pipeline and Spark/Scala on the back end.

Option A uses Python workers on the backend.

Option B uses all Scala.

-----Original Message-----
From: kant kodali [mailto:[email protected]]
Sent: Friday, April 14, 2017 9:53 AM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:

+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:

I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption.
Doing it in Python will open up the ingest portion of the project to include many more developers.

Before it comes up I would like to throw the following on the pile... Major Python projects, Django and Flask among others, are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern Python support; let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.

-Vote C

Thanks Nate

Austin

On Fri, Apr 14, 2017 at 8:52 AM, Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest (Python vs. Scala) but still has the robust Spark Streaming backend for performance.

Thanks for putting this together, Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:

I agree.
We should continue making the existing > >>> stack > >>>>>>>>>>>>>>>>>>>> more > >>>>>>>>>>> mature > >>>>>>>>>>>>>>> at > >>>>>>>>>>>>>>>>>>>>> this point. Maybe if we have enough community > >>>> support > >>>>>>>>>>>>>>>>>>>>> we can > >>>>>>>>>>> add > >>>>>>>>>>>>>>>>>>>>> additional datastores. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Chokha. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Hi Kant, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> YARN is the standard scheduler in Hadoop. If > >>> you're > >>>>>>>>>>>>>>>>>>>>>> using > >>>>>>>>>>>>>>>>>>>>>> Hive+Spark, then sure you'll have YARN. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Haven't seen any HIVE on Mesos so far. As said, > >>>> Spot > >>>>>>>>>>>>>>>>>>>>>> is > >>>>>>>> based > >>>>>>>>>>>>>> on > >>>>>>>>>>>>>>> a > >>>>>>>>>>>>>>>>>>>>>> quite standard Hadoop stack and I wouldn't > >> switch > >>>> too > >>>>>>>>>>>>>>>>>>>>>> many > >>>>>>>>>>>>>> pieces > >>>>>>>>>>>>>>>>> yet. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> In most Opensource projects you start relying > >> on > >>> a > >>>>>>>> well-known > >>>>>>>>>>>>>>>>>>>>>> stack and then you begin to support other DB > >>>> backends > >>>>>>>>>>>>>>>>>>>>>> once > >>>>>>>>>>> it's > >>>>>>>>>>>>>>>>>>>>>> quite mature. Think in the loads of LAMP apps > >>> which > >>>>>>>>>>>>>>>>>>>>>> haven't > >>>>>>>>>>>>>> been > >>>>>>>>>>>>>>>>>>>>>> ported away from MySQL yet. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> In any case, you'll need a high performance > >> SQL + > >>>>>>>>>>>>>>>>>>>>>> Massive > >>>>>>>>>>>>>> Storage > >>>>>>>>>>>>>>>>>>>>>> + Machine Learning + Massive Ingestion, and... > >>> ATM, > >>>>>>>>>>>>>>>>>>>>>> + that > >>>>>>>> can > >>>>>>>>>>> be > >>>>>>>>>>>>>>>>>>>>>> only provided by Hadoop. 
Regards!

Kenneth

On 2017-04-14 12:56, kant kodali wrote:

Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with whatever the user may want to use, and there are a bunch of them out there.

Sure, Impala is better than Hive, but what if users are already using something else like Drill or Presto?
Me personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least I would say it is an uphill battle. Things have been changing rapidly in the big data space, so whatever we think is standard won't be standard anymore, but more importantly, there shouldn't be any reason why we shouldn't be flexible, right?

Also, I am not sure why only YARN? Why not make that more flexible too, so users can pick Mesos or standalone?

I think flexibility is key for wide adoption, rather than a tightly coupled architecture.

Thanks!
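To make the pluggability ask concrete: what is being requested is essentially a narrow interface between Spot's analytics and the query engine, so a backend can be swapped without touching calling code. A minimal sketch of that seam; all class and method names here are hypothetical illustration, not Spot code:

```python
# Hypothetical sketch of a pluggable query-engine seam; none of these
# names come from Apache Spot itself.
from abc import ABC, abstractmethod


class QueryEngine(ABC):
    """Anything that can run SQL over the ingested telemetry."""

    @abstractmethod
    def execute(self, sql: str) -> list:
        ...


class ImpalaEngine(QueryEngine):
    def execute(self, sql: str) -> list:
        # Real code would go through an Impala client here.
        return [f"impala: {sql}"]


class PrestoEngine(QueryEngine):
    def execute(self, sql: str) -> list:
        # Swapping engines requires no change to calling code.
        return [f"presto: {sql}"]


def suspicious_connections(engine: QueryEngine) -> list:
    # Analytics code depends only on the interface, never on a backend.
    return engine.execute("SELECT * FROM flows WHERE score < 0.01")
```

With this shape, supporting Drill, Presto, or Spark SQL becomes a matter of adding one adapter class, the same pattern that let Hive swap MapReduce for Tez and Spark.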
On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:

PS: you need a big data platform to be able to collect all those netflows and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.

Regards!

Kenneth

Sent from my Mi phone

On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:

Hi,

Thanks for starting this thread. Here is my feedback.

I somehow think the architecture is too complicated for wide adoption since it requires installing the following:

HDFS.
HIVE.
IMPALA.
KAFKA.
SPARK (YARN).
YARN.
Zookeeper.

Currently there are way too many dependencies, which discourages a lot of users from using it, because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why both Hive and Impala are required. Why not just use Spark SQL, since it's already a dependency? Or, say, users may want to use their own distributed query engine, such as Apache Drill or something else; we should be flexible enough to provide that option.

Also, I see that HDFS is used such that collectors can receive file paths through Kafka and be able to read a file. How big are these files? Do we really need HDFS for this?
Why not provide more ways to send data, such as sending data directly through Kafka, or, say, just leaving it up to the user to specify the file location as an argument to the collector process?

Finally, I learnt that to generate netflow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.

The real strength of Apache Spot should mainly be just analyzing network traffic through ML.

Thanks!
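On the file-paths-through-Kafka point: the trade-off is between shipping a tiny path notification (which requires a shared filesystem such as HDFS so consumers can open the file) and shipping the payload itself (no shared filesystem needed, but the queue then carries the full data volume). A stdlib-only sketch of the two message shapes; the field names and example path are made up for illustration, and in practice the encoded bytes would be handed to a Kafka producer:

```python
import json
import os


def path_notification(path: str) -> bytes:
    """Tiny message: consumers must share a filesystem (e.g. HDFS) to read it."""
    note = {"kind": "path", "path": path}
    return json.dumps(note).encode("utf-8")


def payload_message(path: str, data: bytes) -> bytes:
    """Self-contained message: no shared FS needed, but the queue moves all bytes."""
    note = {"kind": "payload", "name": os.path.basename(path), "data": data.hex()}
    return json.dumps(note).encode("utf-8")


# A path notification stays a few dozen bytes no matter how large the capture
# is, while the payload variant grows with the capture itself.
small = path_notification("/staging/nfcapd.201704140800")
big = payload_message("/staging/nfcapd.201704140800", b"\x00" * 4096)
```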
On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:

Thanks, Nate.

Nate.

-----Original Message-----
From: Nate Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:26 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I was really hoping it came through OK. Oh well :) Here it is in image form:
http://imgur.com/a/DUDsD

On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:

The diagram became garbled in the text format. Could you resend it as a pdf?
Thanks,
Nate

-----Original Message-----
From: Nathanael Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:01 PM
To: [email protected]; [email protected]; [email protected]
Subject: [Discuss] - Future plans for Spot-ingest

How would you like to see Spot-ingest change?

A. Continue development on the Python Master/Worker with focus on performance / error handling / logging
B. Develop Scala-based ingest to be in line with the code base from ingest, ml, to OA (UI to continue being IPython/JS)
C.
Python ingest Worker with > >> Scala > >>>>>>>>>>>>>>>>>>>>>>>>> based > >>>>>>>> Spark > >>>>>>>>>>>>>>> code > >>>>>>>>>>>>>>>>>>>>>>>>> for normalization and input into DB > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> Including the high level diagram: > >>>>>>>>>>>>>>>>>>>>>>>>>> +----------------------------- > >>>>>>>>>>>>>> ------------------------------ > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> -------------------------------+ > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | +--------------------------+ > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +-----------------+ | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | Master | A. B. C. > >>>>>>>>>>>>>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Worker | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | A. Python > >>> +---------------+ > >>>>>>> A. > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | A. > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Python | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | B. Scala | > >>> | > >>>>>>>>>>>>>>>> +-------------> > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +----+ | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | C. 
Python | > >>> | > >>>>> | > >>>>>>>>>>>>>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | +---^------+---------------+ > >>> | > >>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +-----------------+ | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | | > >>> | > >>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | | > >>> | > >>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | +Note--------------+ > >>> | > >>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +-----------------+ | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | |Running on a | > >>> | > >>>>> | > >>>>>>>>>>>>>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Spark > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Streaming | | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | |worker node in | > >>> | > >>>>> | > >>>>>>>>>>> B. > >>>>>>>>>>>>>>> C. > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | B. > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Scala | | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | |the Hadoop cluster| > >>> | > >>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +--------> C. 
> >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Scala +-+ | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | +------------------+ > >>> | > >>>>> | > >>>>>>>> | > >>>>>>>>>>>>>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | A.| > >>> | > >>>>> | > >>>>>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +-----------------+ | | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | B.| > >>> | > >>>>> | > >>>>>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | C.| > >>> | > >>>>> | > >>>>>>>> | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | +----------------------+ > >>>>>>>>> +-v------+----+----+-+ > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +--------------v--v-+ | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | | | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | Local FS: | | > >> hdfs > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> Hive / Impala | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | - Binary/Text | | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> - Parquet - | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | Log files - | | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> | | | | > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> | | > >>>>>>>>>>>>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>>>>>>>>>>>>>> | +----------------------+ > >>>>>>>>> +--------------------+ > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> +-------------------+ | > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> +----------------------------- > >>>>>>>>>>>>>> ------------------------------ > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> -------------------------------+ > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> Please let me know your thoughts, > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> - Nathanael > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> -- > >>>>>>>>>>>>> Michael Ridley <[email protected]> > >>>>>>>>>>>>> office: (650) 352-1337 > >>>>>>>>>>>>> mobile: (571) 438-2420 > >>>>>>>>>>>>> Senior Solutions Architect > >>>>>>>>>>>>> Cloudera, Inc. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> -- > >>>>>>>>>>> Michael Ridley <[email protected]> > >>>>>>>>>>> office: (650) 352-1337 > >>>>>>>>>>> mobile: (571) 438-2420 > >>>>>>>>>>> Senior Solutions Architect > >>>>>>>>>>> Cloudera, Inc. > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > >> > >> > >> -- > >> Michael Ridley <[email protected]> > >> office: (650) 352-1337 > >> mobile: (571) 438-2420 > >> Senior Solutions Architect > >> Cloudera, Inc. > >> > >
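The Master/Worker split behind options A and C can be sketched as a toy. This is an illustrative sketch only, not the actual spot-ingest code: the names `master`, `worker`, and `landing_pad` are made up here, an in-memory queue stands in for whatever transport connects the two halves, and a Python list stands in for the Hive/Impala table.

```python
# Toy sketch of the ingest Master/Worker pattern discussed in the thread.
# Illustrative only: a real deployment would move file paths over a
# message bus and write Parquet to HDFS, not use an in-process queue.
import queue
import threading

landing_pad = queue.Queue()  # stands in for the master->worker transport
normalized = []              # stands in for the Hive/Impala table

def master(paths):
    """Master: publishes collector output file paths for workers."""
    for path in paths:
        landing_pad.put(path)
    landing_pad.put(None)    # sentinel tells the worker to stop

def worker():
    """Worker: consumes paths, normalizes each file, loads the 'DB'."""
    while True:
        path = landing_pad.get()
        if path is None:
            break
        # Normalization is faked as a one-record transform per file.
        normalized.append({"source": path, "format": "parquet"})

w = threading.Thread(target=worker)
w.start()
master(["/collector/nfcapd.202301011200", "/collector/syslog.0"])
w.join()
print(len(normalized))  # -> 2
```

Option B would move both halves into Scala; option C keeps the Python master/worker shell and delegates the normalization step to Scala/Spark code.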
