"AVRO for ingestion, Parquet for storage." This makes sense to me.
On Wed, Apr 19, 2017 at 1:45 PM, Michael Ridley <[email protected]> wrote:

One point I wanted to add the other day is that we probably do need to write to Avro for the streaming ingest, since Parquet doesn't do so great with streaming ingest. But then we can convert from Avro to Parquet using whatever tool (SparkSQL, Hive, whatever) for query. Whether the Avro is persisted is an open question in my mind.

Michael

On Wed, Apr 19, 2017 at 4:22 PM, <[email protected]> wrote:

Replying to myself, AVRO for ingestion, Parquet for storage.

Regards!

Kenneth

On 2017-04-19 22:05, Austin Leahy wrote:

I think Kafka is probably a red herring. It's an industry go-to in the application world because of redundancy, but the type and volumes of network telemetry that we are talking about here will bog Kafka down unless you dedicate really serious hardware to just the Kafka implementation. It's essentially the next level of the problem that the team was already running into when RabbitMQ was queueing in data.

On Wed, Apr 19, 2017 at 12:33 PM Mark Grover <[email protected]> wrote:

On Wed, Apr 19, 2017 at 10:19 AM, Smith, Nathanael P <[email protected]> wrote:

Mark,

just digesting the below.

Backing up in my thought process, I was thinking that the ingest master (first point of entry into the system) would want to put the data into a standard serializable format. I was thinking that libraries (such as pyarrow in this case) could help by writing the data in Parquet format early in the process. You are probably correct that at this point in time it might not be worth the time and can be kept in the backlog.
That being said, I still think the master should produce data in a standard format; what, in your opinion (and I open this up of course to others), would be the most logical format? The most basic would be to just keep it as a .csv.

The master will likely write data to a staging directory in HDFS where the Spark Streaming job will pick it up for normalization/writing to Parquet in the correct block sizes and partitions.

Hi Nate,
Avro is usually preferred for such a standard format - because it asserts a schema (types, etc.), which CSV doesn't, and it allows for schema evolution, which, depending on the type of evolution, CSV may or may not support. And that's something I have seen being done very commonly.

Now, if the data were in Kafka before it gets to the master, one could argue that the master could just send metadata to the workers (topic name, partition number, offset start and end) and the workers could read from Kafka directly. I do understand that'd be a much different architecture than the current one, but if you think it's a good idea too, we could document that, say in a JIRA, and (de-)prioritize it (and in line with the rest of the discussion on this thread, it's not the top-most priority). Thoughts?

- Nathanael
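[Inline note: the metadata-only hand-off Mark describes above could look roughly like this on the worker side. This is just a sketch using the kafka-python client; the broker address, topic name and offsets are made up for illustration.]

    # Sketch: a worker consumes exactly the offset range the master assigned it.
    # Broker, topic, partition and offsets below are illustrative only.
    from kafka import KafkaConsumer, TopicPartition

    def read_assignment(bootstrap, topic, partition, start_offset, end_offset):
        consumer = KafkaConsumer(bootstrap_servers=bootstrap,
                                 enable_auto_commit=False,
                                 consumer_timeout_ms=10000)
        tp = TopicPartition(topic, partition)
        consumer.assign([tp])
        consumer.seek(tp, start_offset)
        records = []
        for msg in consumer:               # stops at end_offset (exclusive) or on timeout
            if msg.offset >= end_offset:
                break
            records.append(msg.value)
        consumer.close()
        return records

    # e.g. chunk = read_assignment("broker:9092", "spot-flow", 3, 1000, 2000)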
On Apr 17, 2017, at 1:12 PM, Mark Grover <[email protected]> wrote:

Thanks all for your opinion.

I think it's good to consider two things:
1. What do (we think) users care about?
2. What's the cost of changing things?

About #1, I think users care more about what format the data is written in than how the data is written. I'd argue that whether that uses Hive, MR, or a custom Parquet writer is not as important to them as long as we maintain data/format compatibility.
About #2, having worked on several projects, I find that it's rather difficult to keep up with Parquet. Even in Spark, there are a few different ways to write to Parquet - there's a regular mode, and a legacy mode <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44> which continues to cause confusion <https://issues.apache.org/jira/browse/SPARK-20297> to date. Parquet itself is pretty dependent on Hadoop <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93> and just integrating it with systems with a lot of developers (like Spark <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>) is still a lot of work.

I personally think we should leverage higher-level tools like Hive or Spark to write data in widespread formats (Parquet being a very good example), but I personally wouldn't encourage us to manage the writers ourselves.

Thoughts?
Mark

On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote:

Without having given it too terribly much thought, that seems like an OK approach.

Michael

On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> wrote:

I think the question is whether we can write the data generically to HDFS as Parquet without the use of Hive/Impala.

Today we write Parquet data using the Hive/MapReduce method.
As part of the redesign I'd like to use libraries for this as opposed to a Hadoop dependency.
I think it would be preferred to use the Python master to write the data into the format we want, then do normalization of the data in Spark Streaming.
Any thoughts?

- Nathanael
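[Inline note: a rough sketch of what "use the Python master to write the data" could look like with pyarrow, i.e. no Hive/MapReduce in the write path. The field names and staging path are invented for illustration, not Spot's actual schema.]

    # Sketch: write a batch of normalized records to a Parquet file directly
    # from the Python master using pyarrow. Schema and path are illustrative.
    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_batch(records, path):
        # records: list of dicts, e.g. [{"tstart": ..., "srcip": ..., "dstip": ...}]
        columns = {key: [r[key] for r in records] for key in records[0]}
        table = pa.Table.from_pydict(columns)
        pq.write_table(table, path, compression="snappy")

    # write_batch(batch, "/staging/flow/part-00001.parquet")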
On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> wrote:

I had thought that the plan was to write the data in Parquet in HDFS ultimately.

Michael

On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> wrote:

Hi Mark,

Thank you so much for hearing my argument. And I definitely understand that you guys have a bunch of things to do. My only concern is that I hope it doesn't take too long to support other backends. For example, @Kenneth had given the example of the LAMP stack not having moved away from MySQL yet, which essentially means it's probably a decade? I see that in the current architecture the results from the Python multiprocessing or Spark Streaming are written back to HDFS; if so, can we write them in Parquet format, such that users are able to plug in any query engine? But again, I am not pushing you guys to do this right away or anything, just seeing if there is a way for me to get started in parallel, and if it's not feasible, that's fine. I just wanted to share my 2 cents and I am glad my argument is heard!

Thanks much!

On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote:

Hi Kant,
Just wanted to make sure you don't feel like we are ignoring your comment :-) I hear you about pluggability.

The design can and should be pluggable, but the project has one stack it ships out of the box with, one stack that's the default in the sense that it's the most tested and so on. And, for us, that's our current stack.
If we were to take Apache Hive as an example, it shipped (and ships) with MapReduce as the default execution engine. At some point, Apache Tez came along and wanted Hive to run on Tez, so they made a bunch of things pluggable to run Hive on Tez (instead of the only option up until then: Hive-on-MR), and then Apache Spark came and re-used some of that pluggability and even added some more so Hive-on-Spark could become a reality. In the same way, I don't think anyone disagrees here that pluggability is a good thing, but it's hard to do pluggability right, and at the right level, unless one has a clear use-case in mind.

As a project, we have many things to do, and I personally think the biggest bang for the buck for us in making Spot a really solid and the best cyber security solution isn't pluggability but the things we are working on - a better user interface, a common/unified approach to storing and modeling data, etc.

Having said that, we are open; if it's important to you or someone else, we'd be happy to receive and review those patches.

Thanks!
Mark

On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> wrote:

Thanks Ross! And yes, option C sounds good to me as well; however, I just think the distributed SQL query engine and the resource manager should be pluggable.
On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D <[email protected]> wrote:

Option C is to use Python on the front end of the ingest pipeline and Spark/Scala on the back end.

Option A uses Python workers on the backend.

Option B uses all Scala.

-----Original Message-----
From: kant kodali [mailto:[email protected]]
Sent: Friday, April 14, 2017 9:53 AM
To: [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

What is option C? Am I missing an email or something?

On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai <[email protected]> wrote:

+1 for Python 3.x

On 4/14/2017 11:59 AM, Austin Leahy wrote:

I think that C is the strong solution; getting the ingest really strong is going to lower barriers to adoption. Doing it in Python will open up the ingest portion of the project to include many more developers.

Before it comes up, I would like to throw the following on the pile... Major Python projects (Django/Flask, among others) are dropping 2.x support in releases scheduled in the next 6 to 8 months. Hadoop projects in general tend to lag in modern Python support, so let's please build this in 3.5 so that we don't have to immediately expect a rebuild in the pipeline.

-Vote C

Thanks Nate

Austin

On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> wrote:

I really like option C because it gives a lot of flexibility for ingest (Python vs Scala) but still has the robust Spark Streaming backend for performance.

Thanks for putting this together Nate.

Alan

On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai <[email protected]> wrote:

I agree. We should continue making the existing stack more mature at this point. Maybe if we have enough community support we can add additional datastores.

Chokha.

On 4/14/17 11:10 AM, [email protected] wrote:

Hi Kant,

YARN is the standard scheduler in Hadoop. If you're using Hive+Spark, then sure you'll have YARN.
Haven't seen any Hive on Mesos so far. As said, Spot is based on a quite standard Hadoop stack and I wouldn't switch too many pieces yet.

In most open-source projects you start relying on a well-known stack and then you begin to support other DB backends once it's quite mature. Think of the loads of LAMP apps which haven't been ported away from MySQL yet.

In any case, you'll need high-performance SQL + massive storage + machine learning + massive ingestion, and... ATM, that can only be provided by Hadoop.

Regards!

Kenneth

On 2017-04-14 12:56, kant kodali wrote:

Hi Kenneth,

Thanks for the response. I think you made a case for HDFS; however, users may want to use S3 or some other FS, in which case they can use Alluxio (hoping that there are no changes needed within Spot, in which case I can agree to that). For example, Netflix stores all their data in S3.

The distributed SQL query engine, I would say, should be pluggable with whatever the user may want to use, and there are a bunch of them out there. Sure, Impala is better than Hive, but what if users are already using something else like Drill or Presto?

Personally, I would not assume that users are willing to deploy all of that and make their existing stack more complicated; at the very least I would say it is an uphill battle. Things have been changing rapidly in the Big Data space, so whatever we think is standard won't be standard anymore, but more importantly there shouldn't be any reason why we shouldn't be flexible, right?

Also, I am not sure why only YARN? Why not make that also more flexible so users can pick Mesos or standalone? I think flexibility is key for wide adoption rather than a tightly coupled architecture.

Thanks!
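[Inline note: on the S3-vs-HDFS point, with Spark the storage layer is largely a matter of the path URI, assuming the right connector (e.g. hadoop-aws for s3a) and credentials are on the cluster. A tiny illustrative sketch; the bucket and paths are made up.]

    # Sketch: the same Parquet read against HDFS or S3, differing only in the URI.
    # The S3 case needs the s3a connector (hadoop-aws) and credentials configured.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fs-pluggability").getOrCreate()

    flows_hdfs = spark.read.parquet("hdfs://namenode:8020/spot/flow/")
    flows_s3   = spark.read.parquet("s3a://example-bucket/spot/flow/")  # illustrative bucket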
On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]> wrote:

PS: you need a big data platform to be able to collect all those netflows and logs.

Spot isn't intended for SMBs, that's clear; you need loads of data to get ML working properly, and somewhere to run those algorithms. That is Hadoop.

Regards!

Kenneth

Sent from my Mi phone

On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:

Hi,

Thanks for starting this thread. Here is my feedback.

I somehow think the architecture is too complicated for wide adoption since it requires installing the following:

HDFS
Hive
Impala
Kafka
Spark (YARN)
YARN
Zookeeper

Currently there are way too many dependencies, which discourages a lot of users from using it because they have to go through deployment of all that required software. I think for wide adoption we should minimize the dependencies and have a more pluggable architecture. For example, I am not sure why Hive & Impala are both required? Why not just use Spark SQL, since it's already a dependency, or say users may want to use their own distributed query engine, such as Apache Drill or something else. We should be flexible enough to provide that option.

Also, I see that HDFS is used such that collectors can receive file paths through Kafka and be able to read a file. How big are these files? Do we really need HDFS for this? Why not provide more ways to send data, such as sending data directly through Kafka, or say just leaving it up to the user to specify the file location as an argument to the collector process.

Finally, I learnt that to generate NetFlow data one would require specific hardware. This really means Apache Spot is not meant for everyone. I thought Apache Spot could be used to analyze the network traffic of any machine, but if it requires specific hardware then I think it is targeted at a specific group of people.

The real strength of Apache Spot should mainly be just analyzing network traffic through ML.

Thanks!
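[Inline note: the two hand-off styles mentioned above, side by side, sketched with the kafka-python producer. Topic names and file paths are made up for illustration.]

    # Sketch: shipping only an HDFS file path through Kafka (current style) vs.
    # shipping the records themselves. Topics and paths are illustrative.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="broker:9092")

    # Current style: the collector publishes the path of a file it landed in HDFS.
    producer.send("spot-flow-paths", b"/user/spot/stage/flow/nfcapd.20170419.csv")

    # Alternative: publish the raw records directly, no shared filesystem needed.
    with open("nfcapd.20170419.csv", "rb") as f:
        for line in f:
            producer.send("spot-flow-records", line.rstrip())

    producer.flush()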
On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <[email protected]> wrote:

Thanks, Nate,

Nate.

-----Original Message-----
From: Nate Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:26 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: [Discuss] - Future plans for Spot-ingest

I was really hoping it came through OK. Oh well :) Here it is in image form:
http://imgur.com/a/DUDsD

On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <[email protected]> wrote:

The diagram became garbled in the text format.
Could you resend it as a pdf?

Thanks,
Nate

-----Original Message-----
From: Nathanael Smith [mailto:[email protected]]
Sent: Thursday, April 13, 2017 4:01 PM
To: [email protected]; [email protected]; [email protected]
Subject: [Discuss] - Future plans for Spot-ingest

How would you like to see Spot-ingest change?
A. Continue development on the Python Master/Worker with focus on performance / error handling / logging
B. Develop Scala-based ingest to be in line with the code base from ingest, ml, to OA (UI to continue being IPython/JS)
C. Python ingest Worker with Scala-based Spark code for normalization and input into the DB

Including the high level diagram:

[The ASCII diagram was garbled in the plain-text archive; a rendered copy is at http://imgur.com/a/DUDsD. At a high level it shows a Master (Python for options A and C, Scala for B) handing off to Workers and/or a Scala Spark Streaming job running on worker nodes in the Hadoop cluster (depending on option A, B or C), with binary/text log files on the local FS on one side and Parquet in HDFS queried via Hive/Impala on the other.]

Please let me know your thoughts,

- Nathanael

--
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
