I agree. We should continue making the existing stack more mature at this point. Maybe, if we get enough community support, we can add additional datastores later.
Chokha.

On 4/14/17 11:10 AM, [email protected] wrote:
> Hi Kant,
>
> YARN is the standard scheduler in Hadoop. If you're using Hive + Spark,
> then sure, you'll have YARN.
>
> I haven't seen Hive on Mesos so far. As I said, Spot is based on a
> fairly standard Hadoop stack, and I wouldn't switch out too many
> pieces yet.
>
> Most open-source projects start by relying on a well-known stack and
> begin to support other DB backends once the project is quite mature.
> Think of the many LAMP apps that still haven't been ported away from
> MySQL.
>
> In any case, you'll need high-performance SQL + massive storage +
> machine learning + massive ingestion, and at the moment only Hadoop
> provides all of that.
>
> Regards!
>
> Kenneth
>
> On 2017-04-14 12:56, kant kodali wrote:
>> Hi Kenneth,
>>
>> Thanks for the response. I think you made a case for HDFS; however,
>> users may want to use S3 or some other filesystem, in which case they
>> could use Alluxio (hoping that no changes would be needed within
>> Spot, in which case I can agree to that). For example, Netflix stores
>> all its data in S3.
>>
>> The distributed SQL query engine, I would say, should be pluggable
>> with whatever the user wants to use, and there are a bunch of them
>> out there. Sure, Impala is better than Hive, but what if users are
>> already using something else, like Drill or Presto?
>>
>> Personally, I would not assume that users are willing to deploy all
>> of that and make their existing stack more complicated; at the very
>> least I would say it is an uphill battle. Things have been changing
>> rapidly in the big data space, so whatever we think is standard today
>> won't be standard tomorrow. More importantly, there shouldn't be any
>> reason why we can't be flexible, right?
>>
>> Also, I am not sure why only YARN. Why not make that more flexible
>> too, so users can pick Mesos or standalone?
>>
>> I think flexibility is key to wide adoption, rather than a tightly
>> coupled architecture.
>>
>> Thanks!
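The pluggable-query-engine idea raised above can be sketched as a thin abstraction layer that picks the SQL backend from configuration. This is only an illustration of the pattern, not Spot's actual code: `QueryEngine`, `get_engine`, and the stubbed backends are hypothetical names, and real implementations would call impyla, a JDBC driver, or `spark.sql()` instead of returning stub tuples.

```python
from abc import ABC, abstractmethod

class QueryEngine(ABC):
    """Minimal interface a pluggable SQL backend would implement."""

    @abstractmethod
    def execute(self, sql: str) -> list:
        """Run a query and return rows."""

class ImpalaEngine(QueryEngine):
    def execute(self, sql: str) -> list:
        # A real backend would go through impyla/ODBC; stubbed here.
        return [("impala", sql)]

class SparkSQLEngine(QueryEngine):
    def execute(self, sql: str) -> list:
        # A real backend would call spark.sql(sql); stubbed here.
        return [("spark", sql)]

# Registry keyed by a config value, so no engine is hard-coded.
ENGINES = {"impala": ImpalaEngine, "spark": SparkSQLEngine}

def get_engine(name: str) -> QueryEngine:
    """Instantiate the backend named in configuration."""
    return ENGINES[name]()
```

With this shape, supporting Drill or Presto means adding one class and one registry entry; the rest of the pipeline only ever sees `QueryEngine.execute`.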
>>
>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza <[email protected]>
>> wrote:
>>
>>> PS: you need a big data platform to be able to collect all those
>>> netflows and logs.
>>>
>>> Spot isn't intended for SMBs, that's clear: you need loads of data
>>> to get ML working properly, and somewhere to run those algorithms.
>>> That is Hadoop.
>>>
>>> Regards!
>>>
>>> Kenneth
>>>
>>> Sent from my Mi phone
>>>
>>> On Apr 14, 2017 4:04 AM, kant kodali <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for starting this thread. Here is my feedback.
>>>
>>> I think the architecture is too complicated for wide adoption, since
>>> it requires installing the following:
>>>
>>> - HDFS
>>> - Hive
>>> - Impala
>>> - Kafka
>>> - Spark (on YARN)
>>> - YARN
>>> - ZooKeeper
>>>
>>> There are currently way too many dependencies, which discourages a
>>> lot of users, because they have to deploy all of that required
>>> software. For wide adoption we should minimize the dependencies and
>>> have a more pluggable architecture. For example, I am not sure why
>>> both Hive and Impala are required. Why not just use Spark SQL, since
>>> Spark is already a dependency? Or users may want to use their own
>>> preferred distributed query engine, such as Apache Drill. We should
>>> be flexible enough to provide that option.
>>>
>>> Also, I see that HDFS is used so that collectors can receive file
>>> paths through Kafka and then read the files. How big are these
>>> files? Do we really need HDFS for this? Why not provide more ways to
>>> send data, such as sending data directly through Kafka, or simply
>>> leaving it up to the user to specify the file location as an
>>> argument to the collector process?
>>>
>>> Finally, I learned that generating NetFlow data requires specific
>>> hardware. This really means Apache Spot is not meant for everyone.
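The ingest flexibility asked about above (file paths via Kafka versus data sent directly through Kafka) could be sketched as a collector that dispatches on message type. The JSON message shape below is invented for illustration and is not Spot's actual Kafka contract; a real collector would receive `raw` from a Kafka consumer rather than a function argument.

```python
import json

def handle_message(raw: bytes) -> bytes:
    """Dispatch one ingest message to the right acquisition strategy.

    Two hypothetical message shapes are supported:
      {"type": "path",   "value": "/data/flows.bin"}  -> read that file
      {"type": "inline", "value": "<payload>"}        -> use payload as-is
    """
    msg = json.loads(raw)
    if msg["type"] == "inline":
        # Data travelled through Kafka itself; no shared filesystem needed.
        return msg["value"].encode()
    if msg["type"] == "path":
        # Data is referenced by path; the path could point at HDFS, S3
        # via a mount, or plain local disk.
        with open(msg["value"], "rb") as f:
            return f.read()
    raise ValueError("unknown message type: %s" % msg["type"])
```

Keeping the path case as just "open whatever the path names" is what lets the storage backend stay pluggable: the collector never assumes HDFS specifically.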
>>> I thought Apache Spot could be used to analyze the network traffic
>>> of any machine, but if it requires specific hardware then I think it
>>> is targeted at a specific group of people.
>>>
>>> The real strength of Apache Spot should mainly be analyzing network
>>> traffic through ML.
>>>
>>> Thanks!
>>>
>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L <
>>> [email protected]> wrote:
>>>
>>> > Thanks, Nate.
>>> >
>>> > Nate.
>>> >
>>> > -----Original Message-----
>>> > From: Nate Smith [mailto:[email protected]]
>>> > Sent: Thursday, April 13, 2017 4:26 PM
>>> > To: [email protected]
>>> > Cc: [email protected]; [email protected]
>>> > Subject: Re: [Discuss] - Future plans for Spot-ingest
>>> >
>>> > I was really hoping it came through OK.
>>> > Oh well :)
>>> > Here it is in image form:
>>> > http://imgur.com/a/DUDsD
>>> >
>>> > > On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L <
>>> > > [email protected]> wrote:
>>> > >
>>> > > The diagram became garbled in the text format.
>>> > > Could you resend it as a PDF?
>>> > >
>>> > > Thanks,
>>> > > Nate
>>> > >
>>> > > -----Original Message-----
>>> > > From: Nathanael Smith [mailto:[email protected]]
>>> > > Sent: Thursday, April 13, 2017 4:01 PM
>>> > > To: [email protected]; [email protected];
>>> > > [email protected]
>>> > > Subject: [Discuss] - Future plans for Spot-ingest
>>> > >
>>> > > How would you like to see Spot-ingest change?
>>> > >
>>> > > A. Continue development on the Python Master/Worker, with a
>>> > >    focus on performance / error handling / logging.
>>> > > B. Develop a Scala-based ingest to be in line with the code base
>>> > >    from ingest and ML through to OA (the UI to continue being
>>> > >    IPython/JS).
>>> > > C. Python ingest Worker with Scala-based Spark code for
>>> > >    normalization and input into the DB.
>>> > >
>>> > > Including the high-level diagram:
>>> > >
>>> > > +--------------------+   A./B./C.    +----------------------+
>>> > > |       Master       +-------------->|        Worker        |
>>> > > |  A. Python         |               |  A. Python           |
>>> > > |  B. Scala          |               |  B. Scala  (Spark    |
>>> > > |  C. Python         |               |  C. Scala  Streaming)|
>>> > > +----^----------+----+               +-----------+----------+
>>> > >      |          |                                |
>>> > >      |          v                                v
>>> > > +----+-------+  +--------+           +-------------------+
>>> > > | Local FS:  |  |  hdfs  |           |  Hive / Impala    |
>>> > > | binary/    |  |        |           |   - Parquet -     |
>>> > > | text log   |  |        |           |                   |
>>> > > | files      |  |        |           |                   |
>>> > > +------------+  +--------+           +-------------------+
>>> > >
>>> > > Note: the Master runs on a worker node in the Hadoop cluster.
>>> > >
>>> > > Please let me know your thoughts,
>>> > >
>>> > > - Nathanael
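Option A's Python Master/Worker split can be sketched with an in-process queue standing in for Kafka. Everything here is illustrative, not Spot's actual ingest code: in the real pipeline the Master publishes HDFS paths to a Kafka topic and the Worker normalizes the files into Hive/Impala, while this sketch only records what would be written.

```python
import queue
import threading

def master(paths, q):
    """Master: publish newly landed file paths (queue stands in for Kafka)."""
    for p in paths:
        q.put(p)
    q.put(None)  # sentinel: no more work

def worker(q, results):
    """Worker: consume paths and record the normalization step."""
    while True:
        p = q.get()
        if p is None:
            break
        # A real worker would parse/normalize the file and load it into
        # the datastore; here we just note what would happen.
        results.append(("normalized", p))

def run_pipeline(paths):
    """Wire one Master to one Worker and drain the work queue."""
    q = queue.Queue()
    results = []
    t = threading.Thread(target=worker, args=(q, results))
    t.start()
    master(paths, q)
    t.join()
    return results
```

The value of this shape for the discussion above is that the Master/Worker boundary is just a message stream, so the Worker side could be swapped for Scala or Spark Streaming (options B and C) without touching the Master.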
