Thanks, all, for your opinions. I think it's good to consider two things: 1. What do (we think) users care about? 2. What's the cost of changing things?
About #1, I think users care more about what format the data is written in than how the data is written. I'd argue that whether it's written using Hive, MR, or a custom Parquet writer is not as important to them as long as we maintain data/format compatibility. About #2, having worked on several projects, I find that it's rather difficult to keep up with Parquet. Even in Spark, there are a few different ways to write Parquet - there's a regular mode and a legacy mode <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L44> which continues to cause confusion <https://issues.apache.org/jira/browse/SPARK-20297> to this day. Parquet itself is pretty dependent on Hadoop <https://github.com/Parquet/parquet-mr/search?l=Maven+POM&q=hadoop&type=&utf8=%E2%9C%93> and just integrating it with systems with a lot of developers (like Spark <https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=spark+parquet+jiras>) is still a lot of work. I personally think we should leverage higher-level tools like Hive or Spark to write data in widespread formats (Parquet being a very good example), but I wouldn't encourage us to manage the writers ourselves. Thoughts? Mark On Mon, Apr 17, 2017 at 11:44 AM, Michael Ridley <[email protected]> wrote: > Without having given it too terribly much thought, that seems like an OK > approach. > > Michael > > On Mon, Apr 17, 2017 at 2:33 PM, Nathanael Smith <[email protected]> > wrote: > > > I think the question is whether we can write the data generically to HDFS > > as Parquet without the use of Hive/Impala. > > > > Today we write Parquet data using the Hive/MapReduce method. > > As part of the redesign I'd like to use libraries for this as opposed to a > > Hadoop dependency. > > I think it would be preferred to use the Python master to write the data > > into the format we want, then do normalization of the data in Spark > > Streaming. 
> > Any thoughts? > > > > - Nathanael > > > > > > > > > On Apr 17, 2017, at 11:08 AM, Michael Ridley <[email protected]> > > wrote: > > > > > > I had thought that the plan was to write the data in Parquet in HDFS > > > ultimately. > > > > > > Michael > > > > > > On Sun, Apr 16, 2017 at 11:55 AM, kant kodali <[email protected]> > > wrote: > > > > > >> Hi Mark, > > >> > > >> Thank you so much for hearing my argument. And I definitely understand > > that > > >> you guys have a bunch of things to do. My only concern is that I hope it > > >> doesn't take too long to support other backends. For example, @Kenneth > > had > > >> given the example of the LAMP stack, which hasn't moved away from MySQL yet - which > > >> essentially means it's probably a decade? I see that in the current > > >> architecture the results from Python multiprocessing or Spark > > >> Streaming are written back to HDFS. If so, can we write them in Parquet > > >> format, such that users would be able to plug in any query engine? But > > >> again, I am not pushing you guys to do this right away or anything, just > > >> seeing if there's a way for me to get started in parallel; if it's not > > >> feasible, that's fine. I just wanted to share my 2 cents and I am glad my > > >> argument is heard! > > >> > > >> Thanks much! > > >> > > >> On Fri, Apr 14, 2017 at 1:38 PM, Mark Grover <[email protected]> wrote: > > >> > > >>> Hi Kant, > > >>> Just wanted to make sure you don't feel like we are ignoring your > > >>> comment :-) I hear you about pluggability. > > >>> > > >>> The design can and should be pluggable, but the project has one stack it > > >>> ships out of the box with - one stack that's the default in the > > >> sense > > >>> that it's the most tested and so on. And, for us, that's our current > > >> stack. > > >>> If we were to take Apache Hive as an example, it shipped (and ships) > > with > > >>> MapReduce as the default execution engine. 
At some point, Apache > > Tez > > >>> came along and wanted Hive to run on Tez, so they made a bunch of > > things > > >>> pluggable to run Hive on Tez (instead of the only option up until then: > > >>> Hive-on-MR), and then Apache Spark came along and re-used some of that > > >>> pluggability and even added some more so that Hive-on-Spark could become a > > >>> reality. In the same way, I don't think anyone disagrees here that > > >>> pluggability is a good thing, but it's hard to do pluggability right, and > > >> at > > >>> the right level, unless one has a clear use-case in mind. > > >>> > > >>> As a project, we have many things to do, and I personally think the > > >> biggest > > >>> bang for the buck for us in making Spot a really solid and the best > > cyber > > >>> security solution isn't pluggability but the things we are working on - a > > >>> better user interface, a common/unified approach to storing and > > modeling > > >>> data, etc. > > >>> > > >>> Having said that, we are open: if it's important to you or someone > > else, > > >>> we'd be happy to receive and review those patches. > > >>> > > >>> Thanks! > > >>> Mark > > >>> > > >>> On Fri, Apr 14, 2017 at 10:14 AM, kant kodali <[email protected]> > > >> wrote: > > >>> > > >>>> Thanks, Ross! And yes, option C sounds good to me as well; however, I just > > >>>> think the distributed SQL query engine and the resource manager should be > > >>>> pluggable. > > >>>> > > >>>> On Fri, Apr 14, 2017 at 9:55 AM, Ross, Alan D < [email protected]> > > >>>> wrote: > > >>>>> Option C is to use Python on the front end of the ingest pipeline and > > >>>>> Spark/Scala on the back end. > > >>>>> > > >>>>> Option A uses Python workers on the backend. > > >>>>> > > >>>>> Option B uses all Scala. 
> > >>>>> > > >>>>> > > >>>>> -----Original Message----- > > >>>>> From: kant kodali [mailto:[email protected]] > > >>>>> Sent: Friday, April 14, 2017 9:53 AM > > >>>>> To: [email protected] > > >>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest > > >>>>> > > >>>>> What is option C? Am I missing an email or something? > > >>>>> > > >>>>> On Fri, Apr 14, 2017 at 9:15 AM, Chokha Palayamkottai < > > >>>>> [email protected]> wrote: > > >>>>> > > >>>>>> +1 for Python 3.x > > >>>>>> > > >>>>>> On 4/14/2017 11:59 AM, Austin Leahy wrote: > > >>>>>> > > >>>>>>> I think that C is the strong solution; getting the ingest really > > >>>>>>> strong is going to lower barriers to adoption. Doing it in Python > > >>>>>>> will open up the ingest portion of the project to include many > > >> more > > >>>>> developers. > > >>>>>>> > > >>>>>>> Before it comes up, I would like to throw the following on the > > >>> pile... > > >>>>>>> Major > > >>>>>>> Python projects (Django/Flask, among others) are dropping 2.x support in > > >>>>>>> releases scheduled in the next 6 to 8 months. Hadoop projects in > > >>>>>>> general tend to lag in modern Python support, so let's please build > > >> this > > >>>>>>> in 3.5 so that we don't have to immediately expect a rebuild in > > >> the > > >>>>>>> pipeline. > > >>>>>>> > > >>>>>>> -Vote C > > >>>>>>> > > >>>>>>> Thanks, Nate > > >>>>>>> > > >>>>>>> Austin > > >>>>>>> > > >>>>>>> On Fri, Apr 14, 2017 at 8:52 AM Alan Ross <[email protected]> > > >> wrote: > > >>>>>>> > > >>>>>>>> I really like option C because it gives a lot of flexibility for > > >>>>>>>> ingest (Python vs. Scala) but still has the robust Spark Streaming > > >> backend > > >>>>>>>> for performance. > > >>>>>>>> > > >>>>>>>> Thanks for putting this together, Nate. > > >>>>>>>> > > >>>>>>>> Alan > > >>>>>>>> > > >>>>>>>> On Fri, Apr 14, 2017 at 8:44 AM, Chokha Palayamkottai < > > >>>>>>>> [email protected]> wrote: > > >>>>>>>> > > >>>>>>>> I agree. 
We should continue making the existing stack more mature > > >>> at > > >>>>>>>>> this point. Maybe if we have enough community support we can add > > >>>>>>>>> additional datastores. > > >>>>>>>>> > > >>>>>>>>> Chokha. > > >>>>>>>>> > > >>>>>>>>> On 4/14/17 11:10 AM, [email protected] wrote: > > >>>>>>>>> > > >>>>>>>>>> Hi Kant, > > >>>>>>>>>> > > >>>>>>>>>> YARN is the standard scheduler in Hadoop. If you're using > > >>>>>>>>>> Hive+Spark, then sure, you'll have YARN. > > >>>>>>>>>> > > >>>>>>>>>> Haven't seen any Hive on Mesos so far. As said, Spot is based > > >> on a > > >>>>>>>>>> quite standard Hadoop stack and I wouldn't switch too many > > >> pieces > > >>>>> yet. > > >>>>>>>>>> > > >>>>>>>>>> In most open-source projects you start by relying on a well-known > > >>>>>>>>>> stack and then you begin to support other DB backends once it's > > >>>>>>>>>> quite mature. Think of the loads of LAMP apps which haven't > > >> been > > >>>>>>>>>> ported away from MySQL yet. > > >>>>>>>>>> > > >>>>>>>>>> In any case, you'll need high-performance SQL + massive storage > > >>>>>>>>>> + machine learning + massive ingestion, and... ATM, that can > > >>>>>>>>>> only be provided by Hadoop. > > >>>>>>>>>> > > >>>>>>>>>> Regards! > > >>>>>>>>>> > > >>>>>>>>>> Kenneth > > >>>>>>>>>> > > >>>>>>>>>> On 2017-04-14 12:56, kant kodali wrote: > > >>>>>>>>>> > > >>>>>>>>>>> Hi Kenneth, > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks for the response. I think you made a case for HDFS; > > >>>>>>>>>>> however, users may want to use S3 or some other FS, in which > > >> case > > >>>>>>>>>>> they can use Alluxio (hoping that there are no changes needed > > >>>>>>>>>>> within Spot, in which case I can 
for example, Netflix stores all there data into > > >> S3 > > >>>>>>>>>>> > > >>>>>>>>>>> The distributed sql query engine I would say should be > > >> pluggable > > >>>>>>>>>>> with whatever user may want to use and there a bunch of them > > >> out > > >>>>> there. > > >>>>>>>>>>> > > >>>>>>>>>> sure > > >>>>>>>> > > >>>>>>>>> Impala is better than hive but what if users are already using > > >>>>>>>>>>> > > >>>>>>>>>> something > > >>>>>>>> > > >>>>>>>>> else like Drill or Presto? > > >>>>>>>>>>> > > >>>>>>>>>>> Me personally, would not assume that users are willing to > > >> deploy > > >>>>>>>>>>> all > > >>>>>>>>>>> > > >>>>>>>>>> of > > >>>>>>>> > > >>>>>>>>> that and make their existing stack more complicated at very > > >> least > > >>> I > > >>>>>>>>>>> would > > >>>>>>>>>>> say it is a uphill battle. Things have been changing rapidly > > >> in > > >>>>>>>>>>> Big > > >>>>>>>>>>> > > >>>>>>>>>> data > > >>>>>>>> > > >>>>>>>>> space so whatever we think is standard won't be standard > anymore > > >>>>>>>>> but > > >>>>>>>>>>> importantly there shouldn't be any reason why we shouldn't be > > >>>>>>>>>>> flexible right. > > >>>>>>>>>>> > > >>>>>>>>>>> Also I am not sure why only YARN? why not make that also more > > >>>>>>>>>>> flexible so users can pick Mesos or standalone. > > >>>>>>>>>>> > > >>>>>>>>>>> I think Flexibility is a key for a wide adoption rather than > > >> the > > >>>>>>>>>>> > > >>>>>>>>>> tightly > > >>>>>>>> > > >>>>>>>>> coupled architecture. > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks! > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> On Fri, Apr 14, 2017 at 3:12 AM, Kenneth Peiruza > > >>>>>>>>>>> <[email protected]> > > >>>>>>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> PS: you need a big data platform to be able to collect all > > >> those > > >>>>>>>>>>>> netflows > > >>>>>>>>>>>> and logs. 
> > >>>>>>>>>>>> > > >>>>>>>>>>>> Spot isn't intended for SMBs, that's clear; you need > > >> loads > > >>>>>>>>>>>> of data to get ML working properly, and somewhere to run > > >> those > > >>>>>>>>>>>> algorithms. That > > >>>>>>>>>>>> is > > >>>>>>>>>>>> Hadoop. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Regards! > > >>>>>>>>>>>> > > >>>>>>>>>>>> Kenneth > > >>>>>>>>>>>> > > >>>>>>>>>>>> Sent from my Mi phone > > >>>>>>>>>>>> On kant kodali <[email protected]>, Apr 14, 2017 4:04 AM > > >>> wrote: > > >>>>>>>>>>>> > > >>>>>>>>>>>> Hi, > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks for starting this thread. Here is my feedback. > > >>>>>>>>>>>> > > >>>>>>>>>>>> I somehow think the architecture is too complicated for wide > > >>>>>>>>>>>> adoption, since it requires installing the following: > > >>>>>>>>>>>> > > >>>>>>>>>>>> HDFS. > > >>>>>>>>>>>> HIVE. > > >>>>>>>>>>>> IMPALA. > > >>>>>>>>>>>> KAFKA. > > >>>>>>>>>>>> SPARK (YARN). > > >>>>>>>>>>>> YARN. > > >>>>>>>>>>>> Zookeeper. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Currently there are way too many dependencies, which > > >> discourages > > >>>>>>>>>>>> a lot of users from using it because they have to go through > > >>>>>>>>>>>> deployment of all that required software. I think for wide > > >>>>>>>>>>>> adoption we should minimize the dependencies and have a more > > >>>>>>>>>>>> pluggable architecture. For example, I am > > >>>>>>>>>>>> not > > >>>>>>>>>>>> sure why both HIVE & IMPALA are required - why not just use Spark > > >>>>>>>>>>>> SQL, > > >>>>>>>>>>>> since it's already a dependency? Or users may want to use their own > > >>>>>>>>>>>> distributed query engine they like, such as Apache Drill or > > >>>>>>>>>>>> something else. 
We should be flexible enough to provide that > > >>>>>>>>>>>> option. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Also, I see that HDFS is used such that collectors can > > >> receive > > >>>>>>>>>>>> file paths through Kafka and be able to read a file. How big > > >>>>>>>>>>>> are these files? > > >>>>>>>>>>>> Do we > > >>>>>>>>>>>> really need HDFS for this? Why not provide more ways to send > > >>>>>>>>>>>> data, such as sending data directly through Kafka, or just > > >>>>>>>>>>>> leaving it up to the user to specify the file location as an > > >>>>>>>>>>>> argument to the collector process? > > >>>>>>>>>>>> > > >>>>>>>>>>>> Finally, I learned that to generate NetFlow data one would > > >>>>>>>>>>>> require specific hardware. This really means Apache Spot is > > >>>>>>>>>>>> not meant for everyone. > > >>>>>>>>>>>> I thought Apache Spot could be used to analyze the network > > >>> traffic > > >>>>>>>>>>>> of > > >>>>>>>>>>>> any > > >>>>>>>>>>>> machine, but if it requires specific hardware then I think it is > > >>>>>>>>>>>> targeted at a > > >>>>>>>>>>>> specific group of people. > > >>>>>>>>>>>> > > >>>>>>>>>>>> The real strength of Apache Spot should mainly be just > > >>> analyzing > > >>>>>>>>>>>> network traffic through ML. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thanks! > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Thu, Apr 13, 2017 at 4:28 PM, Segerlind, Nathan L < > > >>>>>>>>>>>> [email protected]> wrote: > > >>>>>>>>>>>> > > >>>>>>>>>>>>> Thanks, Nate, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Nate. 
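On the suggestion above of sending data directly through Kafka instead of passing HDFS file paths: a rough sketch of the producer side, assuming the kafka-python package and a made-up topic name (neither is part of the current Spot stack):

```python
import json

def serialize(record):
    """Encode one record as UTF-8 JSON bytes, the form a Kafka producer would send."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

# With kafka-python (an assumption, not Spot's current client), sending the
# records themselves rather than file paths would look roughly like:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092",
#                            value_serializer=serialize)
#   producer.send("spot-ingest", {"src_ip": "10.0.0.1", "num_bytes": 1200})
#   producer.flush()

msg = serialize({"num_bytes": 1200, "src_ip": "10.0.0.1"})
```

Whether this beats file paths likely depends on record size: per-record messages avoid HDFS for small flow records, while large pcap-style files are cheaper to pass by reference.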
> > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> -----Original Message----- > > >>>>>>>>>>>>> From: Nate Smith [mailto:[email protected]] > > >>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:26 PM > > >>>>>>>>>>>>> To: [email protected] > > >>>>>>>>>>>>> Cc: [email protected]; > > >>>>>>>>>>>>> > > >>>>>>>>>>>> [email protected] > > >>>>>>>> > > >>>>>>>>> Subject: Re: [Discuss] - Future plans for Spot-ingest > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I was really hoping it came through ok, Oh well :) Here’s > an > > >>>>>>>>>>>>> image form: > > >>>>>>>>>>>>> http://imgur.com/a/DUDsD > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Apr 13, 2017, at 4:05 PM, Segerlind, Nathan L < > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> [email protected]> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> The diagram became garbled in the text format. > > >>>>>>>>>>>>>> Could you resend it as a pdf? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>> Nate > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> -----Original Message----- > > >>>>>>>>>>>>>> From: Nathanael Smith [mailto:[email protected]] > > >>>>>>>>>>>>>> Sent: Thursday, April 13, 2017 4:01 PM > > >>>>>>>>>>>>>> To: [email protected]; > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> [email protected]; > > >>>>>>>>>>>> > > >>>>>>>>>>>>> [email protected] > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> Subject: [Discuss] - Future plans for Spot-ingest > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> How would you like to see Spot-ingest change? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> A. continue development on the Python Master/Worker with > > >>> focus > > >>>>>>>>>>>>>> on > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> performance / error handling / logging B. Develop Scala > > >> based > > >>>>>>>>>>>>> > > >>>>>>>>>>>> ingest to > > >>>>>>>>>>>> be > > >>>>>>>>>>>> > > >>>>>>>>>>>>> inline with code base from ingest, ml, to OA (UI to > continue > > >>>>>>>>>>>>> being > > >>>>>>>>>>>>> ipython/JS) C. 
Python ingest Worker with Scala-based Spark code > > >>>>>>>>>>>>> for normalization and input into the DB. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> Including the high-level diagram: > > >>>>>>>>>>>>>> [The ASCII diagram was garbled by the quoting; an image version is at http://imgur.com/a/DUDsD. What is recoverable: a Master (A: Python, B: Scala, C: Python) dispatches to Workers (A: Python; B and C: Scala) running on Spark Streaming worker nodes in the Hadoop cluster; the Workers write binary/text log files to the local FS and Parquet to HDFS, which is queried via Hive / Impala.] > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Please let me know your thoughts, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> - Nathanael > > > > > > -- > > > Michael Ridley <[email protected]> > > > 
office: (650) 352-1337 > > > mobile: (571) 438-2420 > > > Senior Solutions Architect > > > Cloudera, Inc. > >
