If your relational database has enough computing power, you don't have to
change it. You can just run SQL queries on top of it, or even run Spark
queries over it. There is no hard-and-fast rule about using big data tools.
Usually people or organizations don't jump into big data for one specific
use case; it is a journey that involves multiple use cases, future growth,
and a lot more. If your data already fits in a relational store and you can
use the existing SQL analytics and BI tools, why consider other options,
unless you want to learn something new, or the data will grow over time and
you want to future-proof it?

On Sun, Mar 6, 2016 at 12:59 PM, Krishna Sankar <ksanka...@gmail.com> wrote:

> Good question. It comes down to computational complexity, computational
> scale and data volume.
>
>    1. If you can store the data on a single server or a small cluster of
>    db servers (say MySQL), then HDFS/Spark might be an overkill
>    2. If you can run the computation / process the data on a single machine
>    (remember, servers with 512 GB of memory and quad-core CPUs can do a lot
>    of stuff), then Spark is an overkill
>    3. Even if you have to run computations #1 & #2 above as a pipeline, as
>    long as you can tolerate the elapsed time, Spark might be an overkill
>    4. But if you require data/computation parallelism or distributed
>    processing of data due to computational complexity, data volume or time
>    constraints (incl. real-time analytics), Spark is the right stack.
>    5. Taking a quick look at what you have described so far, Spark is
>    probably not needed.
>
> Cheers & HTH
> <k/>
>
> On Sun, Mar 6, 2016 at 9:17 AM, Laumegui Deaulobi <
> guillaume.bilod...@gmail.com> wrote:
>
>> Our problem space is survey analytics.  Each survey comprises a set of
>> questions, with each question having a set of possible answers.  Survey
>> fill-out tasks are sent to users, who have until a certain date to complete
>> them.  Based on these survey fill-outs, reports need to be generated.  Each
>> report deals with a subset of the survey fill-outs, and comprises a set of
>> data points (average rating for question 1, min/max for question 2, etc.)
>>
>> We are dealing with rather large data sets - although reading the internet
>> we get the impression that everyone is analyzing petabytes of data...
>>
>> Users: up to 100,000
>> Surveys: up to 100,000
>> Questions per survey: up to 100
>> Possible answers per question: up to 10
>> Survey fill-outs / user: up to 10
>> Reports: up to 100,000
>> Data points per report: up to 100
>>
>> Data is currently stored in a relational database but a migration to a
>> different kind of store is possible.
>>
>> The naive algorithm for report generation can be summed up as this:
>>
>> for each report to be generated {
>>   for each report data point to be calculated {
>>     calculate data point
>>     add data point to report
>>   }
>>   publish report
>> }
>>
>> In order to deal with the upper limits of these values, we will need to
>> distribute this algorithm to a compute / data cluster as much as possible.
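>>
>> To make the intent concrete, here is a rough sketch of how we imagine the
>> distributed version could look in Spark - we have not actually written
>> this, and the types, stub functions and partition count below are made up:
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>>
>> object ReportGenerator {
>>   // Placeholder domain types; the real ones come from our model.
>>   case class DataPointSpec(reportId: Long, name: String)
>>   case class DataPoint(name: String, value: Double)
>>
>>   // Stubs standing in for real data access, calculation and publishing.
>>   def loadReportIds(): Seq[Long] = 1L to 100000L
>>   def loadDataPointSpecs(reportId: Long): Seq[DataPointSpec] =
>>     (1 to 100).map(i => DataPointSpec(reportId, s"dp$i"))
>>   def calculateDataPoint(spec: DataPointSpec): DataPoint =
>>     DataPoint(spec.name, 0.0) // e.g. average rating for question 1
>>   def publishReport(reportId: Long, points: Seq[DataPoint]): Unit = ()
>>
>>   def main(args: Array[String]): Unit = {
>>     val sc = new SparkContext(new SparkConf().setAppName("reports"))
>>
>>     // Reports are independent of each other, so the outer loop of the
>>     // naive algorithm parallelizes naturally: spread report ids across
>>     // the cluster and run the inner loop on the workers.
>>     sc.parallelize(loadReportIds(), numSlices = 200).foreach { reportId =>
>>       val points = loadDataPointSpecs(reportId).map(calculateDataPoint)
>>       publishReport(reportId, points)
>>     }
>>
>>     sc.stop()
>>   }
>> }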
>>
>> I've read about frameworks such as Apache Spark but also Hadoop, GridGain,
>> HazelCast and several others, and am still confused as to how each of these
>> can help us and how they fit together.
>>
>> Is Spark the right framework for us?
>>
>>
>>
>>
>
