Good question. It comes down to computational complexity, computational scale
and data volume.

   1. If you can store the data on a single server or a small cluster of
   database servers (say, MySQL), then HDFS/Spark is probably overkill.
   2. If you can run the computation / process the data on a single machine
   (remember, servers with 512 GB of memory and quad-core CPUs can do a lot
   of work), then Spark is overkill.
   3. Even if the computations in #1 and #2 above have to run as a sequential
   pipeline, Spark might still be overkill, as long as you can tolerate the
   elapsed time.
   4. But if you require data/computation parallelism or distributed
   processing because of computational complexity, data volume, or time
   constraints (including real-time analytics), then Spark is the right
   stack (see the sketch after this list).
   5. Taking a quick look at what you have described so far, Spark is
   probably not needed: roughly 100,000 users × 10 fill-outs × 100 questions
   is on the order of 100 million answer rows, which a single well-provisioned
   database server can normally handle.
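
To make #4 concrete, here is a minimal sketch of what the distributed version
of your report-generation loop below could look like with Spark SQL in Scala.
The input path, schema and column names (surveyId, questionId, rating) are
assumptions made up for the example, not something from your description:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max, min}

object SurveyReportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("survey-report-sketch")
      .getOrCreate()

    // Hypothetical input: one row per answered question per fill-out,
    // with assumed columns surveyId, questionId and rating.
    val answers = spark.read.parquet("hdfs:///surveys/answers.parquet")

    // Each report data point (average for question 1, min/max for
    // question 2, ...) becomes part of one distributed aggregation
    // instead of a per-report loop.
    val dataPoints = answers
      .groupBy("surveyId", "questionId")
      .agg(
        avg("rating").as("avgRating"),
        min("rating").as("minRating"),
        max("rating").as("maxRating"))

    dataPoints.write.mode("overwrite").parquet("hdfs:///surveys/data_points")

    spark.stop()
  }
}

The point of the sketch is that the inner "for each data point" loop turns
into a single group-by over the whole answer set, which Spark partitions
across the cluster. Whether that buys you anything depends on whether the
data and computation actually exceed one machine, which is what #1-#3 are
about.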

Cheers & HTH
<k/>

On Sun, Mar 6, 2016 at 9:17 AM, Laumegui Deaulobi <
guillaume.bilod...@gmail.com> wrote:

> Our problem space is survey analytics.  Each survey comprises a set of
> questions, with each question having a set of possible answers.  Survey
> fill-out tasks are sent to users, who have until a certain date to complete
> them.  Based on these survey fill-outs, reports need to be generated.  Each
> report deals with a subset of the survey fill-outs, and comprises a set of
> data points (average rating for question 1, min/max for question 2, etc.)
>
> We are dealing with rather large data sets - although reading the internet
> we get the impression that everyone is analyzing petabytes of data...
>
> Users: up to 100,000
> Surveys: up to 100,000
> Questions per survey: up to 100
> Possible answers per question: up to 10
> Survey fill-outs / user: up to 10
> Reports: up to 100,000
> Data points per report: up to 100
>
> Data is currently stored in a relational database but a migration to a
> different kind of store is possible.
>
> The naive algorithm for report generation can be summed up as this:
>
> for each report to be generated {
>   for each report data point to be calculated {
>     calculate data point
>     add data point to report
>   }
>   publish report
> }
>
> In order to deal with the upper limits of these values, we will need to
> distribute this algorithm to a compute / data cluster as much as possible.
>
> I've read about frameworks such as Apache Spark but also Hadoop, GridGain,
> Hazelcast and several others, and am still confused as to how each of these
> can help us and how they fit together.
>
> Is Spark the right framework for us?
>