Re: Is Spark right for us?

2016-03-07 Thread Laumegui Deaulobi
Thanks for your input.  That 1 hour per data point would actually be a
problem, since some reports have hundreds of data points and we need to
generate up to 100,000 reports - on the order of 10 million data-point
calculations in total.  So we definitely need to distribute this, but
unfortunately I don't know where to start.

On Mon, Mar 7, 2016 at 2:42 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26421...@n3.nabble.com> wrote:

> Definition - each answer by a user is an event (I suppose)
>
> Let's estimate the number of events that can happen in a day in your case.
>
> 1. Max survey fill-outs per user = 10 = x
> 2. Max questions per survey = 100 = y
> 3. Max users = 100,000 = z
>
>
> Maximum answers received in a day = x * y * z = 100,000,000 = 100 million
>
> Assuming you use a single c3.2xlarge machine, each data point in the
> report should take less than 1 hour to calculate (speaking from
> personal experience)
>
> I hope that helps.
>
> Regards
> Anurag
>
>





Is Spark right for us?

2016-03-06 Thread Laumegui Deaulobi
Our problem space is survey analytics.  Each survey comprises a set of
questions, and each question has a set of possible answers.  Survey
fill-out tasks are sent to users, who have until a certain date to complete
them.  Based on these survey fill-outs, reports need to be generated.  Each
report deals with a subset of the survey fill-outs, and comprises a set of
data points (average rating for question 1, min/max for question 2, etc.).
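
For concreteness, here is one way that domain could be modeled in Scala;
all names here are illustrative guesses, not your actual schema:

case class Question(id: Long, possibleAnswers: Seq[String])
case class Survey(id: Long, questions: Seq[Question])
// One user's completed survey: question id -> chosen answer
case class FillOut(surveyId: Long, userId: Long, answers: Map[Long, String])
case class DataPoint(name: String, value: Double)  // e.g. avg rating for q1
case class Report(id: Long, dataPoints: Seq[DataPoint])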

We are dealing with rather large data sets, although reading around the
internet gives the impression that everyone else is analyzing petabytes of
data...

Users: up to 100,000
Surveys: up to 100,000
Questions per survey: up to 100
Possible answers per question: up to 10
Survey fill-outs / user: up to 10
Reports: up to 100,000
Data points per report: up to 100
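
A back-of-envelope calculation from those upper bounds suggests the raw
data is not actually that big (the per-row byte size is an assumption):

// Upper bounds from the list above
val totalAnswers = 100000L * 10L * 100L  // users * fill-outs/user * questions = 100 million
val approxBytes = totalAnswers * 50L     // assuming ~50 bytes per answer row
// ~5 GB of raw answers -- modest for one large machine, small for a cluster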

Data is currently stored in a relational database but a migration to a
different kind of store is possible.
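
If the data stays in the relational database, Spark can read it directly
over JDBC with the DataFrame API.  A minimal sketch, assuming a PostgreSQL
source and an "answers" table (the URL, table name and credentials below
are all made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SurveyReports").getOrCreate()

// Hypothetical connection details -- substitute your own.
val answers = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/surveys")
  .option("dbtable", "answers")
  .option("user", "report_user")
  .option("password", sys.env("DB_PASSWORD"))
  .load()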

The naive algorithm for report generation can be summed up as follows:

for each report to be generated {
  for each report data point to be calculated {
    calculate data point
    add data point to report
  }
  publish report
}

To deal with the upper limits of these values, we will need to distribute
this algorithm across a compute / data cluster as much as possible.
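
For a sense of what distributing that loop could look like in a framework
like Spark: rather than iterating report by report, each data point can be
expressed as an aggregation over the answers, and a single job computes all
data points for all reports in one pass.  Everything below (column names,
the toy data) is an assumption purely for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// local[*] is just for trying the sketch on one machine
val spark = SparkSession.builder()
  .appName("ReportSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the answers table: (report_id, question_id, rating)
val answers = Seq(
  (1L, 1L, 4), (1L, 1L, 5), (1L, 2L, 2),
  (2L, 1L, 3), (2L, 2L, 5)
).toDF("report_id", "question_id", "rating")

// One distributed aggregation computes every data point for every report,
// instead of a sequential per-report, per-data-point loop.
val dataPoints = answers
  .groupBy($"report_id", $"question_id")
  .agg(
    avg($"rating").as("avg_rating"),
    min($"rating").as("min_rating"),
    max($"rating").as("max_rating")
  )

dataPoints.show()

The point is less the specific code than the shape: when the per-data-point
work is an aggregation, the cluster parallelizes it, and 10 million data
points stop being 10 million sequential computations.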

I've read about frameworks such as Apache Spark, but also Hadoop, GridGain,
Hazelcast and several others, and am still confused as to how each of them
could help us and how they fit together.

Is Spark the right framework for us?




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org