Re: Is Spark right for us?

2016-03-07 Thread Laumegui Deaulobi
Thanks for your input.  That 1 hour per data point could actually be a
problem, since some reports have hundreds of data points and we need to
generate 100,000 reports.  So we definitely need to distribute this work, but
unfortunately I don't know where to start.
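For what it's worth, here is a minimal sketch of one place to start, assuming
each report can be generated independently given its id.  Everything below is
invented for illustration; the body of the foreach stands in for the existing
per-report logic:

import org.apache.spark.{SparkConf, SparkContext}

object GenerateReports {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("generate-reports"))

    // Hypothetical: the ids of the 100,000 reports to produce.
    val reportIds = 1L to 100000L

    // Spread report generation across the cluster; each task handles one
    // report end to end.  200 slices is an arbitrary starting point.
    sc.parallelize(reportIds, numSlices = 200).foreach { reportId =>
      // Stand-in for the real work: fetch this report's fill-outs,
      // compute each data point, publish the finished report.
      println(s"would generate report $reportId")
    }

    sc.stop()
  }
}

The harder question, raised elsewhere in this thread, is where each task
reads its data from, since the source has to tolerate many concurrent
readers.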

On Mon, Mar 7, 2016 at 2:42 PM, Anurag [via Apache Spark User List] <
ml-node+s1001560n26421...@n3.nabble.com> wrote:

> Definition: each answer by a user is an event (I suppose).
>
> Let's estimate the number of events that can happen in a day in your case.
>
> 1. Max of Survey fill-outs / user = 10 = x
> 2. Max of Questions per survey = 100 = y
> 3. Max of users = 100,000 = z
>
>
> Maximum answers received in a day = x * y * z = 100,000,000 = 100 million
>
> Assuming you use a single c3.2xlarge machine,
> each data point in the report will get calculated in less than 1 hour
> (speaking from personal experience).
>
> I guess that would help.
>
> Regards
> Anurag
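As a quick sanity check on the data volume behind that estimate (assuming
roughly 50 bytes per stored answer, which is a guess rather than a number
from this thread):

val fillOutsPerUser    = 10L
val questionsPerSurvey = 100L
val users              = 100000L

val answersPerDay = fillOutsPerUser * questionsPerSurvey * users
// answersPerDay = 100,000,000, matching the estimate above

val assumedBytesPerAnswer = 50L   // assumption, not a measured figure
val gbPerDay = answersPerDay * assumedBytesPerAnswer / math.pow(1024, 3)
// roughly 4.7 GB of raw answer data per day, worst case

println(f"$gbPerDay%.1f GB/day")

If that assumption is anywhere near right, the worst-case raw volume is a few
gigabytes per day, so the real challenge is the amount of computation
(reports times data points), not the raw data volume.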






Re: Is Spark right for us?

2016-03-07 Thread Guillaume Bilodeau
Hi everyone,

First, thanks for taking some time on your Sunday to reply.  Some points, in
no particular order:

- The feedback from everyone tells me that I have a lot of reading to do
first.  Thanks for all the pointers.
- The data is currently stored in a row-oriented database (SQL Server 2012,
to be precise), but as I said we're open to moving it to a different kind of
data store (column-oriented, document-oriented, etc.)
- I don't have precise numbers for the size of the database, but I would
guess the larger ones hold around 100 GB of data.  To us, this is huge;
obviously, for companies such as Google, it's a second's worth of data.
- For this particular issue, we're talking about ordinal data, not free-text
fields.
- I agree that Spark is tooling, but I also see it as the implementation of a
specific design, namely distributed computing over a distributed data store,
if I understand correctly.
- For sure, we would like to avoid introducing a new technology to the mix,
so reusing the current infrastructure in a more optimal way would be our
first choice.
- Our main issue is that we'd like to scale by distributing work instead of
adding more memory to this single database.  The current computations are
done with SQL queries, and the data set does not fit in memory.  So yes, we
could distribute query construction and result aggregation, but the database
would still be the bottleneck.  That's why I'm wondering whether we should
investigate technologies such as Spark or Hadoop; then again, maybe I'm
completely mistaken and we can leverage our current infrastructure (see the
sketch below).
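For the record, a minimal sketch of what "Spark over the existing database"
could look like, using a partitioned JDBC read so the load on the database is
spread over several smaller range queries rather than one big scan.  This
assumes Spark 1.x (SQLContext), a numeric key to partition on, and invented
table, column and connection names; the SQL Server JDBC driver would also
need to be on the classpath:

import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("survey-reports"))
val sqlContext = new SQLContext(sc)

val props = new Properties()
props.setProperty("user", "report_user")   // hypothetical credentials
props.setProperty("password", "secret")

// 16 partitions, each issuing its own slice of the key range as a smaller
// query, instead of one query scanning the whole table.
val answers = sqlContext.read.jdbc(
  url = "jdbc:sqlserver://dbhost:1433;databaseName=surveys",  // made up
  table = "survey_answers",                                   // made up
  columnName = "answer_id",
  lowerBound = 0L,
  upperBound = 100000000L,
  numPartitions = 16,
  connectionProperties = props)

The database then only serves the initial extract; the per-report aggregation
work happens on the Spark cluster.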

Thanks,
GB




Re: Is Spark right for us?

2016-03-07 Thread Jörn Franke
I think the relational database will be faster for ordinal data (e.g. where
the answer is on a scale from 1 to x). For free-text fields I would recommend
Solr or Elasticsearch, because they have a lot more text-analytics
capabilities that do not exist in a relational database or MongoDB and are
not likely to be there in the near future.

> On 06 Mar 2016, at 18:25, Guillaume Bilodeau <guillaume.bilod...@gmail.com> 
> wrote:
> 
> The data is currently stored in a relational database, but a migration to a 
> document-oriented database such as MongoDB is something we are definitely 
> considering.  How does this factor in?


Re: Is Spark right for us?

2016-03-06 Thread Chris Miller
Gut instinct is no, Spark is overkill for your needs... you should be able
to accomplish all of that with a relational database or a column-oriented
database (depending on the types of queries you most frequently run and the
performance requirements).

--
Chris Miller



Re: Is Spark right for us?

2016-03-06 Thread Peyman Mohajerian
If your relational database has enough computing power, you don't have to
change it. You can just run SQL queries on top of it, or even run Spark
queries over it (see the sketch below). There is no hard-and-fast rule about
using big data tools. Usually people or organizations don't jump into big
data for one specific use case; it is a journey that involves multiple use
cases, future growth and a lot more. If your data already fits in a
relational store and you can use the existing SQL analytics and BI tools, why
consider other options, unless you want to learn something new, or the data
will grow over time and you want to future-proof it.
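To illustrate "run Spark queries over it", a sketch in Spark 1.x style. This
assumes an answers DataFrame has already been loaded from the database over
JDBC (as sketched earlier in this thread) and that a sqlContext is in scope;
all column names are invented:

// Expose the DataFrame to Spark SQL, then compute the report data points
// on the cluster instead of in the database.
answers.registerTempTable("answers")

val dataPoints = sqlContext.sql(
  """SELECT survey_id, question_id,
    |       AVG(rating) AS avg_rating,
    |       MIN(rating) AS min_rating,
    |       MAX(rating) AS max_rating
    |FROM answers
    |GROUP BY survey_id, question_id""".stripMargin)

dataPoints.show()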



Re: Is Spark right for us?

2016-03-06 Thread Krishna Sankar
Good question. It comes down to computational complexity, computational scale
and data volume.

   1. If you can store the data in a single server or a small cluster of
   database servers (say, MySQL), then HDFS/Spark might be overkill.
   2. If you can run the computation / process the data on a single machine
   (remember, servers with 512 GB of memory and quad-core CPUs can do a lot
   of stuff), then Spark is overkill.
   3. Even if #1 and #2 only hold when you run the computations as a
   pipeline, if you can tolerate the elapsed time, Spark might still be
   overkill.
   4. But if you require data/computation parallelism or distributed
   processing of the data, due to computational complexity, data volume or
   time constraints (including real-time analytics), Spark is the right
   stack.
   5. Taking a quick look at what you have described so far, Spark is
   probably not needed.

Cheers & HTH








Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
Hi,

Spark is just tooling; arguably, it is not even tooling. You can consider
Spark a distributed operating system, like YARN. You should read books like
Hadoop Application Architectures and Big Data (Nathan Marz), and study the
surrounding disciplines, before starting to consider how the solution should
be built.

Most big data projects (like any other BI projects) do not deliver value, or
turn out extremely expensive to maintain, because of the assumption that
tools alone solve the problem.


Regards,
Gourav Sengupta



Re: Is Spark right for us?

2016-03-06 Thread Gourav Sengupta
Hi,

That depends on a lot of things, but as a starting point I would ask whether
you are planning to store your data in JSON format.


Regards,
Gourav Sengupta



Is Spark right for us?

2016-03-06 Thread Laumegui Deaulobi
Our problem space is survey analytics.  Each survey comprises a set of
questions, with each question having a set of possible answers.  Survey
fill-out tasks are sent to users, who have until a certain date to complete
them.  Based on these survey fill-outs, reports need to be generated.  Each
report deals with a subset of the survey fill-outs and comprises a set of
data points (average rating for question 1, min/max for question 2, etc.).

We are dealing with rather large data sets - although reading the internet
we get the impression that everyone is analyzing petabytes of data...

Users: up to 100,000
Surveys: up to 100,000
Questions per survey: up to 100
Possible answers per question: up to 10
Survey fill-outs / user: up to 10
Reports: up to 100,000
Data points per report: up to 100

Data is currently stored in a relational database but a migration to a
different kind of store is possible.

The naive algorithm for report generation can be summed up as this:

for each report to be generated {
  for each report data point to be calculated {
calculate data point
add data point to report
  }
  publish report
}

In order to deal with the upper limits of these values, we will need to
distribute this algorithm to a compute / data cluster as much as possible.
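As a sketch of what that distribution could look like (spark-shell style,
where sc already exists; every type and function below is an invented
stand-in for the real logic), the two nested loops flatten into one pool of
(report, data point) tasks that are computed in parallel and regrouped per
report before publishing:

case class DataPointSpec(name: String)                        // stand-in
case class ReportSpec(id: Long, dataPoints: Seq[DataPointSpec])

def calculate(dp: DataPointSpec): Double = 0.0                // real computation goes here
def publish(id: Long, pts: Iterable[(String, Double)]): Unit =
  println(s"report $id: ${pts.size} data points")

val reportsToGenerate: Seq[ReportSpec] = Seq.empty            // would come from the data store

sc.parallelize(reportsToGenerate)
  .flatMap(r => r.dataPoints.map(dp => (r.id, dp)))           // one task per data point
  .map { case (id, dp) => (id, (dp.name, calculate(dp))) }    // compute each point
  .groupByKey()                                               // regroup per report
  .foreach { case (id, pts) => publish(id, pts) }

groupByKey is acceptable here because each report has at most about 100 data
points, so no single group gets large.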

I've read about frameworks such as Apache Spark but also Hadoop, GridGain,
Hazelcast and several others, and am still confused as to how each of these
can help us and how they fit together.

Is Spark the right framework for us?


