Hi,

Spark is just tooling, and arguably not even that; you can think of Spark as
a distributed operating system, much like YARN. You should read books like
Hadoop Application Architectures, Big Data (Nathan Marz) and the related
disciplines before considering how the solution should be built.

Most big data projects (like most other BI projects) fail to deliver value,
or become extremely expensive to maintain, because they start from the
assumption that the tools themselves will solve the problem.


Regards,
Gourav Sengupta

On Sun, Mar 6, 2016 at 5:25 PM, Guillaume Bilodeau <
guillaume.bilod...@gmail.com> wrote:

> The data is currently stored in a relational database, but a migration to
> a document-oriented database such as MongoDB is something we are definitely
> considering.  How does this factor in?
>
> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> That depends on a lot of things, but as a starting point I would ask
>> whether you are planning to store your data in JSON format.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <
>> guillaume.bilod...@gmail.com> wrote:
>>
>>> Our problem space is survey analytics.  Each survey comprises a set of
>>> questions, with each question having a set of possible answers.  Survey
>>> fill-out tasks are sent to users, who have until a certain date to
>>> complete them.  Based on these fill-outs, reports need to be generated.
>>> Each report deals with a subset of the survey fill-outs and comprises a
>>> set of data points (average rating for question 1, min/max for question
>>> 2, etc.).
>>>
>>> We are dealing with rather large data sets - although from reading the
>>> internet we get the impression that everyone else is analyzing petabytes
>>> of data...
>>>
>>> Users: up to 100,000
>>> Surveys: up to 100,000
>>> Questions per survey: up to 100
>>> Possible answers per question: up to 10
>>> Survey fill-outs / user: up to 10
>>> Reports: up to 100,000
>>> Data points per report: up to 100
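>>>
>>> Back-of-the-envelope, those upper bounds work out to roughly 100,000
>>> users x 10 fill-outs x 100 questions = 100 million individual answers,
>>> and 100,000 reports x 100 data points = 10 million data points to
>>> compute - large, but far from petabytes.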
>>>
>>> Data is currently stored in a relational database but a migration to a
>>> different kind of store is possible.
>>>
>>> The naive algorithm for report generation can be summed up as this:
>>>
>>> for each report to be generated {
>>>   for each report data point to be calculated {
>>>     calculate data point
>>>     add data point to report
>>>   }
>>>   publish report
>>> }
>>>
>>> In order to deal with the upper limits of these values, we will need to
>>> distribute this algorithm across a compute/data cluster as much as
>>> possible.
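>>>
>>> For illustration, here is a minimal sketch of how the outer loop might be
>>> parallelized with Spark's Scala API. The domain types, calculateDataPoint,
>>> publishReport and the report specs are hypothetical placeholders for our
>>> own logic, not working code for our domain:
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> // Serializable so the closure below can ship these helpers to executors.
>>> object ReportJob extends Serializable {
>>>   // Hypothetical stand-ins for our domain model.
>>>   case class DataPointSpec(id: Long)
>>>   case class ReportSpec(id: Long, dataPointSpecs: Seq[DataPointSpec])
>>>
>>>   // Placeholder: the real version would aggregate survey answers.
>>>   def calculateDataPoint(spec: DataPointSpec): Double = 0.0
>>>
>>>   // Placeholder: the real version would persist or publish the report.
>>>   def publishReport(report: ReportSpec, points: Seq[Double]): Unit =
>>>     println(s"report ${report.id}: ${points.size} data points")
>>>
>>>   def main(args: Array[String]): Unit = {
>>>     val conf = new SparkConf()
>>>       .setAppName("report-generation")
>>>       .setMaster("local[*]")  // stand-in; a real job targets the cluster
>>>     val sc = new SparkContext(conf)
>>>
>>>     val reportSpecs: Seq[ReportSpec] = Seq()  // loaded from our database
>>>
>>>     // Each report is independent, so distributing the outer loop lets
>>>     // the cluster generate many reports in parallel.
>>>     sc.parallelize(reportSpecs).foreach { report =>
>>>       val points = report.dataPointSpecs.map(calculateDataPoint)
>>>       publishReport(report, points)
>>>     }
>>>
>>>     sc.stop()
>>>   }
>>> }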
>>>
>>> I've read about frameworks such as Apache Spark, as well as Hadoop,
>>> GridGain, Hazelcast and several others, and I am still confused as to how
>>> each of these could help us and how they fit together.
>>>
>>> Is Spark the right framework for us?
>>>
>>>
>>>
>>>
>>
>
