Hi all,
We are using embedded Spark 1.6.2 in our analytics platform[1]. For
cluster communication we use Hazelcast's clustering capabilities. On the
Hazelcast side we set the following property to configure the
heartbeat behaviour:
hazelcast.max.no.heartbeat.seconds=30
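For reference, this property can be set declaratively in hazelcast.xml. A minimal sketch, assuming XML-based configuration; only the property name and value come from the message above, the surrounding elements are illustrative:

```xml
<!-- hazelcast.xml: sketch only; the enclosing config is illustrative -->
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
  <properties>
    <!-- declare a member dead if no heartbeat arrives within 30 seconds -->
    <property name="hazelcast.max.no.heartbeat.seconds">30</property>
  </properties>
</hazelcast>
```

The same property can also be passed as a JVM system property (-Dhazelcast.max.no.heartbeat.seconds=30), which overrides the XML value.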
Alcon,
You can most certainly do this. I’ve done benchmarking with Spark SQL and the
TPCDS queries using S3 as the filesystem.
Zeppelin and Livy Server work well for the dashboarding and concurrent-query
issues: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
It's actually quite simple to answer:
> 1. Is Spark SQL and UDF, able to handle all the workloads?
Yes
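For anyone following along: registering a plain function as a Spark SQL UDF against a 1.6-era SQLContext looks roughly like the sketch below. The function name "normalize" and the sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setAppName("udf-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Register an ordinary Scala function as a SQL UDF (hypothetical name/logic)
sqlContext.udf.register("normalize", (s: String) => s.trim.toLowerCase)

// Make some sample data queryable, then call the UDF from SQL
val df = sc.parallelize(Seq(" Foo ", "BAR")).toDF("raw")
df.registerTempTable("t")
sqlContext.sql("SELECT normalize(raw) AS cleaned FROM t").show()
```

The same UDF is then usable from JDBC clients if the table is exposed through the Thrift server.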
> 2. What user interface did you provide for data scientists, data engineers
and analysts?
Home-grown platform, EMR, Zeppelin
> What are the challenges in running concurrent queries, by many
Dear Ashish,
what you are asking for involves at least a few weeks of dedicated work to
understand your use case, and then at least 3 to 4 months to even propose a
solution. You could even build a fantastic data warehouse just using C++. It
all depends on many conditions. I just think
Hi, Ashish.
You are correct in saying that not *all* functionality of Spark is
spill-to-disk but I am not sure how this pertains to a "concurrent user
scenario". Each executor will run in its own JVM and is therefore isolated
from others. That is, if the JVM of one user dies, this should not
Thanks Jorn and Phillip. My question was specifically for anyone who has
tried building a system on Spark SQL as a data warehouse. I wanted to find
out whether someone has tried it and can share which kinds of workloads
worked and which were problematic.
Regarding spill to disk,
Agree with Jorn. The answer is: it depends.
In the past, I've worked with data scientists who are happy to use the
Spark CLI. Again, the answer is "it depends" (in this case, on the skills
of your customers).
Regarding sharing resources, different teams were limited to their own
queue so they
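For anyone setting this up: per-team limits like the queues mentioned above are typically YARN scheduler queues selected at submit time. A hedged spark-defaults.conf sketch; the queue name and sizes are illustrative assumptions, not recommendations:

```
# spark-defaults.conf sketch (values are made up for illustration)
spark.master              yarn
spark.yarn.queue          analytics    # hypothetical per-team YARN queue
spark.executor.instances  4
spark.executor.memory     4g
spark.executor.cores      2
```

Capping executors per queue is what keeps one team's heavy query from starving the others.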
What do you mean by "all possible workloads"?
You cannot prepare any system to do all possible processing.
We do not know the requirements of your data scientists now or in the future so
it is difficult to say. How do they work currently without the new solution? Do
they all work on the same data? I