Github user staslos commented on the pull request:
https://github.com/apache/spark/pull/4434#issuecomment-103645571
We've been using both Spark Core and Spark SQL for over 6 months now.
We're certainly not experts here, but we've found that Spark Core better suits
our data pipeline (as a Pig replacement), while Spark SQL is more of an
analytical tool. When it comes to moving and transforming data, we prefer Spark
Core to Spark SQL's 'magic' because Spark Core is more stable, gives us more
control over the process, and gives us more confidence.
Also, the last time I checked Spark SQL, I couldn't achieve proper Avro
schema evolution, which is absolutely critical for our data pipeline since it
deals with different versions of the same data. The ability to provide both a
reader and a writer schema is priceless, and I couldn't find a way to do this in
Spark SQL. Our data scientists have to use projection in Spark SQL to be able to
read across different versions of the data. Luckily for them, they don't need to
use all the fields and pass them down the pipeline.
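For context, here is a rough sketch of what I mean by supplying an explicit
reader schema with Spark Core and the plain Avro MapReduce input format. The
path and the schema are made up for illustration; the point is only that the
reader schema is attached to the Hadoop job configuration, so Avro resolves
every file against it regardless of the writer schema it was written with:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object AvroReaderSchemaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-reader-schema"))

    // Hypothetical reader schema; in practice it would come from a file or a
    // schema registry. Fields missing in older writer schemas need defaults.
    val readerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Event","fields":[
        |  {"name":"id","type":"long"},
        |  {"name":"source","type":"string","default":"unknown"}
        |]}""".stripMargin)

    // Attach the reader schema to the job configuration so Avro performs
    // schema resolution against it when decoding each record.
    val job = Job.getInstance()
    AvroJob.setInputKeySchema(job, readerSchema)

    val records = sc.newAPIHadoopFile(
      "s3n://some-bucket/events/*.avro",            // hypothetical input path
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable],
      job.getConfiguration)

    // Only the fields declared in the reader schema are visible downstream.
    records.map { case (key, _) => key.datum().get("id") }
      .take(10)
      .foreach(println)

    sc.stop()
  }
}
```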
Also, correct me if I'm wrong, but Spark SQL is not production ready yet. Our
latest upgrade from Spark 1.2.0 to 1.3.0 proved we were right to stick with
Spark Core, at least for now, while our data scientists were going mad because
their Spark SQL scripts stopped working with S3.
Anyway, thank you, guys, for doing a great job. Feel free to toss this
pull request; I was just thinking back in February that it could be useful for
other people facing the same problem.