My team is building a batch data processing pipeline using the Spark API, and
we are trying to understand whether Spark SQL can help us. Below is what we
have found so far:

- SQL's declarative style may be more readable in some cases (e.g. joining
more than two RDDs), although some devs prefer the fluent style regardless
(see the first sketch below).
- Cogrouping more than 4 RDDs is not supported (the cogroup overloads stop at
three additional RDDs), and it's not clear whether Spark SQL supports joining
an arbitrary number of tables (see the second sketch).
- Spark SQL features such as predicate-pushdown optimization and dynamic
schema inference seem less applicable in a batch environment (see the third
sketch).
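
To make the first point concrete, here is a minimal sketch of a three-way
join in both styles. It is untested and assumes Spark 1.x with sc as the
usual SparkContext (e.g. from spark-shell); the User/Order/Payment schemas
are made up for illustration, and the implicit RDD-to-SchemaRDD conversion
(createSchemaRDD) is the pre-1.3 API:

  import org.apache.spark.sql.SQLContext

  case class User(userId: Int, name: String)
  case class Order(orderId: Int, userId: Int)
  case class Payment(paymentId: Int, orderId: Int, amount: Double)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD (Spark 1.0-1.2)

  val users    = sc.parallelize(Seq(User(1, "ada"), User(2, "bob")))
  val orders   = sc.parallelize(Seq(Order(10, 1), Order(11, 2)))
  val payments = sc.parallelize(Seq(Payment(100, 10, 9.99)))

  // Fluent style: key each RDD by the join column, chain pairwise joins,
  // and re-key between steps.
  val fluent = users.map(u => (u.userId, u))
    .join(orders.map(o => (o.userId, o)))             // (userId, (User, Order))
    .map { case (_, (u, o)) => (o.orderId, (u, o)) }  // re-key by orderId
    .join(payments.map(p => (p.orderId, p)))          // (orderId, ((User, Order), Payment))
    .map { case (_, ((u, _), p)) => (u.name, p.amount) }

  // Declarative style: register the RDDs as tables and write one query.
  users.registerTempTable("users")
  orders.registerTempTable("orders")
  payments.registerTempTable("payments")

  val declarative = sqlContext.sql("""
    SELECT u.name, p.amount
    FROM users u
    JOIN orders o   ON u.userId = o.userId
    JOIN payments p ON o.orderId = p.orderId
  """)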
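
On the cogroup limit: the PairRDDFunctions.cogroup overloads take at most
three other RDDs (four in total), but cogroups can be chained at the cost of
one level of Iterable nesting. On the SQL side, an N-way join is just more
JOIN clauses in a single query, which the planner breaks into pairwise joins;
whether that stays practical for large N is part of what I'd like to hear
about. A rough, untested sketch (reusing sc and sqlContext from above; t1..t5
are hypothetical registered tables):

  val a = sc.parallelize(Seq((1, "a")))
  val b = sc.parallelize(Seq((1, "b")))
  val c = sc.parallelize(Seq((1, "c")))
  val d = sc.parallelize(Seq((1, "d")))
  val e = sc.parallelize(Seq((1, "e")))

  val four = a.cogroup(b, c, d)  // four RDDs: the largest overload
  // a.cogroup(b, c, d, e)       // won't compile: no five-way overload

  // Workaround: chain cogroups; the grouped values of the first step
  // become the (tuple-valued) values of the next.
  val five = a.cogroup(b, c, d).cogroup(e)

  // SQL: joining five tables is just a longer query string.
  val joined5 = sqlContext.sql("""
    SELECT *
    FROM t1
    JOIN t2 ON t1.k = t2.k
    JOIN t3 ON t1.k = t3.k
    JOIN t4 ON t1.k = t4.k
    JOIN t5 ON t1.k = t5.k
  """)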
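
For concreteness on the third point, this is roughly what predicate pushdown
would buy in a batch job: with a columnar input like Parquet, Spark SQL can
push a WHERE clause into the scan rather than filtering after a full read.
The path and columns below are hypothetical:

  // Hypothetical Parquet-backed table; path and schema are made up.
  val events = sqlContext.parquetFile("hdfs:///warehouse/events")
  events.registerTempTable("events")

  // The day predicate can be pushed into the Parquet scan so that
  // non-matching row groups are skipped instead of read and filtered.
  val dayOne = sqlContext.sql(
    "SELECT userId FROM events WHERE day = '2015-03-01'")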

Your inputs/suggestions are most welcome!

Thanks,
Vu Ha
CTO, Semantic Scholar
http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work


