My team is building a batch data processing pipeline using the Spark API and is trying to understand whether Spark SQL can help us. Below is what we have found so far:
- SQL's declarative style may be more readable in some cases (e.g. joins of more than two RDDs), although some devs prefer the fluent style regardless (see the first sketch below for a side-by-side comparison).
- The RDD API's cogroup tops out at four RDDs, and it is not clear whether Spark SQL supports joining an arbitrary number of them.
- It seems that Spark SQL features such as predicate-pushdown optimization and dynamic schema inference are less applicable in a batch environment (the second sketch below shows the pushdown case).

Your inputs/suggestions are most welcome!

Thanks,
Vu Ha
CTO, Semantic Scholar
http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work
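
P.S. To make the join comparison concrete, here is a rough sketch of a three-way join written both ways: chained binary joins on keyed RDDs versus a single SQL statement. The Paper/Author/Authorship record types and the data are made up for illustration, and the sketch assumes a Spark 1.3-era SQLContext.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record types, just for the sketch.
case class Paper(paperId: Long, title: String)
case class Author(authorId: Long, name: String)
case class Authorship(paperId: Long, authorId: Long)

object ThreeWayJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("ThreeWayJoinSketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val papers      = sc.parallelize(Seq(Paper(1L, "Spark SQL"), Paper(2L, "RDDs")))
    val authors     = sc.parallelize(Seq(Author(10L, "Ada"), Author(11L, "Bob")))
    val authorships = sc.parallelize(Seq(Authorship(1L, 10L), Authorship(2L, 11L)))

    // Fluent style: two chained binary joins, re-keying between them.
    val fluent = authorships.map(s => (s.paperId, s.authorId))
      .join(papers.map(p => (p.paperId, p.title)))      // (paperId, (authorId, title))
      .map { case (_, (authorId, title)) => (authorId, title) }
      .join(authors.map(a => (a.authorId, a.name)))     // (authorId, (title, name))
      .map { case (_, (title, name)) => (name, title) }

    // Declarative style: the same three-way join as one SQL statement.
    papers.toDF().registerTempTable("papers")
    authors.toDF().registerTempTable("authors")
    authorships.toDF().registerTempTable("authorships")
    val declarative = sqlContext.sql(
      """SELECT a.name, p.title
        |FROM authorships s
        |JOIN papers  p ON s.paperId  = p.paperId
        |JOIN authors a ON s.authorId = a.authorId""".stripMargin)

    fluent.collect().foreach(println)
    declarative.show()
    sc.stop()
  }
}

Note how the SQL statement composes any number of joins in one place, while the fluent version needs a re-keying map between every pair of joins.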
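
And on the pushdown point: here is a minimal sketch of what predicate pushdown looks like in a plain batch read, assuming Spark 1.4+ and Parquet input (the /tmp path and column names are made up). The filter is pushed into the Parquet scan rather than applied after a full read; whether that matters for our workloads is part of what we are trying to understand.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PushdownSketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Write a tiny Parquet file so the sketch is self-contained
    // (the path is a scratch location, not a real dataset of ours).
    val path = "/tmp/papers.parquet"
    sc.parallelize(Seq((1L, "Spark SQL", 2014), (2L, "RDDs", 2012)))
      .toDF("paperId", "title", "year")
      .write.mode("overwrite").parquet(path)

    // Even in a batch job, the year filter is pushed into the Parquet
    // scan, so row groups that cannot match can be skipped at read time.
    val recent = sqlContext.read.parquet(path)
      .filter($"year" >= 2014)
      .select("title", "year")

    recent.explain(true) // look for the pushed filter in the physical plan
                         // (exact output varies by Spark version)
    recent.show()
    sc.stop()
  }
}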