I discovered today that EMR provides its own optimizations for Spark
<https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-performance.html>.
Some of these optimizations are controlled by configuration settings with
names like `spark.sql.dynamicPartitionPruning.enabled` or
`spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled`. As far
as I can tell from the Apache Spark configuration docs
<http://spark.apache.org/docs/latest/configuration.html>, these are
EMR-specific configurations.
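
To make this concrete, here's a rough sketch of how these flags get set
(only the two property names are taken from the EMR doc above; everything
else is just illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the two property names below come from the EMR release guide
// linked above. Spark accepts arbitrary "spark.*" keys, so a stock Apache
// Spark build will carry them without complaint -- which is exactly why a
// later upstream config could silently collide with them.
val spark = SparkSession.builder()
  .appName("emr-optimizer-flags")
  .config("spark.sql.dynamicPartitionPruning.enabled", "true")
  .config("spark.sql.optimizer.flattenScalarSubqueriesWithAggregates.enabled", "true")
  .getOrCreate()
```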

Does this create a potential problem, since future Apache Spark
configuration settings may end up colliding with the names EMR has
selected?

Should we document some sort of third-party configuration namespace pattern
and encourage third parties to scope their custom configurations to it,
e.g. something like `spark.external.[vendor].[whatever]`?
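
To illustrate the idea, the property names below are purely hypothetical
and only show the shape a vendor-scoped namespace could take:

```scala
// Hypothetical, for illustration only -- no such properties exist today.
// A vendor-scoped prefix would keep third-party knobs out of the namespace
// that upstream Apache Spark evolves.
spark.conf.set("spark.external.emr.sql.dynamicPartitionPruning.enabled", "true")
spark.conf.set("spark.external.emr.sql.flattenScalarSubqueriesWithAggregates.enabled", "true")
```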

Nick
