<code>
val enriched_web_logs = sqlContext.sql("""
  select
    web_logs.datetime,
    web_logs.node as app_host,
    source_ip,
    b.node as source_host,
    log
  from web_logs
  left outer join (select distinct node, address from nodes) b
    on source_ip = address
""")

enriched_web_logs.coalesce(1)
  .write
  .format("parquet")
  .mode("overwrite")
  .save(bucket + "derived/enriched_web_logs")

enriched_web_logs.registerTempTable("enriched_web_logs")
sqlContext.cacheTable("enriched_web_logs")
</code>
There are only 524 records in the resulting table, and I have explicitly attempted to coalesce into 1 partition. Yet the Storage tab of my Spark UI shows the cached table split across 200 (mostly empty) partitions:

RDD Name: In-memory table enriched_web_logs
Storage Level: Memory Deserialized 1x Replicated
Cached Partitions: 200
Fraction Cached: 100%
Size in Memory: 22.0 KB
Size in ExternalBlockStore: 0.0 B
Size on Disk: 0.0 B

Why would there be 200 partitions despite the coalesce call?
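For anyone trying to reproduce or diagnose this, a minimal sketch (Spark 2.x `SparkSession` API in local mode; the question above uses the 1.x `sqlContext` API, and the DataFrame names here are illustrative) that checks partition counts with `rdd.getNumPartitions`. It also demonstrates that `coalesce` returns a *new* DataFrame rather than modifying the original in place:

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for demonstration; not part of the original setup.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("coalesce-demo")
  .getOrCreate()
import spark.implicits._

// Mimic a small table spread over 200 partitions.
val df = (1 to 524).toDF("id").repartition(200)

// coalesce does not mutate df; it returns a new DataFrame.
val one = df.coalesce(1)

println(df.rdd.getNumPartitions)  // still 200
println(one.rdd.getNumPartitions) // 1
```

Inspecting `rdd.getNumPartitions` on the DataFrame you actually register and cache (as opposed to the value returned by `coalesce`) shows which partitioning the in-memory table will have.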