Assaf Mendelson created SPARK-17334:
---------------------------------------
Summary: Provide management tools for broadcasted variables
Key: SPARK-17334
URL: https://issues.apache.org/jira/browse/SPARK-17334
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: Assaf Mendelson
Priority: Minor
I propose adding some management tools for broadcast variables.
The main issue today is that a broadcast variable is only accessible through the
reference returned at creation time, which must be saved and passed around; there
is no way to check whether it has already been unpersisted, or to see where it
takes up memory and how much.
Consider the following:
Today we can create a broadcast variable, use it, and later destroy it, but only
by holding on to the reference.
Consider the example from the documentation:
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
The problem is that b needs to be saved and passed along.
Instead I would like to see something like:
>>> sc.broadcast("b",[1, 2, 3, 4, 5])
>>> sc.getBroadcasted()
["a", "b", "c"]
>>> sc.getBroadcastInfo("b")
{"mem[bytes]":10, "type": List, "materializedExecutors" : [1,2,3,6,7]}
>>> b = sc.getBroadcastRef("b")
>>> print b.value
[1, 2, 3, 4, 5]
>>> sc.unpersist("b")
Maybe also add a per-executor map showing which broadcasts each executor currently holds.