Assaf Mendelson created SPARK-17334:
---------------------------------------

             Summary: Provide management tools for broadcasted variables
                 Key: SPARK-17334
                 URL: https://issues.apache.org/jira/browse/SPARK-17334
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
            Reporter: Assaf Mendelson
            Priority: Minor


I propose adding management tools for broadcast variables.
The main issues today are that a broadcast variable can only be tracked through
its reference (which must be saved and passed around), there is no easy way to
tell whether it has already been unpersisted, and there is no visibility into
where it consumes memory or how much.

Consider the following:

Today we can create a broadcast variable, use it, and destroy it later, but only
by keeping a reference to it.
Consider the example from the documentation:
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()

The problem is that b needs to be saved and passed along.

Instead I would like to see something like:

>>> sc.broadcast("b",[1, 2, 3, 4, 5])
>>> sc.getBroadcasted()
["a", "b", "c"]
>>> sc.getBroadcastInfo("b")
{"mem[bytes]":10, "type": List, "materializedExecutors" : [1,2,3,6,7]}
>>> b = sc.getBroadcastRef("b")
>>> print b.value
[1, 2, 3, 4, 5]
>>> sc.unpersist("b")

It might also be useful to add a per-executor map showing which broadcast variables each executor currently holds.
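As a rough illustration of the driver-side part of this proposal, the name-based lookup could be sketched as a thin registry wrapped around SparkContext.broadcast(). This is a hypothetical sketch only: the class name BroadcastRegistry and its methods are illustrative and not an existing Spark API, and it does not cover the per-executor memory reporting, which would need tracking inside the BlockManager.

```python
class BroadcastRegistry(object):
    """Hypothetical driver-side registry mapping names to Broadcast objects.

    Wraps an object exposing SparkContext's broadcast(value) method; all
    method names here are illustrative, not part of the Spark API.
    """

    def __init__(self, sc):
        self._sc = sc
        self._vars = {}  # name -> Broadcast

    def broadcast(self, name, value):
        # Register under a name so the reference need not be passed around.
        if name in self._vars:
            raise ValueError("broadcast %r already registered" % name)
        self._vars[name] = self._sc.broadcast(value)
        return self._vars[name]

    def names(self):
        # Equivalent of the proposed sc.getBroadcasted().
        return sorted(self._vars)

    def get(self, name):
        # Equivalent of the proposed sc.getBroadcastRef(name).
        return self._vars[name]

    def unpersist(self, name, blocking=False):
        # Equivalent of the proposed sc.unpersist(name); also forgets the
        # name, so double-unpersist becomes an explicit KeyError.
        self._vars.pop(name).unpersist(blocking)
```

With something like this, the driver can list, fetch, and unpersist broadcasts by name alone, and a second unpersist of the same name fails loudly instead of silently operating on a dead reference.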


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
