[ https://issues.apache.org/jira/browse/SPARK-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735141#comment-14735141 ]
Brad Willard edited comment on SPARK-10488 at 9/8/15 4:54 PM:
--------------------------------------------------------------

[~srowen] I have it working via that method at the moment. It's just annoying because it forces each notebook to use the same amount of resources on the cluster, whereas I was able to configure that through SparkConf on all previous versions of Spark with the above code. My use case is that some notebooks run a deep historical job on 4 billion rows and need the entire cluster, while other notebooks look at smaller datasets (1-5 million rows) that need only 1/10th of the cluster. I really dislike that I've lost that configurability now.

was (Author: brdwrd):
[~srowen] I have it working via that method at the moment. It's just annoying because it forces each notebook to use the same amount of resources on the cluster, whereas I was able to configure that through SparkConf on all previous versions of Spark with the above code. So I have some notebooks that are doing a deep historical job on 4 billion rows that require the entire cluster and would request those resources, however other notebooks would look at smaller datasets (1-5 million) that require only 1/10th of the cluster. I really dislike that I've lost that configurability now.

> No longer possible to create SparkConf in pyspark application
> -------------------------------------------------------------
>
>                 Key: SPARK-10488
>                 URL: https://issues.apache.org/jira/browse/SPARK-10488
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.0, 1.4.1
>        Environment: pyspark on ec2 deployed cluster
>            Reporter: Brad Willard
>
> I used to be able to make SparkContext connections directly in ipython
> notebooks so that each notebook could have different resources on the
> cluster. This worked perfectly until spark 1.4.x.
> This code worked on all previous versions of spark and no longer works:
> {code}
> from pyspark import SparkConf, SparkContext
> from pyspark.sql import SQLContext
>
> cpus = 15
> ram = 5
>
> conf = SparkConf().set('spark.executor.memory', '%sg' % ram).set('spark.cores.max', str(cpus))
>
> cluster_url = 'spark://%s:7077' % master
> job_name = 'test'
> sc = SparkContext(cluster_url, job_name, conf=conf)
> {code}
> It errors on the SparkConf() line because you can't even make that object in
> python now without the SparkContext already created, which makes no sense
> to me.
> {code}
> ---------------------------------------------------------------------------
> Exception                                 Traceback (most recent call last)
> <ipython-input-4-453520c03f2b> in <module>()
>       5 ram = 5
>       6
> ----> 7 conf = SparkConf().set('spark.executor.memory', '%sg' % ram).set('spark.cores.max', str(cpus))
>       8
>       9 cluster_url = 'spark://%s:7077' % master
>
> /root/spark/python/pyspark/conf.py in __init__(self, loadDefaults, _jvm, _jconf)
>     102         else:
>     103             from pyspark.context import SparkContext
> --> 104             SparkContext._ensure_initialized()
>     105             _jvm = _jvm or SparkContext._jvm
>     106         self._jconf = _jvm.SparkConf(loadDefaults)
>
> /root/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway)
>     227         with SparkContext._lock:
>     228             if not SparkContext._gateway:
> --> 229                 SparkContext._gateway = gateway or launch_gateway()
>     230                 SparkContext._jvm = SparkContext._gateway.jvm
>     231
>
> /root/spark/python/pyspark/java_gateway.py in launch_gateway()
>      87             callback_socket.close()
>      88         if gateway_port is None:
> ---> 89             raise Exception("Java gateway process exited before sending the driver its port number")
>      90
>      91     # In Windows, ensure the Java child processes do not linger after Python has exited.
> Exception: Java gateway process exited before sending the driver its port number
> {code}
> I am able to work around this by setting all the pyspark environment
> variables for the ipython notebook, but then each notebook is forced to have
> the same resources, which isn't great if you run lots of different types of
> jobs ad hoc.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
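For reference, the environment-variable workaround discussed above can be sketched as follows. This is a minimal sketch, not the reporter's exact setup: the master hostname and the resource values are placeholders, and PYSPARK_SUBMIT_ARGS must be set before the notebook process starts pyspark, which is why every notebook launched from that environment ends up with the same resources.

```python
import os

# Hypothetical per-environment values; with this workaround they apply to
# every notebook started from this environment, not per notebook.
cpus = 15
ram_gb = 5
master = 'master-host'  # placeholder hostname

# Bake the resource settings into PYSPARK_SUBMIT_ARGS before pyspark is
# imported; pyspark's launcher reads this variable when it starts the JVM.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--master spark://%s:7077 '
    '--conf spark.executor.memory=%dg '
    '--conf spark.cores.max=%d '
    'pyspark-shell' % (master, ram_gb, cpus)
)
```

Because the variable is read once at JVM launch, changing it from inside an already-running notebook has no effect, which is exactly the loss of per-notebook configurability the comment complains about.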