oulenz commented on issue #23715: [SPARK-26803][PYTHON] Add sbin subdirectory to pyspark
URL: https://github.com/apache/spark/pull/23715#issuecomment-459950912

> Yep I'm pretty sure now that the pip package has all those jars so you can run locally, so 'pyspark' does anything at all without a cluster. Local mode is just for testing and experiments; Spark doesn't have much point on one machine for anything 'real'. I don't think running a history server with local execution is a reasonable use case as a result.

I think the problem is that you and other developers have been using Spark for years and know exactly what you are doing. For someone new to Spark, everything starts with testing and experiments and trying to understand how jobs are executed. For that, the history server is essential. Otherwise Spark is just a black box, right?

Let me give you a concrete example. I am implementing an algorithm which unfortunately involves a near-cross join. Using the history server I

- could confirm that the large join was indeed the stage that was taking the most time
- discovered that Spark was using only a few large tasks for the join that were waiting on each other, so I had to repartition the tables involved
- realised that Spark does not keep DataFrames around once they have been used, so the computationally intensive join was actually calculated *twice*, and so I learned about caching

> Maybe more to the point: those scripts don't work without a Spark distro; they expect SPARK_HOME to be set or to be run from within an unpacked distribution. Maybe you'll shock me again by saying it really does happen to work with the pip package layout too, but I've never understood that to be supported or the intent, something that's being maintained.
>
> Does the script even work when packaged this way?

Yes, that's what I've been trying to tell you. I've manually downloaded the `sbin` folder into my pyspark folder and have been using the history server no problemo.
Otherwise I wouldn't be proposing this PR.
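
For anyone who wants to reproduce this kind of local experiment, here is a minimal sketch of the application-side settings that make a local run show up in the history server. The helper names, the app name and the `/tmp/spark-events` directory are only illustrative assumptions; `spark.eventLog.enabled` and `spark.eventLog.dir` are the standard Spark properties, and the history server itself has to be pointed at the same location through `spark.history.fs.logDirectory`.

```python
from pathlib import Path


def event_log_conf(log_dir):
    """Return the application-side settings that write a Spark event log.

    The history server reads its own property, spark.history.fs.logDirectory,
    which has to point at the same location (e.g. via spark-defaults.conf).
    """
    uri = Path(log_dir).expanduser().resolve().as_uri()  # file:// URI of the log directory
    return {
        "spark.eventLog.enabled": "true",  # record an event log for each application
        "spark.eventLog.dir": uri,         # where the application writes that log
    }


def build_local_session(log_dir, app_name="history-server-experiment"):
    """Create a local SparkSession with event logging on (hypothetical helper).

    Needs pyspark installed; the import is lazy so event_log_conf() works without it.
    """
    from pyspark.sql import SparkSession

    Path(log_dir).mkdir(parents=True, exist_ok=True)  # Spark expects the directory to exist
    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in event_log_conf(log_dir).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Print the settings so they can be copied into spark-defaults.conf if preferred.
    for key, value in event_log_conf("/tmp/spark-events").items():
        print(key, value)
```

With a session built this way, the fixes from the second and third bullet above are the ordinary DataFrame calls `df.repartition(n)` and `df.cache()`, and their effect on task counts and repeated stages can be checked in the history server UI afterwards.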
