[GitHub] oulenz commented on issue #23715: [SPARK-26803][PYTHON] Add sbin subdirectory to pyspark

GitBox Sat, 02 Feb 2019 01:37:22 -0800

oulenz commented on issue #23715: [SPARK-26803][PYTHON] Add sbin subdirectory 
to pyspark
URL: https://github.com/apache/spark/pull/23715#issuecomment-459950912
 
 
   > Yep I'm pretty sure now that the pip package has all those jars so you can 
run locally, so 'pyspark' does anything at all without a cluster. Local mode is 
just for testing and experiments; Spark doesn't have much point on one machine 
for anything 'real'. I don't think running a history server with local 
execution is a reasonable use case as a result.
   
   I think the problem is that you and other developers have been using Spark 
for years and know exactly what you are doing. For someone new to Spark, 
everything starts with testing and experiments and trying to understand how 
jobs are executed. For that, the history server is essential. Otherwise Spark 
is just a black box, right?
   
   Let me give you a concrete example. I am implementing an algorithm that 
unfortunately involves a near-cross join. Using the history server, I
   
   - could confirm that the large join was indeed the stage that was taking 
the most time
   - discovered that Spark was using only a few large tasks for the join that 
were waiting on each other, so I had to repartition the tables involved
   - realised that Spark does not keep a DataFrame's results around once they 
have been used, so the computationally intensive join was actually calculated 
*twice*, and so I learned about caching
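   For illustration, the last two points can be sketched roughly as follows. 
This is a minimal sketch, not the code from my algorithm: the function name 
and parameters are hypothetical, and it assumes `left` and `right` are PySpark 
DataFrames that share a join column. Whether repartitioning by the join key 
actually helps depends on the data and on the join strategy Spark picks.

```python
def repartitioned_cached_join(left, right, key, num_partitions):
    """Join two PySpark DataFrames on `key` and mark the result for caching.

    Hypothetical helper for illustration; `left` and `right` are assumed
    to be pyspark.sql.DataFrame objects that both contain column `key`.
    """
    # Repartition both sides so the join is split into more, smaller tasks
    # instead of a few large ones.
    left_parts = left.repartition(num_partitions, key)
    right_parts = right.repartition(num_partitions, key)

    joined = left_parts.join(right_parts, on=key, how="inner")

    # cache() is lazy: it only marks the DataFrame for caching. The first
    # action computes the join and stores it; later actions can reuse the
    # stored result instead of recomputing the join.
    return joined.cache()
```

With a result like this, a first action such as `joined.count()` computes the 
join, and subsequent actions on `joined` should show up in the history server 
as reading from the cache rather than as a second copy of the join stage.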
   
   > 
   > Maybe more to the point: those scripts don't work without a Spark distro; 
they expect SPARK_HOME to be set or to be run from within an unpacked 
distribution. Maybe you'll shock me again by saying it really does happen to 
work with the pip package layout too, but I've never understood that to be 
supported or the intent, something that's being maintained.
   > 
   > Does the script even work when packaged this way?
   
   Yes, that's what I've been trying to tell you. I've manually downloaded the 
`sbin` folder into my pyspark folder and have been using the history server no 
problemo. Otherwise I wouldn't be proposing this PR.
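   To check whether a given pip installation has the folder, something like 
the sketch below could be used. It relies only on the standard library; the 
helper name `find_package_subdir` is hypothetical, and it assumes pyspark was 
installed as a regular package directory. It only reports where an `sbin` 
folder would be, not whether the scripts inside it run correctly.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_subdir(package: str, subdir: str) -> Optional[Path]:
    """Return the path of `subdir` inside an installed package, or None.

    Hypothetical helper for illustration: returns None when the package is
    not installed, is a plain module, or has no such subdirectory.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / subdir
        if candidate.is_dir():
            return candidate
    return None


# Look for an sbin folder inside the installed pyspark package, if any.
sbin = find_package_subdir("pyspark", "sbin")
print(sbin if sbin else "no sbin directory found in the pyspark package")
```

If a path is printed, that is the folder where scripts such as 
`start-history-server.sh` would need to live.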


