[ https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224183#comment-14224183 ]
Cheng Lian commented on SPARK-4395:
-----------------------------------

Couldn't reproduce the long pause locally, and unfortunately I don't have stable EC2 access right now. Some comments about caching here:

# Calling {{.cache()}} before {{inferSchema}} invokes the normal {{RDD.cache()}} on the JVM side with the default {{MEMORY_ONLY}} storage level, and RDD elements are cached tuple by tuple as plain deserialized Java objects.
# Calling {{.cache()}} after {{inferSchema}} invokes {{SchemaRDD.cache()}} on the JVM side with the default {{MEMORY_AND_DISK}} storage level, and RDD elements are compactly serialized and cached in the in-memory columnar format. This way, far fewer Java objects remain in the JVM heap, which lowers GC pressure. The default storage level differs because building the in-memory columnar table can be costly for large tables.
# If we want to manipulate the cached data with Spark SQL, the second caching strategy is almost always the better choice. However, the test data used here only consists of about 4M integers; I really don't think the caching format alone can cause such a significant performance difference...

> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4395
>                 URL: https://issues.apache.org/jira/browse/SPARK-4395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>        Environment: version 1.2.0-SNAPSHOT
>            Reporter: Sameer Farooqui
>
> When I run this command it hangs for one to many hours and then finally
> returns with successful results:
>
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
>
> Note, the lab environment below is still active, so let me know if you'd like
> to just access it directly.
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
> Ran:
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
> {code}
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = r'^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ...     match = re.search(RATINGS_PATTERN, line)
> ...     if match is None:
> ...         # Optionally, you can change this to just skip the line if each line of data is not critical.
> ...         raise ValueError("Invalid logline: %s" % line)
> ...     return Row(
> ...         UserID = int(match.group(1)),
> ...         MovieID = int(match.group(2)),
> ...         Rating = int(match.group(3)),
> ...         Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...     # Call the parse_ratings_line function on each line.
> ...     .map(parse_ratings_line)
> ...     # Cache the objects in memory since they will be queried multiple times.
> ...     .cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> {code}
> (Now the Python shell hangs...)
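As a back-of-the-envelope illustration of point 2 in the comment above (plain Python rather than Spark's JVM-side cache; the row count and MovieLens-style column names are made up for the sketch), packing each field into one contiguous buffer leaves far fewer heap objects than holding one boxed object per row:

```python
import sys
from array import array

# Row-wise: one container object per row, loosely analogous to caching
# deserialized Java tuples with plain RDD.cache() (MEMORY_ONLY).
n = 100_000
rows = [{"UserID": i, "MovieID": i * 7, "Rating": i % 5, "Timestamp": i}
        for i in range(n)]
row_bytes = sum(sys.getsizeof(r) for r in rows)

# Columnar: each field packed into a single contiguous 64-bit buffer,
# loosely analogous to SchemaRDD's serialized in-memory columnar cache.
cols = {name: array("q", (r[name] for r in rows)) for name in rows[0]}
col_bytes = sum(sys.getsizeof(a) for a in cols.values())

print(row_bytes > 2 * col_bytes)  # True
```

The columnar layout also leaves the garbage collector with four buffers to scan instead of 100,000 containers, which is the GC-pressure point the comment makes; the actual JVM-side columnar format is more elaborate, but the object-count argument is the same.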