[ 
https://issues.apache.org/jira/browse/SPARK-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224183#comment-14224183
 ] 

Cheng Lian commented on SPARK-4395:
-----------------------------------

Couldn't reproduce the long pause locally, and unfortunately I don't have stable 
EC2 access right now. Some comments about caching here:

# Calling {{.cache()}} before {{inferSchema}} invokes the normal {{RDD.cache()}} on 
the JVM side with the default {{MEMORY_ONLY}} storage level, and RDD elements are 
cached tuple by tuple as plain deserialized Java objects.
# Calling {{.cache()}} after {{inferSchema}} invokes {{SchemaRDD.cache()}} on the JVM 
side with the default {{MEMORY_AND_DISK}} storage level, and RDD elements are 
compactly serialized and cached in the in-memory columnar format. This way, far 
fewer Java objects remain in the JVM heap, which lowers GC pressure. The default 
storage levels differ because building the in-memory columnar table can be costly 
for large tables.
# If we want to manipulate the cached data with Spark SQL, the second caching 
strategy is almost always the better choice; both approaches are sketched below.
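
A minimal sketch of the two strategies, assuming the 1.2 PySpark API and reusing 
the names from the reporter's snippet below ({{sc}}, {{sqlContext}}, 
{{parse_ratings_line}} and the ratings file path):

{code}
# Strategy 1: cache before inferSchema -- the raw Row RDD is cached on the JVM
# side with MEMORY_ONLY, tuple by tuple, as plain deserialized Java objects.
rows = sc.textFile("file:///home/ec2-user/movielens/ratings.dat").map(parse_ratings_line)
rows.cache()
schemaRatings = sqlContext.inferSchema(rows)

# Strategy 2: infer the schema first, then cache the SchemaRDD --
# SchemaRDD.cache() stores the data with MEMORY_AND_DISK in the compact
# in-memory columnar format, leaving far fewer Java objects on the heap.
rows = sc.textFile("file:///home/ec2-user/movielens/ratings.dat").map(parse_ratings_line)
schemaRatings = sqlContext.inferSchema(rows)
schemaRatings.cache()
schemaRatings.registerTempTable("RatingsTable")
sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
{code}

If I remember correctly, calling {{sqlContext.cacheTable("RatingsTable")}} after 
registering the table achieves the same columnar caching as the second strategy.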

However, the test data used here only consists of about 4M integers; I really 
don't think the caching format alone can cause such a significant performance 
difference...

> Running a Spark SQL SELECT command from PySpark causes a hang for ~ 1 hour
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4395
>                 URL: https://issues.apache.org/jira/browse/SPARK-4395
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: version 1.2.0-SNAPSHOT
>            Reporter: Sameer Farooqui
>
> When I run this command it hangs for one to several hours and then finally 
> returns successful results:
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> Note, the lab environment below is still active, so let me know if you'd like 
> to just access it directly.
> +++ My Environment +++
> - 1-node cluster in Amazon
> - RedHat 6.5 64-bit
> - java version "1.7.0_67"
> - SBT version: sbt-0.13.5
> - Scala version: scala-2.11.2
> Ran: 
> sudo yum -y update
> git clone https://github.com/apache/spark
> sudo sbt assembly
> +++ Data file used +++
> http://blueplastic.com/databricks/movielens/ratings.dat
> {code}
> >>> import re
> >>> import string
> >>> from pyspark.sql import SQLContext, Row
> >>> sqlContext = SQLContext(sc)
> >>> RATINGS_PATTERN = '^(\d+)::(\d+)::(\d+)::(\d+)'
> >>>
> >>> def parse_ratings_line(line):
> ...     match = re.search(RATINGS_PATTERN, line)
> ...     if match is None:
> ...         # Optionally, you can change this to just ignore if each line of data is not critical.
> ...         raise ValueError("Invalid logline: %s" % line)
> ...     return Row(
> ...         UserID    = int(match.group(1)),
> ...         MovieID   = int(match.group(2)),
> ...         Rating    = int(match.group(3)),
> ...         Timestamp = int(match.group(4)))
> ...
> >>> ratings_base_RDD = (sc.textFile("file:///home/ec2-user/movielens/ratings.dat")
> ...                # Call the parse_ratings_line function on each line.
> ...                .map(parse_ratings_line)
> ...                # Caches the objects in memory since they will be queried multiple times.
> ...                .cache())
> >>> ratings_base_RDD.count()
> 1000209
> >>> ratings_base_RDD.first()
> Row(MovieID=1193, Rating=5, Timestamp=978300760, UserID=1)
> >>> schemaRatings = sqlContext.inferSchema(ratings_base_RDD)
> >>> schemaRatings.registerTempTable("RatingsTable")
> >>> sqlContext.sql("SELECT * FROM RatingsTable limit 5").collect()
> {code}
> (Now the Python shell hangs...)


