Re: Get size of rdd in memory
It's already fixed in the master branch. Sorry that we forgot to update this before releasing 1.2.0 and caused you trouble...

Cheng
Re: Get size of rdd in memory
Great, thank you very much. I was confused because this is in the docs (https://spark.apache.org/docs/1.2.0/sql-programming-guide.html) and on the "branch-1.2" branch (https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md):

"Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will not be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case."

If this is no longer accurate, I could make a PR to remove it.
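For concreteness, a minimal sketch of the two caching paths the quoted note contrasts, reusing the KV case class and spark-shell setup from Cheng's snippet further down the thread (the table name "kv" is made up for illustration):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._

    case class KV(key: Int, value: String)
    val rdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD

    rdd.registerTempTable("kv")   // "kv" is a hypothetical table name
    sqlContext.cacheTable("kv")   // the path the note recommends: in-memory columnar format
    // vs.
    rdd.cache()                   // the path the note warns about; per this thread,
                                  // equivalent to cacheTable since Spark 1.2.0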
Re: Get size of rdd in memory
Actually SchemaRDD.cache() behaves exactly the same as cacheTable since Spark 1.2.0. The reason why your web UI didn't show you the cached table is that both cacheTable and sql("SELECT ...") are lazy :-) Simply add a .collect() after the sql(...) call.

Cheng
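A minimal sketch of that fix, applied to the snippet in the next message (the sqc SQLContext and the "test" table come from that snippet):

    sqc.cacheTable("test")
    // Both cacheTable and sql(...) are lazy; collect() forces execution,
    // so the cached table then appears in the web UI storage tab.
    sqc.sql("SELECT COUNT(*) FROM test").collect()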
Re: Get size of rdd in memory
Thanks for your response. So AFAICT calling

    parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()

will allow me to see the size of the SchemaRDD in memory, and

    parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count()

will show me the size of a regular RDD. But this will not show us the size when using cacheTable(), right? Like if I call

    parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.registerTempTable("test")
    sqc.cacheTable("test")
    sqc.sql("SELECT COUNT(*) FROM test")

the web UI does not show us the size of the cached table.
Re: Get size of rdd in memory
Here is a toy spark-shell session snippet that can show the memory consumption difference:

    import org.apache.spark.sql.SQLContext
    import sc._

    val sqlContext = new SQLContext(sc)
    import sqlContext._

    setConf("spark.sql.shuffle.partitions", "1")

    case class KV(key: Int, value: String)

    parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()
    parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count()

You may see the result from the storage page of the web UI. It suggests the in-memory columnar version uses 11.6KB while the raw RDD version uses 76.6KB on my machine. Not quite sure how to do the comparison programmatically, but you can track the data source of the "Size in Memory" field shown in the web UI storage tab.

Cheng
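One possible starting point for doing the comparison programmatically (not from the original thread, so treat it as a sketch): SparkContext exposes the developer API getRDDStorageInfo, which backs the same numbers the storage tab displays.

    // Print in-memory/on-disk sizes of all cached RDDs, as reported to the web UI.
    // getRDDStorageInfo is a @DeveloperApi, so it may change between releases.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }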
Get size of rdd in memory
Hi, I want to benchmark the memory savings from using the in-memory columnar storage for SchemaRDDs (using cacheTable) vs caching the SchemaRDD directly. It would be really helpful to be able to query this from the spark-shell or jobs directly. Could a dev point me to the way to do this? From what I understand I will need a reference to the block manager, or something like RDDInfo.fromRdd(rdd).memSize. I could use reflection or whatever to override the private access modifiers.