Re: Get size of rdd in memory

2015-01-30 Thread Cheng Lian
Here is a toy spark-shell session snippet that shows the memory
consumption difference:


import org.apache.spark.sql.SQLContext
import sc._

val sqlContext = new SQLContext(sc)
import sqlContext._

setConf("spark.sql.shuffle.partitions", "1")

case class KV(key: Int, value: String)

parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count()
parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count()

You can see the result on the Storage page of the web UI: on my machine,
the in-memory columnar version uses 11.6KB while the raw RDD version
uses 76.6KB.
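
For the cacheTable route mentioned in the question below, the same
measurement can be reproduced along these lines (an untested sketch against
the same shell session; the table name "kv" is only an example):

// Register the RDD as a temp table, then cache it in the columnar format.
parallelize(1 to 1024).map(i => KV(i, i.toString)).registerTempTable("kv")
cacheTable("kv")
// cacheTable is lazy; run a query to force materialization.
sql("SELECT COUNT(*) FROM kv").collect()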


I'm not quite sure how to do the comparison programmatically. You could
trace the data source of the “Size in Memory” field shown in the Storage
tab of the web UI.
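
One possible starting point (an untested sketch, and not necessarily what
the UI itself does): SparkContext.getRDDStorageInfo is a @DeveloperApi
method exposing the same per-RDD counters the Storage tab renders:

val cached = parallelize(1 to 1024).map(i => KV(i, i.toString)).cache()
cached.count()  // materialize so the block manager reports sizes

// One RDDInfo per persisted RDD; match on the RDD's id.
sc.getRDDStorageInfo
  .filter(_.id == cached.id)
  .foreach(info => println(s"RDD ${info.id}: ${info.memSize} bytes in memory"))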


Cheng

On 1/30/15 6:15 PM, ankits wrote:


> Hi,
>
> I want to benchmark the memory savings from using the in-memory columnar
> storage for SchemaRDDs (via cacheTable) versus caching the SchemaRDD
> directly. It would be really helpful to be able to query this from the
> spark-shell or from jobs directly. Could a dev point me to the way to do
> this? From what I understand, I will need a reference to the block manager,
> or something like RDDInfo.fromRdd(rdd).memSize.
>
> I could use reflection or whatever to bypass the private access modifiers.





Get size of rdd in memory

2015-01-30 Thread ankits
Hi,

I want to benchmark the memory savings from using the in-memory columnar
storage for SchemaRDDs (via cacheTable) versus caching the SchemaRDD
directly. It would be really helpful to be able to query this from the
spark-shell or from jobs directly. Could a dev point me to the way to do
this? From what I understand, I will need a reference to the block manager,
or something like RDDInfo.fromRdd(rdd).memSize.

I could use reflection or whatever to bypass the private access modifiers.
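
Something like the following is what I have in mind (untested; it assumes
RDDInfo.fromRdd keeps its current private[spark] signature, and the size
fields on a freshly built RDDInfo may stay zero until the storage listener
updates them):

import org.apache.spark.rdd.RDD

// Untested sketch: call the private[spark] RDDInfo.fromRdd through its
// companion object via reflection.
def memSizeViaReflection(rdd: RDD[_]): Long = {
  val companion = Class.forName("org.apache.spark.storage.RDDInfo$")
  val module = companion.getField("MODULE$").get(null)
  val info = companion.getMethod("fromRdd", classOf[RDD[_]]).invoke(module, rdd)
  // memSize may be zero here; the live values sit in the UI's listener state.
  info.getClass.getMethod("memSize").invoke(info).asInstanceOf[Long]
}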



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-30 Thread Marcelo Vanzin
+1 (non-binding)

Ran spark-shell and Scala jobs on top of yarn (using the hadoop-2.4 tarball).

There's a very slight behavioral change in the API. This code now throws an NPE:

  new SparkConf().setIfMissing("foo", null)

It worked before. It's probably fine, though, since `SparkConf.set`
already threw an NPE for the same arguments, so it's unlikely
anyone was relying on that behavior.
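
Callers that might pass a null can guard for it, e.g. (just a sketch;
`maybeValue` stands in for whatever the caller has):

import org.apache.spark.SparkConf

// Skip the call entirely when the value is null, since both set() and
// setIfMissing() now reject null values.
val conf = new SparkConf()
val maybeValue: String = null  // hypothetical input
Option(maybeValue).foreach(v => conf.setIfMissing("foo", v))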


On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1062/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>
> Changes from rc1:
> This has no code changes from RC1. Only minor changes to the release script.
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until Saturday, January 31, at 10:04 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>



-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org