How to debug Spark source using IntelliJ/Eclipse

2015-12-05 Thread jatinganhotra
Hi,

I am trying to understand Spark's internal code and want to debug the Spark
source in order to add a new feature. I have tried the steps outlined on the
Spark Wiki page "IDE setup", but they don't work.

I also found other posts on the dev mailing list, such as:

1.  Spark-1-5-0-setting-up-debug-env, and
2.  using-IntelliJ-to-debug-SPARK-1-1-Apps-with-mvn-sbt-for-beginners

However, I ran into many issues with both posts. I have followed each of
them many times, often restarting from scratch after deleting and
reinstalling everything, but I always run into dependency issues.

It would be great if someone from the Spark developers group could point me
to the steps for setting up a Spark debug environment.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-debug-Spark-source-using-IntelliJ-Eclipse-tp15477.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



How can I access data on RDDs?

2015-10-06 Thread jatinganhotra
Consider the following 2 scenarios:

*Scenario #1*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count

*Scenario #2*
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count

The total time shown in the Spark shell Application UI differed between the
two scenarios. /Scenario #1 took 0.5 seconds, while scenario #2 took only
0.2 seconds./
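
The same comparison can be approximated outside the UI by timing each action
on the driver. The following is only a minimal sketch, meant to be pasted into
spark-shell (where `sc` already exists). The `timed` helper and the checkpoint
directory path are assumptions made for illustration and are not part of Spark
or of the original runs.

```scala
// Run in spark-shell, where `sc` is the provided SparkContext.

// Hypothetical helper (not part of Spark): evaluates `block` once and
// returns its result together with the elapsed wall-clock time in ms.
def timed[A](block: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = block
  (result, (System.nanoTime() - start) / 1000000L)
}

// Assumed location; checkpoint() requires a checkpoint directory to be set.
sc.setCheckpointDir("/tmp/spark-checkpoints")

// Scenario #1: mark the RDD for checkpointing, then run the action.
val withCheckpoint = sc.textFile("data/pagecounts")
withCheckpoint.checkpoint()
val (count1, ms1) = timed(withCheckpoint.count())

// Scenario #2: the same action without checkpointing.
val plain = sc.textFile("data/pagecounts")
val (count2, ms2) = timed(plain.count())

println(s"scenario #1: $count1 lines in $ms1 ms")
println(s"scenario #2: $count2 lines in $ms2 ms")
```

The difference between the two timings is only a rough estimate of the
checkpoint cost: the second read may benefit from a warm file-system cache,
and driver-side wall-clock time also includes scheduling overhead.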

*Questions:*
1. I understand that scenario #1 takes more time because the RDD is
checkpointed (written to disk). Is there a way to find out how much of the
total time was spent on checkpointing?

The Spark shell Application UI shows the following: scheduler delay, task
deserialization time, GC time, result serialization time, and getting result
time. However, it doesn't show a breakdown for checkpointing.

2. Is there a way to access the above metrics (e.g. scheduler delay, GC
time) and save them programmatically? I want to log some of these metrics
for every action invoked on an RDD.

3. How can I programmatically access the following information:
- The size of an RDD when it is persisted to disk by checkpointing?
- What percentage of an RDD is currently in memory?
- The overall time taken to compute an RDD?
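
As one possible starting point for questions 2 and 3, here is a minimal
sketch that assumes a spark-shell session (`sc` available) and an RDD such as
`pagecounts` from the scenarios above. `TaskMetricsLogger` is a hypothetical
name chosen for illustration. Scheduler delay is not exposed directly as a
task metric, so it is not logged here, and `getRDDStorageInfo` covers only
RDDs that have been persisted, so the checkpoint size is read from the
checkpoint files instead.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener: prints a few per-task metrics when a task ends.
class TaskMetricsLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    val metrics = taskEnd.taskMetrics
    // Metrics may be missing, for example when a task fails.
    if (info != null && metrics != null) {
      println(s"stage=${taskEnd.stageId} task=${info.taskId} " +
        s"durationMs=${info.duration} " +
        s"deserializeMs=${metrics.executorDeserializeTime} " +
        s"runMs=${metrics.executorRunTime} " +
        s"gcMs=${metrics.jvmGCTime} " +
        s"resultSerializationMs=${metrics.resultSerializationTime}")
    }
  }
}

// Register the listener before running any actions.
sc.addSparkListener(new TaskMetricsLogger)

// Storage status of persisted RDDs: cached fraction, memory and disk size.
sc.getRDDStorageInfo.foreach { rddInfo =>
  val pctCached =
    if (rddInfo.numPartitions == 0) 0.0
    else 100.0 * rddInfo.numCachedPartitions / rddInfo.numPartitions
  println(s"rdd=${rddInfo.id} name=${rddInfo.name} " +
    s"cachedPartitionsPct=$pctCached " +
    s"memSize=${rddInfo.memSize} diskSize=${rddInfo.diskSize}")
}

// Size of the checkpoint files, once the RDD has actually been checkpointed.
pagecounts.getCheckpointFile.foreach { dir =>
  val path = new Path(dir)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  println(s"checkpointBytes=${fs.getContentSummary(path).getLength}")
}
```

The cached percentage above is computed per partition, not per byte, so it
only approximates how much of the RDD is in memory. Writing the logged values
to a file or a metrics store instead of `println` is left open.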

Please let me know if you need more information.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-can-I-access-data-on-RDDs-tp14475.html