Re: Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-18 Thread abhiguruvayya
Hello Mayur,

The 3 in new RangePartitioner(*3*, partitionedFile) is also a hard-coded
value for the number of partitions. Do you know of a way I can avoid
that? Also, besides the cluster size, the number of partitions depends on the
input data size as well. Correct me if I am wrong.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-regarding-spark-data-partition-and-coalesce-Need-info-on-my-use-case-tp12214p12311.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-15 Thread abhiguruvayya
My use case is as mentioned below.

1. Read the input data from the local file system using
sparkContext.textFile(input path).
2. Partition the input data (80 million records) into partitions using
RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer
function. Without coalesce() or repartition() on the input data, Spark
executes really slowly and fails with an out-of-memory exception.

The issue I am facing here is deciding the number of partitions to
apply to the input data. *The input data size varies every time, so hard
coding a particular value is not an option. Spark performs really well
only when a near-optimal number of partitions is applied to the input data,
and finding it takes lots of iteration (trial and error), which is not an
option in a production environment.*

My question: Is there a rule of thumb to decide the number of partitions
required depending on the input data size and the cluster resources
available (executors, cores, etc.)? If yes, please point me in that
direction. Any help is much appreciated.

I am using Spark 1.0 on YARN.
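One size-based heuristic (a sketch, assuming a single local input file and a hypothetical 128 MB target per partition) would be to derive the partition count from the input size rather than hard-coding it:

import java.io.File;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionSizing {
  // Hypothetical target: roughly 128 MB of input per partition.
  private static final long TARGET_BYTES_PER_PARTITION = 128L * 1024 * 1024;

  public static JavaRDD<String> readWithSizedPartitions(JavaSparkContext sc, String inputPath) {
    // Assumes a single local file; for HDFS or a directory, sum the file sizes instead.
    long inputBytes = new File(inputPath).length();
    int numPartitions = (int) Math.max(1, inputBytes / TARGET_BYTES_PER_PARTITION);
    // repartition() shuffles the data into the requested number of partitions.
    return sc.textFile(inputPath).repartition(numPartitions);
  }
}

The right target per partition depends on executor memory; the 3 * num_executors * cores_per_executor rule of thumb mentioned further down is another starting point.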

Thanks,
AG



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-regarding-spark-data-partition-and-coalesce-Need-info-on-my-use-case-tp12214.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark job tracker.

2014-08-04 Thread abhiguruvayya
I am trying to create an asynchronous thread using the Java ExecutorService and
launch the JavaSparkContext in this thread, but it is failing with exit
code 0 (zero). I basically want to submit the Spark job in one thread and
continue doing something else after submitting. Any help on this? Thanks.
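A minimal sketch of the pattern in question, assuming a yarn-client master and a hypothetical input path (the real job logic would replace the count() call):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.spark.api.java.JavaSparkContext;

public class AsyncSubmit {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor();

    // Run the Spark job on a background thread.
    Future<Long> job = pool.submit(new Callable<Long>() {
      public Long call() {
        JavaSparkContext sc = new JavaSparkContext("yarn-client", "AsyncJob");
        try {
          return sc.textFile("hdfs:///tmp/input").count();  // hypothetical input path
        } finally {
          sc.stop();
        }
      }
    });

    // The main thread is free to do other work here while the job runs.

    System.out.println("record count = " + job.get());  // block only when the result is needed
    pool.shutdown();
  }
}

Holding on to the Future and eventually calling get() (or otherwise keeping the main thread alive) prevents the JVM from exiting before the background job finishes, which can otherwise look like an exit with code 0.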



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p11351.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-24 Thread abhiguruvayya
Can anyone help me understand the key differences between the mapToPair,
flatMapToPair, and flatMap functions, and in particular when to apply each of
them?
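A minimal side-by-side sketch (Spark 1.0 Java API, operating on a hypothetical RDD of text lines): flatMap turns one element into zero or more elements, mapToPair turns one element into exactly one key/value pair, and flatMapToPair turns one element into zero or more key/value pairs.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class PairVsFlatMapExamples {
  public static void examples(JavaRDD<String> lines) {
    // flatMap: one line -> many words (element to zero-or-more elements).
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) {
        return Arrays.asList(line.split(" "));
      }
    });

    // mapToPair: one word -> exactly one (word, 1) pair.
    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String word) {
        return new Tuple2<String, Integer>(word, 1);
      }
    });

    // flatMapToPair: one line -> many (word, 1) pairs in a single step.
    JavaPairRDD<String, Integer> pairs = lines.flatMapToPair(
        new PairFlatMapFunction<String, String, Integer>() {
          public Iterable<Tuple2<String, Integer>> call(String line) {
            List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
            for (String word : line.split(" ")) {
              out.add(new Tuple2<String, Integer>(word, 1));
            }
            return out;
          }
        });
  }
}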



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapToPair-vs-flatMapToPair-vs-flatMap-function-usage-tp10617.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-23 Thread abhiguruvayya
Is there anything equivalent to the Hadoop Job
(org.apache.hadoop.mapreduce.Job) in Spark? Once I submit the Spark job, I
want to concurrently read the SparkListener interface implementation methods
where I can grab the job status. I am trying to find a way to wrap the Spark
submit object into one thread and read the SparkListener interface
implementation methods in another thread.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p10548.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Need info on log4j.properties for apache spark.

2014-07-22 Thread abhiguruvayya
Hello All,

Basically I need to edit log4j.properties to filter out some of the
unnecessary logs in Spark in yarn-client mode. I am not sure where I can
find the log4j.properties file (its location). Can anyone help me with this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-info-on-log4j-properties-for-apache-spark-tp10431.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-22 Thread abhiguruvayya
I fixed the yarn-client mode issue which I mentioned in my
earlier post. Now I want to edit log4j.properties to filter out some of the
unnecessary logs. Can you let me know where I can find this properties file?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p10433.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-22 Thread abhiguruvayya
Thanks, I am able to load the file now. Can I turn off specific logs using
log4j.properties? I don't want to see the logs below. How can I do this?

14/07/22 14:01:24 INFO scheduler.TaskSetManager: Starting task 2.0:129 as
TID 129 on executor 3: ** (NODE_LOCAL)
14/07/22 14:01:24 INFO scheduler.TaskSetManager: Serialized task 2.0:129 as
14708 bytes in 0 ms

*current log4j.properties entry:*

# make a file appender and a console appender
# Print the date in ISO 8601 format
log4j.appender.myConsoleAppender=org.apache.log4j.ConsoleAppender
log4j.appender.myConsoleAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.myConsoleAppender.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
log4j.appender.myFileAppender=org.apache.log4j.RollingFileAppender
log4j.appender.myFileAppender.File=spark.log
log4j.appender.myFileAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.myFileAppender.layout.ConversionPattern=%d [%t] %-5p %c - %m%n



# By default, everything goes to console and file
log4j.rootLogger=INFO, myConsoleAppender, myFileAppender

# The noisier spark logs go to file only
log4j.logger.spark.storage=INFO, myFileAppender
log4j.additivity.spark.storage=false
log4j.logger.spark.scheduler=INFO, myFileAppender
log4j.additivity.spark.scheduler=false
log4j.logger.spark.CacheTracker=INFO, myFileAppender
log4j.additivity.spark.CacheTracker=false
log4j.logger.spark.CacheTrackerActor=INFO, myFileAppender
log4j.additivity.spark.CacheTrackerActor=false
log4j.logger.spark.MapOutputTrackerActor=INFO, myFileAppender
log4j.additivity.spark.MapOutputTrackerActor=false
log4j.logger.spark.MapOutputTracker=INFO, myFileAppender
log4j.additivity.spark.MapOutputTracker=false
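
One hedged suggestion for the TaskSetManager lines shown above: in Spark 1.0 the scheduler classes live under the org.apache.spark package (not the old spark.* names used above), so raising that logger's level should silence the per-task INFO messages, e.g.:

# Spark 1.0 logger names are under org.apache.spark, not spark.*
# Raise the per-task scheduler logger to WARN to hide the INFO lines above.
log4j.logger.org.apache.spark.scheduler.TaskSetManager=WARN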



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p10440.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-21 Thread abhiguruvayya
Hello Marcelo Vanzin,

Can you explain a bit more about this? I tried using client mode, but can you
explain how I can use this port to write the log or output to it?
Thanks in advance!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p10287.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-21 Thread abhiguruvayya
Also, I am facing one issue. If I run the program in yarn-cluster mode it
works absolutely fine, but if I change it to yarn-client mode I get the
error below.

Application application_1405471266091_0055 failed 2 times due to AM
Container for appattempt_1405471266091_0055_02 exited with exitCode:
-1000 due to: File does not exist:
/user/hadoop/.sparkStaging/application_1405471266091_0055/commons-math3-3.0.jar

I have commons-math3-3.0.jar on the classpath, and I am uploading it to the
staging directory as well. From the logs:

yarn.Client: Uploading file:***/commons-math3-3.0.jar to
***/user/hadoop/.sparkStaging/application_1405471266091_0055/commons-math3-3.0.jar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p10343.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-10 Thread abhiguruvayya
Hi Mayur, thanks so much for the explanation; it did help me. Is there a way
I can write these details to the console rather than just to the log? As of
now, once I start my application I see this:

14/07/10 00:48:20 INFO yarn.Client: Application report from ASM: 
 application identifier: application_1403538869175_0274
 appId: 274
 clientToAMToken: null
 appDiagnostics: 
 appMasterHost: ***
 appQueue: default
 appMasterRpcPort: 0
 appStartTime: 1404978412354
 yarnAppState: RUNNING
 distributedFinalState: UNDEFINED
 appTrackingUrl: ***
 appUser: hadoop

Which port do I have to hook into to write my statistics details rather than
this default output?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p9315.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-08 Thread abhiguruvayya
Hello Mayur,

How can I implement the methods mentioned below? If you have any clue on
this, please let me know.


// Callbacks from the org.apache.spark.scheduler.SparkListener interface.
@Override
public void onJobStart(SparkListenerJobStart arg0) {
  // called when a job starts
}

@Override
public void onStageCompleted(SparkListenerStageCompleted arg0) {
  // called when a stage finishes
}

@Override
public void onStageSubmitted(SparkListenerStageSubmitted arg0) {
  // called when a stage is submitted
}
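
For context, a minimal sketch of how such a listener gets wired in. The listener argument is assumed to be a class of yours that implements every callback of the SparkListener interface; Spark 1.0 ships no Java adapter base class, so each method needs at least an empty body:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.scheduler.SparkListener;

public class ListenerRegistration {
  // Register a listener on the underlying SparkContext so its callbacks fire
  // as jobs and stages start and complete.
  public static void register(JavaSparkContext ctx, SparkListener listener) {
    ctx.sc().addSparkListener(listener);
  }
}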



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p9104.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-07-02 Thread abhiguruvayya
Spark displays job status information on port 4040 using JobProgressListener.
Does anyone know how to hook into this and read the details?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p8697.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-06-27 Thread abhiguruvayya
Hello Mayur,

Are you using the SparkListener interface from the Java API? I tried using it
but was unsuccessful, so I need a few more inputs.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p8438.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark job tracker.

2014-06-27 Thread abhiguruvayya
I know this is a very trivial question to ask, but I'm a complete newbie to
this stuff so I don't have any clue about it. Any help is much appreciated.

For example, if I have a class like the one below and I run it from the
command line, I want to see progress status, something like:

10% completed...
30% completed...
100% completed...Job done!

I am using Spark 1.0 on YARN with the Java API.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class MyJavaWordCount {
  public static void main(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: MyJavaWordCount <master> <file>");
      System.exit(1);
    }

    System.out.println("args[0]: master=" + args[0]);
    System.out.println("args[1]: file=" + args[1]);

    JavaSparkContext ctx = new JavaSparkContext(
        args[0],
        "MyJavaWordCount",
        System.getenv("SPARK_HOME"),
        System.getenv("SPARK_EXAMPLES_JAR"));
    JavaRDD<String> lines = ctx.textFile(args[1], 1);

    // split each input line into words
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) {
        return Arrays.asList(s.split(" "));
      }
    });

    // pair each word with a count of 1
    JavaPairRDD<String, Integer> ones = words.mapToPair(
        new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    // sum the counts per word
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2 tuple : output) {
      System.out.println(tuple._1 + ": " + tuple._2);
    }
    System.exit(0);
  }
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-tracker-tp8367p8472.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread abhiguruvayya
Any inputs on this will be helpful.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-JavaRDD-as-a-sequence-file-using-spark-java-API-tp7969p7980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread abhiguruvayya
Does JavaPairRDD.saveAsHadoopFile store data as a SequenceFile? If so, what is
the significance of RDD.saveAsSequenceFile?
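
saveAsHadoopFile writes whatever OutputFormat it is given, so passing SequenceFileOutputFormat with Writable key/value classes does produce a SequenceFile. A minimal sketch from the Java API, assuming a (String, Integer) word-count RDD and a hypothetical output path:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class SaveAsSequenceFileExample {
  public static void save(JavaPairRDD<String, Integer> counts) {
    // Convert to Hadoop Writable types so SequenceFileOutputFormat can serialize them.
    JavaPairRDD<Text, IntWritable> writables = counts.mapToPair(
        new PairFunction<Tuple2<String, Integer>, Text, IntWritable>() {
          public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> t) {
            return new Tuple2<Text, IntWritable>(new Text(t._1()), new IntWritable(t._2()));
          }
        });
    // Hypothetical output path.
    writables.saveAsHadoopFile("hdfs:///tmp/wordcount-seq",
        Text.class, IntWritable.class, SequenceFileOutputFormat.class);
  }
}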



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-JavaRDD-as-a-sequence-file-using-spark-java-API-tp7969p7983.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread abhiguruvayya
I want to store a JavaRDD as a sequence file instead of a text file, but I
don't see a Java API for that. Is there a way to do this? Please let me know.
Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-JavaRDD-as-a-sequence-file-using-spark-java-API-tp7969.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-19 Thread abhiguruvayya
1. Once you have generated the final RDD, and before submitting it to the
reducer, repartition it into a known number of partitions using
coalesce(numPartitions) or repartition(numPartitions).
2. Rule of thumb for the number of data partitions: 3 * num_executors *
cores_per_executor (sketched below).
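
A minimal sketch of that rule of thumb in the Java API (the executor and core counts are hypothetical figures; plug in whatever your cluster actually runs):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RuleOfThumbPartitions {
  public static JavaRDD<String> load(JavaSparkContext sc, String inputPath) {
    int numExecutors = 10;       // hypothetical cluster figures
    int coresPerExecutor = 12;
    int numPartitions = 3 * numExecutors * coresPerExecutor;  // 360 partitions

    // repartition() shuffles the data evenly across the requested partitions.
    return sc.textFile(inputPath).repartition(numPartitions);
  }
}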



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-0-9-1-java-lang-outOfMemoryError-Java-Heap-Space-tp7861p7970.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-19 Thread abhiguruvayya
No. My understanding from reading the code is that RDD.saveAsObjectFile uses
Java serialization, while RDD.saveAsSequenceFile uses Writables, which are
tied to Hadoop's Writable serialization framework.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-JavaRDD-as-a-sequence-file-using-spark-java-API-tp7969p7973.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark 1.0.0 java.lang.outOfMemoryError: Java Heap Space

2014-06-17 Thread abhiguruvayya
Try repartitioning the RDD using coalesce(int numPartitions) before performing
any transformations.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-java-lang-outOfMemoryError-Java-Heap-Space-tp7735p7736.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I am creating around 10 executors with 12 cores and 7g of memory each, but when
I launch a job not all executors are being used. For example, if my job has 9
tasks, only 3 executors are used with 3 tasks each, and I believe this
is making my app slower than the MapReduce program for the same use case. Can
anyone throw some light on the executor configuration, if any? How can I use
all the executors? I am running Spark on YARN with Hadoop 2.4.0.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
Can someone help me with this? Any help is appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7753.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I did try creating more partitions by overriding the default number of
partitions determined by the HDFS splits. The problem is that in this case the
program runs forever. I have the same set of inputs for MapReduce and Spark;
where MapReduce takes 2 minutes, Spark takes 5 minutes to complete the job. I
thought my Spark program was running slower than MapReduce because not all of
the executors were being utilized properly. I can provide my code
skeleton for your reference. Please help me with this.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7759.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
I found the main reason to be that I was using coalesce instead of
repartition. coalesce was shrinking the partitioning, so there were too few
tasks to keep all of the executors busy. Can you help me understand
when to use coalesce and when to use repartition? In my
application coalesce runs faster than repartition, which seems
unusual.
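
For reference, a minimal sketch of the difference (the partition counts are hypothetical): coalesce(n) only merges existing partitions and avoids a shuffle, which is why it is usually faster but can leave you with too few tasks; repartition(n) is coalesce with shuffle=true, so it can both increase the partition count and rebalance the data evenly.

import org.apache.spark.api.java.JavaRDD;

public class CoalesceVsRepartition {
  public static void demo(JavaRDD<String> rdd) {
    // coalesce: merges existing partitions with no shuffle;
    // it can only reduce the partition count, never increase it.
    JavaRDD<String> fewer = rdd.coalesce(10);

    // repartition: performs a full shuffle; it can increase the count and
    // rebalances the data, which gives every executor enough tasks to stay busy.
    JavaRDD<String> more = rdd.repartition(120);
  }
}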



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7787.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Executors not utilized properly.

2014-06-17 Thread abhiguruvayya
Perfect!! That makes so much sense to me now. Thanks a ton



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7793.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.