RE: Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

2015-04-08 Thread Puneet Kumar Ojha
Thanks

From: Nick Pentreath [mailto:nick.pentre...@gmail.com]
Sent: Tuesday, April 07, 2015 5:52 PM
To: Puneet Kumar Ojha
Cc: user@spark.apache.org
Subject: Re: Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

There is no difference - textFile calls hadoopFile with a TextInputFormat, and 
maps each value to a String.
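
For reference, a minimal Scala sketch of that equivalence (assuming the Spark 1.x API; the path and variable names here are just illustrative):

import org.apache.spark.SparkContext
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

def readBothWays(sc: SparkContext, path: String): Unit = {
  // Convenience API: returns an RDD[String] of lines
  val viaTextFile = sc.textFile(path)

  // Roughly what textFile does under the hood: keys are byte offsets, values are lines
  val viaHadoopFile = sc
    .hadoopFile[LongWritable, Text, TextInputFormat](path)
    .map { case (_, line) => line.toString }

  // Both RDDs contain the same lines
  println(viaTextFile.count() == viaHadoopFile.count())
}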

—
Sent from Mailbox (https://www.dropbox.com/mailbox)


On Tue, Apr 7, 2015 at 1:46 PM, Puneet Kumar Ojha 
puneet.ku...@pubmatic.com wrote:
Hi,


Is there any difference between textFile vs hadoopFile (TextInputFormat) when 
the data is present in HDFS? Will there be any performance gain that can be 
observed?


Puneet Kumar Ojha
Data Architect | PubMatic (http://www.pubmatic.com/)





Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

2015-04-07 Thread Puneet Kumar Ojha
Hi,

Is there any difference between textFile vs hadoopFile (TextInputFormat) when 
the data is present in HDFS? Will there be any performance gain that can be 
observed?

Puneet Kumar Ojha
Data Architect | PubMatic (http://www.pubmatic.com/)



Spark Web UI Doesn't Open in Yarn-Client Mode

2015-02-14 Thread Puneet Kumar Ojha
Hi,

I am running a 3-node Spark cluster on EMR. While running a job I see only 1 
executor running. Does that mean only one of the nodes is being used? (That is 
what the Spark documentation seems to suggest about the default mode, local.)

When I switch to yarn-client mode, the Spark Web UI doesn't open. How can I view 
the running job's details now?





RE: Tuning number of partitions per CPU

2015-02-13 Thread Puneet Kumar Ojha
Use the configuration below if you are using version 1.2:

SET spark.shuffle.consolidateFiles=true;
SET spark.rdd.compress=true;
SET spark.default.parallelism=1000;
SET spark.deploy.defaultCores=54;
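
For reference, a minimal sketch of setting the same properties programmatically (assuming the Spark 1.2 Scala API; the app name is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Same settings as the SET statements above, applied via SparkConf
val conf = new SparkConf()
  .setAppName("tuned-job")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.rdd.compress", "true")
  .set("spark.default.parallelism", "1000")
  .set("spark.deploy.defaultCores", "54")
val sc = new SparkContext(conf)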

Thanks
Puneet.

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Friday, February 13, 2015 4:46 PM
To: Igor Petrov
Cc: user@spark.apache.org
Subject: Re: Tuning number of partitions per CPU

18 cores or 36? It probably doesn't matter.
For this case, where there is some per-partition overhead to set up the DB 
connection, it may indeed not help to chop up the data more finely than your 
total parallelism, although that would imply quite a lot of overhead. Are you 
doing any other expensive initialization per partition in your code?
You might also check some other basic things: are you bottlenecked on the DB 
(probably not), and are there task stragglers drawing out the completion time?
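
As an illustration of that per-partition connection cost, a minimal JdbcRDD sketch (the connection string, table, and bounds below are hypothetical, not taken from this thread):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// sc is an existing SparkContext.
// JdbcRDD opens one DB connection per partition, so numPartitions directly
// controls how many times the connection setup cost is paid.
val entities = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass"),
  "SELECT id, payload FROM entities WHERE id >= ? AND id <= ?",
  lowerBound = 1L,
  upperBound = 76000000L,
  numPartitions = 36,
  mapRow = (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

entities.cache()
entities.count()  // forces loading, as in the measurements quoted below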

On Fri, Feb 13, 2015 at 11:06 AM, Igor Petrov igorpetrov...@gmail.com wrote:
 Hello,

 In Spark programming guide
 (http://spark.apache.org/docs/1.2.0/programming-guide.html) there is a
 recommendation:
 Typically you want 2-4 partitions for each CPU in your cluster.

 We have a Spark Master and two Spark workers each with 18 cores and 18 
 GB of RAM.
 In our application we use JdbcRDD to load data from a DB and then cache it.
 We load entities from a single table; we currently have 76 million entities 
 (entity size in memory is about 160 bytes). We call count() during application 
 startup to force loading of the entities. Here are our measurements for the 
 count() operation (cores x partitions = time):
 36x36 = 6.5 min
 36x72 = 7.7 min
 36x108 = 9.4 min

 So despite the recommendation, the most efficient setup is one partition 
 per core. What is the reason for the above recommendation?

 Java 8, Apache Spark 1.1.0




 --
 View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Tuning-number-of-partitions-per-CPU-tp21642.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


