Re: Topic Modelling- LDA

2015-09-23 Thread Sameer Farooqui
Hi Subshri, You may find these 2 blog posts useful: https://databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html https://databricks.com/blog/2015/09/22/large-scale-topic-modeling-improvements-to-lda-on-spark.html On Tue, Sep 22, 2015 at 11:54 PM, Subshiri S

Re: SparkSQL concerning materials

2015-08-21 Thread Sameer Farooqui
Have you seen the Spark SQL paper?: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf On Thu, Aug 20, 2015 at 11:35 PM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: Hi, thanks for answers. I have read the answers you provided, but I'd rather look for some materials on the

Re: Data locality with HDFS not being seen

2015-08-21 Thread Sameer Farooqui
Hi Sunil, Have you seen this change in Spark 1.5 that may fix the locality issue?: https://issues.apache.org/jira/browse/SPARK-4352 On Thu, Aug 20, 2015 at 4:09 AM, Sunil sdhe...@gmail.com wrote: Hello. I am seeing some unexpected issues with achieving HDFS data locality. I expect the

Re: Caching and Actions

2015-04-09 Thread Sameer Farooqui
Your point #1 is a bit misleading: "(1) The mappers are not executed in parallel when processing independently the same RDD." To clarify, I'd say: in one stage of execution, when pipelining occurs, mappers are not executed in parallel when independently processing the same RDD partition. On Thu,
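
A minimal sketch of the pipelining described above (illustrative only, assuming a spark-shell session where sc exists): both map calls run back-to-back inside a single task per partition, and the parallelism comes from the partitions, not from chaining the maps.

    val nums = sc.parallelize(1 to 1000000, 8)   // 8 partitions -> 8 tasks in the stage
    // The two map steps are pipelined into the same task for a given partition.
    val transformed = nums.map(_ * 2).map(_ + 1)
    transformed.count()                          // the action runs the single pipelined stage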

Re: Apache Spark Executor - number of threads

2015-03-17 Thread Sameer Farooqui
Hi Igor Nirandap, There is a setting in Spark called cores or num_cores that you should look into. This number sets the number of threads running in each Executor JVM. The name of the setting is a bit misleading. You don't have to match the num_cores of the Executor to the actual number of CPU cores
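
A hedged sketch of the kind of setting being referred to (the exact property name depends on the cluster manager and Spark version; values here are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cores-demo")             // hypothetical app name
      .set("spark.executor.cores", "8")     // number of task threads per Executor JVM
      .set("spark.task.cpus", "1")          // "cores" each task reserves (a scheduling weight, not a CPU pin)
    val sc = new SparkContext(conf)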

Re: Executor lost with too many temp files

2015-02-23 Thread Sameer Farooqui
Hi Marius, Are you using the sort or hash shuffle? Also, do you have the external shuffle service enabled (so that the Worker JVM or NodeManager can still serve the map spill files after an Executor crashes)? How many partitions are in your RDDs before and after the problematic shuffle
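
For reference, a sketch of how the external shuffle service mentioned above is typically enabled (property names as documented; verify against your Spark version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.manager", "sort")           // sort-based shuffle (vs. "hash")
      .set("spark.shuffle.service.enabled", "true")   // Worker/NodeManager serves the map output files
                                                      // so they survive an Executor crash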

Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
pradhandeep1...@gmail.com wrote: How is task slot different from # of Workers? so don't read into any performance metrics you've collected to extrapolate what may happen at scale. I did not get you in this. Thank You On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com

Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In general you should first figure out how many task slots are in the cluster and then repartition the RDD to maybe 2x that number. So if you have 100 slots, then RDDs with a partition count of 100-300 would be normal. But the size of each partition also matters. You want a task to operate on a
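
An illustrative sketch of that rule of thumb (path and numbers are made up, assuming a spark-shell session):

    val rawRdd = sc.textFile("hdfs:///data/input")   // hypothetical input
    val slots  = sc.defaultParallelism               // rough stand-in for the cluster's total task slots
    val tuned  = rawRdd.repartition(slots * 2)       // ~2x the slot count, per the advice above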

Re: Missing shuffle files

2015-02-22 Thread Sameer Farooqui
Do you guys have dynamic allocation turned on for YARN? Anders, was Task 450 in your job acting like a Reducer and fetching the Map spill output data from a different node? If a Reducer task can't read the remote data it needs, that could cause the stage to fail. Sometimes this forces the

Re: pyspark is crashing in this case. why?

2014-12-15 Thread Sameer Farooqui
TaskSetManager: Stage 1 contains a task of very large size (9766 KB). The maximum recommended task size is 100 KB. [1, 2, 3, 4, 5, 6, 7, 8, 9] On Mon, Dec 15, 2014 at 1:33 PM, Sameer Farooqui same...@databricks.com wrote: Hi Genesis, The 2nd case did work for me: a = [1,2,3,4,5,6,7,8,9
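
Not from the thread, but a generic sketch of the usual way to avoid the "task of very large size" warning quoted above: a large local collection captured in a task closure is serialized into every task, whereas broadcasting it ships one copy per Executor.

    val bigLookup = (1 to 1000000).map(i => i -> i.toString).toMap   // stand-in for large local data
    val bc = sc.broadcast(bigLookup)
    sc.parallelize(1 to 100).map(k => bc.value.getOrElse(k, "missing")).count()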

Re: pyspark is crashing in this case. why?

2014-12-14 Thread Sameer Farooqui
How much executor-memory are you setting for the JVM? What about the Driver JVM memory? Also check the Windows Event Log for Out of memory errors for one of the 2 above JVMs. On Dec 14, 2014 6:04 AM, genesis fatum genesis.fa...@gmail.com wrote: Hi, My environment is: standalone spark 1.1.1 on
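
A sketch of where those two heap sizes are usually set (values are illustrative; the driver's heap must be fixed before its JVM starts, so use a config file or launch flag rather than code):

    # conf/spark-defaults.conf -- illustrative values only
    spark.executor.memory   2g
    spark.driver.memory     2g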

Re: how to convert an rdd to a single output file

2014-12-12 Thread Sameer Farooqui
Instead of doing this on the compute side, I would just write out the file with different blocks initially into HDFS and then use hadoop fs -getmerge or HDFSConcat to get one final output file. - SF On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis lordjoe2...@gmail.com wrote: I have an RDD
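
A sketch of that approach (paths are hypothetical): write the RDD normally as multiple part files, then merge on the HDFS side.

    val results = sc.parallelize(Seq("line1", "line2", "line3"))   // stand-in for the real RDD
    results.saveAsTextFile("hdfs:///user/demo/output")             // writes part-00000, part-00001, ...
    // Then merge the part files outside of Spark, e.g.:
    //   hadoop fs -getmerge /user/demo/output /local/path/merged.txt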

Re: resource allocation spark on yarn

2014-12-12 Thread Sameer Farooqui
Hi, FYI - There are no Worker JVMs used when Spark is launched under YARN. Instead the NodeManager in YARN does what the Worker JVM does in Spark Standalone mode. For YARN you'll want to look into the following settings: --num-executors: controls how many executors will be allocated
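
An illustrative spark-submit invocation using those flags (class name, jar, and numbers are placeholders to be tuned for the cluster):

    spark-submit \
      --master yarn-cluster \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 4g \
      --class com.example.MyApp myapp.jar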

Re: broadcast: OutOfMemoryError

2014-12-11 Thread Sameer Farooqui
Is the OOM happening to the Driver JVM or one of the Executor JVMs? What memory size is each JVM? How large is the data you're trying to broadcast? If it's large enough, you may want to consider just persisting the data to distributed storage (like HDFS) and read it in through the normal read RDD
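
A rough sketch of the trade-off described (assuming a spark-shell session; the HDFS path is hypothetical): broadcast only what comfortably fits in each JVM, and otherwise join against an RDD read from distributed storage.

    // Small lookup data: broadcasting is fine (one copy per Executor).
    val small = sc.broadcast(Map(1 -> "a", 2 -> "b"))

    // Data too big to broadcast: keep it in HDFS, read it back as an RDD,
    // and use a join instead of a lookup inside a closure.
    val big = sc.textFile("hdfs:///lookup/big.tsv").map { line =>
      val Array(k, v) = line.split("\t"); (k.toInt, v)
    }
    val events = sc.parallelize(Seq((1, "click"), (2, "view")))
    val joined = events.join(big)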

Re: spark-submit on YARN is slow

2014-12-05 Thread Sameer Farooqui
Just an FYI - I can submit the SparkPi app to YARN in cluster mode on a 1-node m3.xlarge EC2 instance and the app finishes running successfully in about 40 seconds. I just figured the 30 - 40 sec run time was normal b/c of the submitting overhead that Andrew mentioned. Denny, you can

Re: Java RDD Union

2014-12-05 Thread Sameer Farooqui
Hi Ron, Out of curiosity, why do you think that union is modifying an existing RDD in place? In general all transformations, including union, will create new RDDs, not modify old RDDs in place. Here's a quick test: scala val firstRDD = sc.parallelize(1 to 5) firstRDD:
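
The quoted test, fleshed out as a spark-shell sketch showing that union returns a new RDD rather than mutating its inputs:

    val firstRDD  = sc.parallelize(1 to 5)
    val secondRDD = sc.parallelize(6 to 10)
    val unioned   = firstRDD.union(secondRDD)   // a brand-new RDD; the inputs are untouched
    println(firstRDD.count())                   // still 5
    println(unioned.count())                    // 10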

Re: Any ideas why a few tasks would stall

2014-12-04 Thread Sameer Farooqui
% On Tue, Dec 2, 2014 at 3:43 PM, Sameer Farooqui same...@databricks.com wrote: Have you tried taking thread dumps via the UI? There is a link to do so on the Executors' page (typically at http://<driver IP>:4040/executors). By visualizing the thread call stack of the executors with slow running

Re: Necessity for rdd replication.

2014-12-04 Thread Sameer Farooqui
In general, most use cases don't need the RDD to be replicated in memory multiple times. It would be a rare exception to do this. If it's really expensive (time consuming) to recompute a lost partition or if the use case is extremely time sensitive, then maybe you could replicate it in memory.
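
For the rare case described, replication is requested through the storage level, e.g. (a sketch with a hypothetical path):

    import org.apache.spark.storage.StorageLevel

    val expensive = sc.textFile("hdfs:///data/raw").map(_.split(",").map(_.trim))  // stand-in for a costly computation
    expensive.persist(StorageLevel.MEMORY_ONLY_2)   // the _2 levels keep each partition on two nodes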

Re: Monitoring Spark

2014-12-04 Thread Sameer Farooqui
Are you running Spark in Local or Standalone mode? In either mode, you should be able to hit port 4040 (to see the Spark Jobs/Stages/Storage/Executors UI) on the machine where the driver is running. However, in local mode, you won't have a Spark Master UI on 7080 or a Worker UI on 7081. You can

Re: Any ideas why a few tasks would stall

2014-12-02 Thread Sameer Farooqui
Have you tried taking thread dumps via the UI? There is a link to do so on the Executors' page (typically at http://<driver IP>:4040/executors). By visualizing the thread call stack of the executors with slow running tasks, you can see exactly what code is executing at an instant in time. If you

Re: Spark setup on local windows machine

2014-11-25 Thread Sameer Farooqui
Hi Sunita, This gitbook may also be useful for you to get Spark running in local mode on your Windows machine: http://blueplastic.gitbooks.io/how-to-light-your-spark-on-a-stick/content/ On Tue, Nov 25, 2014 at 11:09 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You could try following this

Re: Doing RDD.count in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
Hi Shahab, Are you running Spark in Local, Standalone, YARN or Mesos mode? If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors. When you run a .count() on an RDD, it is actually distributing tasks to different

Re: Doing RDD.count in parallel, or at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui
By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it as that could make a big speed improvement. On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com wrote: Hi Shahab, Are you running Spark in Local, Standalone, YARN
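
A minimal sketch of that advice (hypothetical path, assuming a spark-shell session):

    val logs = sc.textFile("hdfs:///logs/2014/10/*")
    logs.cache()    // marks the RDD for caching; nothing is materialized yet
    logs.count()    // first action: reads the files and populates the cache
    logs.count()    // later actions read the cached partitions and run much faster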

Re: SparkContext UI

2014-10-30 Thread Sameer Farooqui
Hey Stuart, The RDD won't show up under the Storage tab in the UI until it's been cached. Basically Spark doesn't know what the RDD will look like until it's cached, b/c up until then the RDD is just on disk (external to Spark). If you launch some transformations + an action on an RDD that is

Re: SparkContext UI

2014-10-30 Thread Sameer Farooqui
the trigger on my original email. I should have added that I've tried using persist() and cache() but no joy. I'm doing this: data = sc.textFile(somedata) data.cache data.count() but I still can't see anything in the storage? On 31 October 2014 10:42, Sameer Farooqui same...@databricks.com
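
One thing worth checking if that snippet is PySpark: data.cache without parentheses only references the method and never marks the RDD for caching, which would leave the Storage tab empty. A sketch of the working sequence (shown in Scala for consistency with the other examples; path hypothetical):

    val data = sc.textFile("hdfs:///somedata")
    data.cache()    // mark for caching -- note the parentheses
    data.count()    // an action has to run before anything shows up
    // After the count completes, the RDD should be listed at http://<driver>:4040/storage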

Re: spark-submit memory too larger

2014-10-24 Thread Sameer Farooqui
That does seem a bit odd. How many Executors are running under this Driver? Does the spark-submit process start out using ~60GB of memory right away or does it start out smaller and slowly build up to that high? If so, how long does it take to get that high? Also, which version of Spark are you

Re: Spark Streaming Applications

2014-10-22 Thread Sameer Farooqui
Hi Saiph, Patrick McFadin and Helena Edelson from DataStax taught a tutorial at NYC Strata last week where they created a prototype Spark Streaming + Kafka application for time series data. You can see the code here: https://github.com/killrweather/killrweather On Tue, Oct 21, 2014 at 4:33 PM,

Re: Setting only master heap

2014-10-22 Thread Sameer Farooqui
Hi Keith, Would be helpful if you could post the error message. Are you running Spark in Standalone mode or with YARN? In general, the Spark Master is only used for scheduling and it should be fine with the default setting of 512 MB RAM. Is it actually the Spark Driver's memory that you
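
If it really is the standalone Master daemon's heap that needs changing, the usual knob is in conf/spark-env.sh (a sketch; note it applies to the Worker daemons as well):

    # conf/spark-env.sh -- illustrative value
    SPARK_DAEMON_MEMORY=1g   # heap for the standalone Master and Worker daemon JVMs (default 512m)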

Re: spark ui redirecting to port 8100

2014-10-21 Thread Sameer Farooqui
Hi Sadhan, Which port are you specifically trying to redirect? The driver program has a web UI, typically on port 4040... or the Spark Standalone Master has a web UI exposed on port 8080 by default (7077 is the master's cluster port, not a web UI). Which setting did you update in which file to make this change? And finally, which version of Spark are

Re: Spark Streaming - How to write RDD's in same directory ?

2014-10-21 Thread Sameer Farooqui
Hi Shailesh, Spark just leverages the Hadoop File Output Format to write out the RDD you are saving. This is really a Hadoop OutputFormat limitation, which requires that the output directory not already exist. The idea is that a Hadoop job should not be able to overwrite the results from a
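
A common workaround consistent with that limitation (not from the original thread; the source, paths, and batch interval are illustrative) is to write each batch into its own timestamped directory:

    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
    lines.foreachRDD { (rdd, time: Time) =>
      rdd.saveAsTextFile(s"hdfs:///streaming/out/batch-${time.milliseconds}")  // fresh dir per batch
    }
    // DStream.saveAsTextFiles("hdfs:///streaming/out/batch", "txt") achieves the same
    // with a generated per-batch suffix.
    ssc.start()
    ssc.awaitTermination()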