Re:

2015-06-25 Thread Akhil Das
Look in the tuning section https://spark.apache.org/docs/latest/tuning.html; also you need to figure out what's taking time and where your bottleneck is, etc. If everything is tuned properly, then you will need to throw more cores :) Thanks Best Regards On Thu, Jun 25, 2015 at 12:19 AM, ÐΞ€ρ@Ҝ

Re: bugs in Spark PageRank implementation

2015-06-25 Thread Sean Owen
#2 is not a bug. Have a search through JIRA. It is merely unformalized. I think that is how (one of?) the original PageRank papers does it. On Thu, Jun 25, 2015, 7:39 AM Kelly, Terence P (HP Labs Researcher) terence.p.ke...@hp.com wrote: Hi, Colleagues and I have found that the PageRank

Re: Scala/Python or Java

2015-06-25 Thread Ted Yu
The answer depends on the user's experience with these languages as well as the most commonly used language in the production environment. Learning Scala requires some time. If you're very comfortable with Java / Python, you can go with that while at the same time familiarizing yourself with

sparkR could not find function textFile

2015-06-25 Thread Wei Zhou
Hi all, I am exploring sparkR by activating the shell and following the tutorial here https://amplab-extras.github.io/SparkR-pkg/ And when I tried to read in a local file with textFile(sc, file_location), it gives an error could not find function textFile. By reading through sparkR doc for 1.4,

sql dataframe internal representation

2015-06-25 Thread Koert Kuipers
i noticed in DataFrame that to get the rdd out of it some conversions are done: val converter = CatalystTypeConverters.createToScalaConverter(schema) rows.map(converter(_).asInstanceOf[Row]) does this mean DataFrame internally does not use the standard scala types? why not?

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-25 Thread Ayman Farahat
was there any resolution to that problem? I am also having that with Pyspark 1.4 380 Million observations 100 factors and 5 iterations Thanks Ayman On Jun 23, 2015, at 6:20 PM, Xiangrui Meng men...@gmail.com wrote: It shouldn't be hard to handle 1 billion ratings in 1.3. Just need more

SparkSQL - understanding Cross Joins

2015-06-25 Thread Night Wolf
Hi guys, I'm trying to do a cross join (cartesian product) with 3 tables stored as parquet. Each table has 1 column, a long key. Table A has 60,000 keys with 1000 partitions, Table B has 1000 keys with 1 partition, Table C has 4 keys with 1 partition. The output should be 240 million row

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
Hi, This note only speaks of Spark 1.2, is only applicable to Spark on Windows, and it's not possible to use the Thrift server, so I was looking for a better way to have Spark on Azure. Thanks, Daniel On 26 June 2015, at 01:38, Jacob Kim jac...@microsoft.com wrote: Below is the link for step

ALS :how to set numUserBlocks and numItemBlocks

2015-06-25 Thread afarahat
any guidance how to set these 2? I have way more users (100s of millions than items) Thanks Ayman -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-how-to-set-numUserBlocks-and-numItemBlocks-tp23503.html Sent from the Apache Spark User List mailing list
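A minimal sketch, in Scala, of setting the two block counts explicitly on the spark.ml ALS estimator; all values and column names below are illustrative placeholders, not a recommendation for this data set:

    import org.apache.spark.ml.recommendation.ALS

    // With far more users than items, more user blocks than item blocks is the usual shape.
    val als = new ALS()
      .setRank(100)
      .setMaxIter(5)
      .setNumUserBlocks(200)   // illustrative value
      .setNumItemBlocks(10)    // illustrative value
      .setUserCol("userId")    // placeholder column names
      .setItemCol("itemId")
      .setRatingCol("rating")
    // val model = als.fit(ratingsDF)   // ratingsDF assumed to exist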

Re: Killing Long running tasks (stragglers)

2015-06-25 Thread William Ferrell
Thank you for your reply Akhil! Here is an example of the script that we are using: https://gist.github.com/wasauce/40f3350c1a110e5cef1c Any pointers would be very helpful. Best, - Bill On Thu, Jun 25, 2015 at 2:03 AM, Akhil Das ak...@sigmoidanalytics.com wrote: That totally depends on the

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
Then how is the performance of mapPartitions faster than map? On Thu, Jun 25, 2015 at 6:40 PM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Spark creates a RecordReader and uses next() on it when you call input.next(). (See

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Sean Owen
Hm that looks like a Parquet version mismatch then. I think Spark 1.4 uses 1.6? You might well get away with 1.6 here anyway. On Thu, Jun 25, 2015 at 3:13 PM, Aaron aarongm...@gmail.com wrote: Sorry about not supplying the error.. that would make things helpful you'd think :) [INFO]

Re: java.lang.OutOfMemoryError: PermGen space

2015-06-25 Thread Roberto Coluccio
Glad it worked! Actually I got similar issues even with Spark Streaming v1.2.x based drivers. Think also that the default config in Spark on EMR is 512m ! Roberto On Thu, Jun 25, 2015 at 1:20 AM, Srikanth srikanth...@gmail.com wrote: That worked. Thanks! I wonder what changed in 1.4 to

Re: Debugging Apache Spark clustered application from Eclipse

2015-06-25 Thread Yana Kadiyska
Pass that debug string to your executor like this: --conf spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address= 7761. When your executor is launched it will send debug information on port 7761. When you attach the Eclipse debugger, you need to have the
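The same flag can also be set programmatically when building the conf; a small sketch assuming the port from the message above:

    import org.apache.spark.SparkConf

    // Attach a JDWP agent to every executor JVM (port 7761 as in the example above).
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761")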

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
Hi there, Parallelize is part of the RDD API which was made private for Spark v. 1.4.0. Some functions in the RDD API were considered too low-level to expose, so only most of the DataFrame API is currently public. The original rationale for this decision can be found on the issue's JIRA [1]. The

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Aaron
Yep! That was it. Using the <parquet.version>1.6.0rc3</parquet.version> that comes with Spark, rather than using the 1.5.0-cdh5.4.2 version. Thanks for the help! Cheers, Aaron On Thu, Jun 25, 2015 at 8:24 AM, Sean Owen so...@cloudera.com wrote: Hm that looks like a Parquet version

Re: map vs mapPartitions

2015-06-25 Thread Hao Ren
It's not the number of executors that matters, but the # of the CPU cores of your cluster. Each partition will be loaded on a core for computing. e.g. A cluster of 3 nodes has 24 cores, and you divide the RDD in 24 partitions (24 tasks for narrow dependency). Then all the 24 partitions will be

How to create correct data frame for classification in Spark ML?

2015-06-25 Thread dusan
Hi, I am trying to run random forest classification by using Spark ML api but I am having issues with creating right data frame input into pipeline. Here is sample data: age,hours_per_week,education,sex,salaryRange 38,40,hs-grad,male,A 28,40,bachelors,female,A 52,45,hs-grad,male,B
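A rough sketch of one way to build that input with the spark.ml feature transformers; column names are taken from the sample data above, and the numeric columns are assumed to already be parsed as doubles:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Index the label and the categorical features, then assemble a numeric feature vector.
    val labelIndexer = new StringIndexer().setInputCol("salaryRange").setOutputCol("label")
    val eduIndexer   = new StringIndexer().setInputCol("education").setOutputCol("educationIdx")
    val sexIndexer   = new StringIndexer().setInputCol("sex").setOutputCol("sexIdx")
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "hours_per_week", "educationIdx", "sexIdx"))
      .setOutputCol("features")
    val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
    val pipeline = new Pipeline().setStages(Array(labelIndexer, eduIndexer, sexIndexer, assembler, rf))
    // val model = pipeline.fit(trainingDF)   // trainingDF is the parsed DataFrame, assumed to exist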

Re: Compiling Spark 1.4 (and/or Spark 1.4.1-rc1) with CDH 5.4.1/2

2015-06-25 Thread Aaron
Sorry about not supplying the error.. that would make things helpful you'd think :) [INFO] [INFO] Building Spark Project SQL 1.4.1 [INFO] [INFO]

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
Say the source is HDFS, and the file is divided into 10 partitions. So what will the input contain? public Iterable<Integer> call(Iterator<String> input) Say I have 10 executors in the job, each having a single partition. Will it have some part of the partition or the complete one? And if some, when I call input.next() - it

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Eskilson,Aleksander
The simple answer is that SparkR does support map/reduce operations over RDD’s through the RDD API, but since Spark v 1.4.0, those functions were made private in SparkR. They can still be accessed by prepending the function with the namespace, like SparkR:::lapply(rdd, func). It was thought

Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-25 Thread Roman Sokolov
Hello! I am trying to compute the number of triangles with GraphX, but I get a memory error or heap-size error, even though the dataset is very small (1Gb). I run the code in spark-shell on a 16Gb RAM machine (also tried with 2 workers on separate machines, 8Gb RAM each). So I have 15x more memory than the

Re: map vs mapPartitions

2015-06-25 Thread Daniel Darabos
Spark creates a RecordReader and uses next() on it when you call input.next(). (See https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L215) How the RecordReader works is an HDFS question, but it's safe to say there is no difference between using

Re: Spark 1.4.0, Secure YARN Cluster, Application Master throws 500 connection refused (Resolved)

2015-06-25 Thread Nachiketa
Setting the yarn.resourcemanager.webapp.address.rm1 and yarn.resourcemanager.webapp.address.rm2 in yarn-site.xml seems to have resolved the issue. Appreciate any comments about the regression from 1.3.1 ? Thanks. Regards, Nachiketa On Fri, Jun 26, 2015 at 1:28 AM, Nachiketa

Re: Scala/Python or Java

2015-06-25 Thread Saurabh Agrawal
Greetings, Even I am a beginner and currently learning Spark. I found Python + Spark combination to be easiest to learn given my past experience with Python, but yes, it depends on the user. Here is some reference documentation: https://spark.apache.org/docs/latest/programming-guide.html

Re: sparkR could not find function textFile

2015-06-25 Thread Eskilson,Aleksander
Hi there, The tutorial you're reading there was written before the merge of SparkR for Spark 1.4.0. For the merge, the RDD API (which includes the textFile() function) was made private, as the devs felt many of its functions were too low level. They focused instead on finishing the DataFrame

Re: Parquet problems

2015-06-25 Thread Anders Arpteg
Yes, both the driver and the executors. Works a little bit better with more space, but still a leak that will cause failure after a number of reads. There are about 700 different data sources that need to be loaded, lots of data... Thu 25 Jun 2015 08:02 Sabarish Sasidharan

Re: Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-25 Thread Steve Loughran
you are using a guava version on the classpath which your version of Hadoop can't handle. try a version 15 or build spark against Hadoop 2.7.0 On 24 Jun 2015, at 19:03, maxdml max...@cs.duke.edu wrote: Exception in thread main java.lang.NoSuchMethodError:

Re: Loss of data due to congestion

2015-06-25 Thread ayan guha
Then you should see checkpointing ( https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing ) On Thu, Jun 25, 2015 at 3:33 PM, anshu shukla anshushuk...@gmail.com wrote: Thaks, I am talking about streaming. On 25 Jun 2015 5:37 am, ayan guha guha.a...@gmail.com
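For reference, a minimal sketch of enabling checkpointing on a streaming context; the app name, batch interval, and directory are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("checkpoint-sketch"), Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/myapp")   // metadata and data checkpoints are written here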

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Marcelo Vanzin
Please take a look at the pull request with the actual fix; that will explain why it's the same issue. On Thu, Jun 25, 2015 at 12:51 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Thanks Marcelo. But my case is different. My mypython/libs/numpy-1.9.2.zip is in *local directory* (can also

Re: Scala/Python or Java

2015-06-25 Thread ayan guha
I am a Python fan so I use Python. But I noticed some features are typically 1-2 releases behind for Python. So I strongly agree with Ted: start with the language you are most familiar with and plan to move to Scala eventually. On 26 Jun 2015 06:07, Ted Yu yuzhih...@gmail.com wrote: The

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Elkhan Dadashov
Thanks Marcelo. But my case is different. My mypython/libs/numpy-1.9.2.zip is in *local directory* (can also put in HDFS), but still fails. But SPARK-5479 https://issues.apache.org/jira/browse/SPARK-5479 is : PySpark on yarn mode need to support *non-local* python files. The job fails only when

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Naveen Madhire
Hi Marcelo, Quick question. I am using Spark 1.3 and using Yarn Client mode. It is working well, except that I have to manually pip-install all the 3rd party libraries like numpy etc. on the executor nodes. So will the SPARK-5479 fix in 1.5 which you mentioned fix this as well? Thanks. On Thu, Jun

Re: Spark 1.4.0, Secure YARN Cluster, Application Master throws 500 connection refused

2015-06-25 Thread Nachiketa
A few other observations. 1. Spark 1.3.1 (custom built against HDP 2.2) was running fine against the same cluster and same hadoop configuration (hence seems like regression). 2. HA is enabled for YARN RM and HDFS (not sure if this would impact anything but wanted to share anyway). 3. Found this

Re: sparkR could not find function textFile

2015-06-25 Thread Wei Zhou
Hi Alek, Thanks for the explanation, it is very helpful. Cheers, Wei 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander alek.eskil...@cerner.com: Hi there, The tutorial you're reading there was written before the merge of SparkR for Spark 1.4.0. For the merge, the RDD API (which includes the

Re: sparkR could not find function textFile

2015-06-25 Thread Wei Zhou
Hi Alek, Just a follow-up question. This is what I did in the sparkR shell: lines <- SparkR:::textFile(sc, "./README.md") head(lines) And I am getting the error: Error in x[seq_len(n)] : object of type 'S4' is not subsettable. I'm wondering what I did wrong. Thanks in advance. Wei 2015-06-25 13:44

Re: sparkR could not find function textFile

2015-06-25 Thread Shivaram Venkataraman
The `head` function is not supported for the RRDD that is returned by `textFile`. You can run `take(lines, 5L)`. I should add a warning here that the RDD API in SparkR is private because we might not support it in the upcoming releases. So if you can use the DataFrame API for your application you

Re: sparkR could not find function textFile

2015-06-25 Thread Eskilson,Aleksander
Yeah, that's probably because the head() you're invoking there is defined for SparkR DataFrames [1] (note how you don't have to use the SparkR::: namespace in front of it), but SparkR:::textFile() returns an RDD object, which is more like a distributed list data structure the way you're

Re: Scala/Python or Java

2015-06-25 Thread spark user
Spark is based on Scala and it is written in Scala. To debug and fix issues, I guess learning Scala is good for the long term? Any advice? On Thursday, June 25, 2015 1:26 PM, ayan guha guha.a...@gmail.com wrote: I am a Python fan so I use Python. But I noticed some features are

Re: sparkR could not find function textFile

2015-06-25 Thread Wei Zhou
Thanks to both Shivaram and Alek. Then if I want to create a DataFrame from comma-separated flat files, what would you recommend I do? One way I can think of is first reading the data as you would do in R, using read.table(), and then creating a Spark DataFrame out of that R data frame, but it is

Re: sparkR could not find function textFile

2015-06-25 Thread Shivaram Venkataraman
You can use the Spark CSV reader to do read in flat CSV files to a data frame. See https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85 for an example Shivaram On Thu, Jun 25, 2015 at 2:15 PM, Wei Zhou zhweisop...@gmail.com wrote: Thanks to both Shivaram and Alek. Then if I want to create

Re: sparkR could not find function textFile

2015-06-25 Thread Eskilson,Aleksander
Sure, I had a similar question that Shivaram was able to answer quickly for me; the solution is implemented using a separate Databricks library. Check out this thread from the email archives [1], and the read.df() command [2]. CSV files can be a bit tricky, especially with inferring their schemas. Are you

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to read chunk

2015-06-25 Thread Josh Rosen
Which Spark version are you using? AFAIK the corruption bugs in sort-based shuffle should have been fixed in newer Spark releases. On Wed, Jun 24, 2015 at 12:25 PM, Piero Cinquegrana pcinquegr...@marketshare.com wrote: Switching spark.shuffle.manager from sort to hash fixed this issue as

Re: Spark Meetup Istanbul

2015-06-25 Thread ayan guha
BTW is there an active Spark community around Melbourne? Kindly ping me if any enthusiast wants to partner with me to create one... On 26 Jun 2015 00:17, Şafak Serdar Kapçı sska...@gmail.com wrote: Hello, I created a Meetup and LinkedIn group in Istanbul. If it is possible, can you add it to the list as

Re: map vs mapPartitions

2015-06-25 Thread Shushant Arora
Yes, 1 partition per core, and mapPartitions applies the function to each partition. The question is: does the complete partition load into memory so that the function can be applied to it, or is it an iterator where iterator.next() loads the next record? And if so, how is it more efficient than map, which also works on 1

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Felix C
Thanks! It's good to know --- Original Message --- From: Eskilson,Aleksander alek.eskil...@cerner.com Sent: June 25, 2015 5:57 AM To: Felix C felixcheun...@hotmail.com, user@spark.apache.org Subject: Re: SparkR parallelize not found with 1.4.1? Hi there, Parallelize is part of the RDD API

Re: Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-25 Thread Max Demoulin
I see, thank you! -- Henri Maxime Demoulin 2015-06-25 5:54 GMT-04:00 Steve Loughran ste...@hortonworks.com: you are using a guava version on the classpath which your version of Hadoop can't handle. try a version 15 or build spark against Hadoop 2.7.0 On 24 Jun 2015, at 19:03, maxdml

Re: Spark Meetup Istanbul

2015-06-25 Thread Paco Nathan
Hi Ayan, Yes, there is -- quite active Check the Spark global events listing to see about meetups and other Spark-related talks in Melbourne: https://docs.google.com/spreadsheets/d/1HKb_uwpQOOtBihRH8nBhgOHrsuy1nsGNlKwG32_qA3Y/edit#gid=0 ...and many other locations :) Paco On Thu, Jun 25, 2015

Spark Meetup Istanbul

2015-06-25 Thread Şafak Serdar Kapçı
Hello, I created a Meetup and LinkedIn group in Istanbul. If it is possible, can you add it to the list as the Istanbul Meetup? There is no official Meetup in Istanbul. I am a full-time developer, edX student, and Spark learner. I am taking both courses: BerkeleyX: CS100.1x Introduction to Big Data with

Re: SparkR parallelize not found with 1.4.1?

2015-06-25 Thread Eskilson,Aleksander
I forgot to mention that if you need to access these functions for some reason, you can prepend the function call with the SparkR private namespace, like so, SparkR:::lapply(rdd, func). On 6/25/15, 9:30 AM, Felix C felixcheun...@hotmail.com wrote: Thanks! It's good to know --- Original Message

Re: NaiveBayes for MLPipeline is absent

2015-06-25 Thread Xiangrui Meng
FYI, I made a JIRA for this: https://issues.apache.org/jira/browse/SPARK-8600. -Xiangrui On Fri, Jun 19, 2015 at 3:01 PM, Xiangrui Meng men...@gmail.com wrote: Hi Justin, We plan to add it in 1.5, along with some other estimators. We are now preparing a list of JIRAs, but feel free to create

Re: mllib from sparkR

2015-06-25 Thread Shivaram Venkataraman
Not yet - We are working on it as a part of https://issues.apache.org/jira/browse/SPARK-6805 and you can follow the JIRA for more information On Wed, Jun 24, 2015 at 2:30 AM, escardovi escard...@bitbang.com wrote: Hi, I was wondering if it is possible to use MLlib function inside SparkR, as

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
Also, I've noticed that .map() actually creates a MapPartitionsRDD under the hood. So I think the real difference is just in the API that's being exposed. You can do a map() and not have to think about the partitions at all, or you can do a .mapPartitions() and be able to do things like chunking
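A small Scala sketch of the difference as it is typically used; the input path and the per-partition setup step are illustrative assumptions:

    val lines = sc.textFile("hdfs:///some/input")   // placeholder path; sc is the shell's SparkContext

    // map: the function sees one record at a time.
    val lengths = lines.map(_.length)

    // mapPartitions: the function sees an Iterator over the whole partition, so per-partition
    // setup (a connection, a parser, a buffer) is paid once; records are still streamed through
    // the iterator rather than materialized all at once.
    val lengths2 = lines.mapPartitions { iter =>
      // expensive setup would go here, once per partition
      iter.map(_.length)
    }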

RE: Performing sc.parallelize (..) in workers not in the driver program

2015-06-25 Thread Ganelin, Ilya
The parallelize operation accepts as input a data structure in memory. When you call it, you are necessarily operating in the memory space of the driver since that is where user code executes. Until you have an RDD, you can't really operate in a distributed way. If your files are stored in a
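A short sketch of the distinction; the collection size and path are placeholders:

    // parallelize distributes a collection that already lives in the driver's memory...
    val fromDriver = sc.parallelize(1 to 1000000)

    // ...whereas textFile (and the other input methods) let the workers read the data
    // directly, so nothing has to be materialized on the driver first.
    val fromWorkers = sc.textFile("hdfs:///data/events")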

Re: Exception in thread main java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.elapsedMillis()J

2015-06-25 Thread Max Demoulin
Can I actually include another version of guava in the classpath when launching the example through spark submit? -- Henri Maxime Demoulin 2015-06-25 10:57 GMT-04:00 Max Demoulin max...@cs.duke.edu: I see, thank you! -- Henri Maxime Demoulin 2015-06-25 5:54 GMT-04:00 Steve Loughran

Re: How to get the memory usage infomation of a spark application

2015-06-25 Thread maxdml
You can see the amount of memory consumed by each executor in the web ui (go to the application page, and click on the executor tab). Otherwise, for a finer grained monitoring, I can only think of correlating a system monitoring tool like Ganglia, with the event timeline of your job. -- View

Re: assign unique ID (Long Value) to each line in RDD

2015-06-25 Thread neel choudhury
Hi Ravi, you can do one thing. You can create an RDD with the edges and then do zipWithIndex. Let a = sc.parallelize(['9:8','1:2','1:2','3,5']); then a.zipWithIndex().collect() gives [('9:8', 0), ('1:2', 1), ('1:2', 2), ('3,5', 3)]. Let me know if you have any other queries. On Thu, Jun 25, 2015 at
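The Scala equivalent, sketched for a file of edge lines; the file name is a placeholder:

    val edges = sc.textFile("edges.txt")      // one "source sink" pair per line
    val withId = edges.zipWithIndex()         // RDD[(String, Long)]; the index is stable per line
    // zipWithUniqueId() is an alternative that avoids the extra pass zipWithIndex needs to
    // compute per-partition offsets, at the cost of non-contiguous ids.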

Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-25 Thread Bin Wang
I am trying to run the Spark example code HBaseTest from the command line using spark-submit instead of run-example, so that I can learn more about how to run Spark code in general. However, it told me CLASS_NOT_FOUND about htrace since I am using CDH5.4. I successfully located the htrace jar file but I

Re: spark1.4 sparkR usage

2015-06-25 Thread Shivaram Venkataraman
The Apache Spark API docs for SparkR https://spark.apache.org/docs/1.4.0/api/R/index.html represent what has been released with Spark 1.4. The AMPLab version is no longer under active development and I'd recommend users to use the version in the Apache project. Thanks Shivaram On Thu, Jun 25,

Performing sc.parallelize (..) in workers not in the driver program

2015-06-25 Thread shahab
Hi, Apparently, the sc.parallelize(..) operation is performed in the driver program, not in the workers! Is it possible to do this in a worker process for the sake of scalability? best /Shahab

Fwd: map vs mapPartitions

2015-06-25 Thread Hao Ren
-- Forwarded message -- From: Hao Ren inv...@gmail.com Date: Thu, Jun 25, 2015 at 7:03 PM Subject: Re: map vs mapPartitions To: Shushant Arora shushantaror...@gmail.com In fact, map and mapPartitions produce RDD of the same type: MapPartitionsRDD. Check RDD api source code

Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
Hi, I'm trying to use spark over Azure's HDInsight but the spark-shell fails when starting: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Shivaram Venkataraman
In addition to Aleksander's point please let us know what use case would use RDD-like API in https://issues.apache.org/jira/browse/SPARK-7264 -- We are hoping to have a version of this API in upcoming releases. Thanks Shivaram On Thu, Jun 25, 2015 at 6:02 AM, Eskilson,Aleksander

assign unique ID (Long Value) to each line in RDD

2015-06-25 Thread Ravikant Dindokar
I have a file containing one line for each edge in the graph with two vertex ids (source sink). Sample: 1 2 (here 1 is source and 2 is sink node for the edge), 1 5, 2 3, 4 2, 4 3. I want to assign a unique ID (Long value) to each edge, i.e. for each line of the file. How to ensure

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-25 Thread Ted Yu
The assertion failure from TriangleCount.scala corresponds with the following lines: g.outerJoinVertices(counters) { (vid, _, optCounter: Option[Int]) => val dblCount = optCounter.getOrElse(0) // double count should be even (divisible by two) assert((dblCount & 1)
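For context, a sketch of loading the graph the way triangleCount() expects it in the 1.x GraphX API (canonical edge orientation plus a partitioning step), under the assumption that this precondition is the relevant one here; the edge file name is a placeholder:

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    // triangleCount() in the 1.x API expects srcId < dstId and a partitioned graph.
    val graph = GraphLoader.edgeListFile(sc, "edges.txt", canonicalOrientation = true)
      .partitionBy(PartitionStrategy.RandomVertexCut)
    val triCounts = graph.triangleCount().vertices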

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
I don't know exactly what's going on under the hood, but I would not assume that just because a whole partition is not being pulled into memory at one time, each record is being pulled one at a time. That's the beauty of exposing Iterators/Iterables in an API rather than collections-

Re: Parsing a tsv file with key value pairs

2015-06-25 Thread Don Drake
Use this package: https://github.com/databricks/spark-csv and change the delimiter to a tab. The documentation is pretty straightforward, you'll get a Dataframe back from the parser. -Don On Thu, Jun 25, 2015 at 4:39 AM, Ravikant Dindokar ravikant.i...@gmail.com wrote: So I have a file
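A sketch of doing that with the 1.4 read API; the path is a placeholder and the header setting is an assumption about the file:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("delimiter", "\t")     // tab-separated instead of the default comma
      .option("header", "true")      // assuming the first line holds column names
      .load("data.tsv")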

Re: How to run kmeans.py Spark example in yarn-cluster ?

2015-06-25 Thread Elkhan Dadashov
Hi all, Does Spark 1.4 version support Python applications on Yarn-cluster ? (--master yarn-cluster) Does Spark 1.4 version support Python applications with deploy-mode cluster ? (--deploy-mode cluster) How can we ship 3rd party Python dependencies with Python Spark job ? (running on Yarn

Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Elkhan Dadashov
In addition to previous emails, when i try to execute this command from command line: ./bin/spark-submit --verbose --master yarn-cluster --py-files mypython/libs/numpy-1.9.2.zip --deploy-mode cluster mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0 - numpy-1.9.2.zip - is downloaded numpy

Re:

2015-06-25 Thread Silvio Fiorito
Hi Deepak, Have you tried specifying the minimum partitions when you load the file? I haven’t tried that myself against HDFS before, so I’m not sure if it will affect data locality. Ideally not, it should still maintain data locality but just more partitions. Once your job runs, you can check

Re: java.io.NotSerializableException: org.apache.spark.SparkContext

2015-06-25 Thread ๏̯͡๏
Ok. I modified the code to remove sc as sc is never serializable and must not be passed to map functions. On Thu, Jun 25, 2015 at 11:11 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Spark Version: 1.3.1 How can SparkContext not be serializable. Any suggestions to resolve this issue ? I

java.io.NotSerializableException: org.apache.spark.SparkContext

2015-06-25 Thread ๏̯͡๏
Spark Version: 1.3.1 How can SparkContext not be serializable. Any suggestions to resolve this issue ? I included a trait + implementation (implmentation has a method that takes SC as argument) and i started seeing this exception trait DetailDataProvider[T1 : Data] extends java.io.Serializable

Scala/Python or Java

2015-06-25 Thread spark user
Hi All, I am new to Spark; I just want to know which technology is good/best for learning Spark? 1) Scala 2) Java 3) Python. I know Spark supports all 3 languages, but which one is best? Thanks, su

Spark 1.4.0, Secure YARN Cluster, Application Master throws 500 connection refused

2015-06-25 Thread Nachiketa
Spark 1.4.0 - Custom built from source against Hortonworks HDP 2.2 (hadoop 2.6.0+) HDP 2.2 Cluster (Secure, kerberos) spark-shell (--master yarn-client) launches fine and the prompt shows up. Clicking on the Application Master url on the YARN RM UI, throws 500 connect error. The same build works

Re: Has anyone run Python Spark application on Yarn-cluster mode ? (which has 3rd party Python modules to be shipped with)

2015-06-25 Thread Marcelo Vanzin
That sounds like SPARK-5479 which is not in 1.4... On Thu, Jun 25, 2015 at 12:17 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: In addition to previous emails, when i try to execute this command from command line: ./bin/spark-submit --verbose --master yarn-cluster --py-files

Recent spark sc.textFile needs hadoop for folders?!?

2015-06-25 Thread Ashic Mahtab
Hello, just trying out Spark 1.4 (we're using 1.1 at present). On Windows, I've noticed the following: * On 1.4, sc.textFile("D:\\folder\\").collect() fails from both spark-shell.cmd and when running a Scala application referencing the spark-core package from Maven. *

Re:

2015-06-25 Thread ๏̯͡๏
How can i increase the number of tasks from 174 to 500 without running repartition. The input size is 512.0 MB (hadoop) / 4159106. Can this be reduced to 64 MB so as to increase the number of tasks. Similar to split size that increases the number of mappers in Hadoop M/R. On Thu, Jun 25, 2015 at

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Peter Rudenko
Hi Daniel, yes it is supported; however, you need to add hadoop-azure.jar to the classpath of the spark shell (http://search.maven.org/#search%7Cga%7C1%7Chadoop-azure - it's available only for hadoop-2.7.0). Try to find it on your node and run: export CLASSPATH=$CLASSPATH:hadoop-azure.jar spark-shell

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Silvio Fiorito
Hi Daniel, As Peter pointed out you need the hadoop-azure JAR as well as the Azure storage SDK for Java (com.microsoft.azure:azure-storage). Even though the WASB driver is built for 2.7, I was still able to use the hadoop-azure JAR with Spark built for older Hadoop versions, back to 2.4 I
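Once those JARs are on the classpath, a wasb:// path can be read like any other; the account and container names below are placeholders:

    // Assumes the storage account key is already configured
    // (fs.azure.account.key.<account>.blob.core.windows.net), as it is on HDInsight.
    val rdd = sc.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/data/input.txt")
    rdd.take(5).foreach(println)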

Re: sparkR could not find function textFile

2015-06-25 Thread Wei Zhou
I tried out the solution using spark-csv package, and it worked fine now :) Thanks. Yes, I'm playing with a file with all columns as String, but the real data I want to process are all doubles. I'm just exploring what sparkR can do versus regular scala spark, as I am by heart a R person.

RE: Using Spark on Azure Blob Storage

2015-06-25 Thread Jacob Kim
Below is the link for step by step guide in how to setup and use Spark in HDInsight. https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/ Jacob From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: Thursday, June 25, 2015 3:19 PM To: Silvio

reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
Hi, I am trying to see what is the best way to reduce the values of an RDD of (key,value) pairs into a (key,ListOfValues) pair. I know various ways of achieving this, but I am looking for an efficient, elegant one-liner if there is one. Example: Input RDD: (USA, California), (UK, Yorkshire),

Re: sql dataframe internal representation

2015-06-25 Thread Michael Armbrust
In many cases we use more efficient mutable implementations internally (i.e. mutable undecoded utf8 instead of java.lang.String, or a BigDecimal implementation that uses a Long when the number is small enough). On Thu, Jun 25, 2015 at 1:56 PM, Koert Kuipers ko...@tresata.com wrote: i noticed in
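A small sketch of where that conversion shows up from the user's side; the path and the column type are assumptions:

    val df = sqlContext.read.parquet("some/table")   // placeholder path

    // Internally Catalyst may hold values in mutable/undecoded form; the converter quoted in the
    // question runs when you drop down to the RDD view, so what you get back are plain Scala/Java types.
    val firstValue: String = df.rdd.first().getString(0)   // assumes column 0 is a string column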

Re:

2015-06-25 Thread Silvio Fiorito
Ok, in that case I think you can set the max split size in the Hadoop config object, using the FileInputFormat.SPLIT_MAXSIZE config parameter. Again, I haven’t done this myself, but looking through the Spark codebase here:
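A sketch of that setting using the key FileInputFormat.SPLIT_MAXSIZE resolves to in Hadoop 2, with the 64 MB figure from the question:

    // Cap input splits at 64 MB so more tasks are created, without calling repartition().
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
                               (64 * 1024 * 1024).toString)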

Re: sparkR could not find function textFile

2015-06-25 Thread Wei Zhou
Thanks Shivaram, this is exactly what I am looking for. 2015-06-25 14:22 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu : You can use the Spark CSV reader to do read in flat CSV files to a data frame. See https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85 for an example

Re:

2015-06-25 Thread ๏̯͡๏
I use sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](path + "/*.avro") https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/SparkContext.html#newAPIHadoopFile(java.lang.String, java.lang.Class, java.lang.Class, java.lang.Class,

Executors requested are way less than what i actually got

2015-06-25 Thread ๏̯͡๏
I run Spark App on Spark 1.3.1 over YARN. When i request --num-executors 9973 and when i see Executors from Environment tab from SPARK UI its between 200 to 300. What is incorrect here ? -- Deepak

Spark RDS data insertion

2015-06-25 Thread Bill Milan
Hi all, I am running a program which connects to Amazon RDS and generate some data from S3 into RDD. When I run rdd.collect and insert the results into RDS using JDBC, I get communication link failure. I tried to insert results into RDS using both python and mysql client in the master machine and

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Wei Zhou
Hi Shivaram/Alek, I understand that a better way to import data is into a DataFrame rather than an RDD. If one wants to do a map-like transformation for each row in SparkR, one could use SparkR:::lapply(), but is there a counterpart row operation on DataFrame? The use case I am working on requires

Re: Scala/Python or Java

2015-06-25 Thread Kannappan Sirchabesan
Hi, If you are new to all three languages, go with Scala or Python. Python is easier but check out Scala and see if it is easy enough for you. With the launch of data frames, it might not even matter which language you choose performance-wise. Thanks, Kannappan On Jun 25, 2015, at 10:02

Re: How to Map and Reduce in sparkR

2015-06-25 Thread Wei Zhou
Thanks Shivaram. For those who prefer to watch the video version for the talk, like me, you can actually register for spark summit live stream 2015 free of cost. I personally find the talk extremely helpful. 2015-06-25 15:20 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu : We don't

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
Thank you guys for the helpful answers. Daniel On 25 June 2015, at 21:23, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi Daniel, As Peter pointed out you need the hadoop-azure JAR as well as the Azure storage SDK for Java (com.microsoft.azure:azure-storage). Even though the

Vision old applications in webui with json logs

2015-06-25 Thread maxdml
Is it possible to recreate the same views given in the webui for completed applications, when rebooting the master, thanks to the log files? I just tried to change the url of the form http://w.x.y.z:8080/history/app-2-0036, by giving the appID, but it redirected me to the master's

Re: reduceByKey - add values to a list

2015-06-25 Thread Sven Krasser
Hey Kannappan, First of all, what is the reason for avoiding groupByKey since this is exactly what it is for? If you must use reduceByKey with a one-liner, then take a look at this: lambda a,b: (a if type(a) == list else [a]) + (b if type(b) == list else [b]) In contrast to groupByKey, this

Re: reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
Thanks. This should work fine. I am trying to avoid groupByKey for performance reasons as the input is a giant RDD, and the operation is an associative operation, so minimal shuffle if done via reduceByKey. On Jun 26, 2015, at 12:25 AM, Sven Krasser kras...@gmail.com wrote: Hey Kannappan,

Re: Spark1.4.0 compiling error with java1.6.0_20: sun.misc.Unsafe cannot be applied to (java.lang.Object,long,java.lang.Object,long,long)

2015-06-25 Thread Ted Yu
Looks like the Java 1.6 version of copyMemory doesn't support specification of offsets. This means an extra memory copy. Can you upgrade your Java version? Thanks On Jun 25, 2015, at 6:35 PM, 胡安扬 zzu...@163.com wrote: Hi, all: When compiling spark1.4.0 with java1.6.0_20 (maven

Fw:Re:Re: Spark1.4.0 compiling error with java1.6.0_20: sun.misc.Unsafe cannot be applied to (java.lang.Object,long,java.lang.Object,long,long)

2015-06-25 Thread Young
+all user Forwarding messages From: Young zzu...@163.com Date: 2015-06-26 10:31:19 To: Ted Yu yuzhih...@gmail.com Subject: Re:Re: Spark1.4.0 compiling error with java1.6.0_20: sun.misc.Unsafe cannot be applied to (java.lang.Object,long,java.lang.Object,long,long) Thanks for

Re: Spark SQL JDBC Source data skew

2015-06-25 Thread Sathish Kumaran Vairavelu
Can someone help me here, please? On Sat, Jun 20, 2015 at 9:54 AM Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hi, In the Spark SQL JDBC data source there is an option to specify the upper/lower bound and the number of partitions. How does Spark handle data distribution if we do not give the

Re: Failed to save RDD to text File in windows OS

2015-06-25 Thread stati
Please follow the instructions given in the links below. https://issues.apache.org/jira/browse/SPARK-6961 http://www.srccodes.com/p/article/39/error-util-shell-failed-locate-winutils-binary-hadoop-binary-path

Re: reduceByKey - add values to a list

2015-06-25 Thread Sven Krasser
In that case the reduceByKey operation will likely not give you any benefit (since you are not aggregating data into smaller values but instead building the same large list you'd build with groupByKey). If you look at rdd.py, you can see that both operations eventually use a similar operation to
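For completeness, a Scala sketch that builds the per-key lists without groupByKey, using aggregateByKey; as noted above, the shuffle volume is comparable, and the sample data is taken from the question:

    val pairs = sc.parallelize(Seq(("USA", "California"), ("UK", "Yorkshire"), ("USA", "Texas")))

    val byCountry = pairs.aggregateByKey(List.empty[String])(
      (acc, v) => v :: acc,     // fold a value into the partition-local list
      (a, b)   => a ::: b)      // merge lists coming from different partitions
    // byCountry.collect()  =>  e.g. Array((USA, List(Texas, California)), (UK, List(Yorkshire)))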
