Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
Does YARN provide the token through that env variable you mentioned? Or how does YARN do this? Tim On Fri, Jun 26, 2015 at 3:51 PM, Marcelo Vanzin wrote: > On Fri, Jun 26, 2015 at 3:44 PM, Dave Ariens > wrote: > >> Fair. I will look into an alternative with a generated delegation >> token.

Re: dataframe left joins are not working as expected in pyspark

2015-06-26 Thread Axel Dahl
still feels like a bug to have to create unique names before a join. On Fri, Jun 26, 2015 at 9:51 PM, ayan guha wrote: > You can declare the schema with unique names before creation of df. > On 27 Jun 2015 13:01, "Axel Dahl" wrote: > >> >> I have the following code: >> >> from pyspark import SQ

Re: dataframe left joins are not working as expected in pyspark

2015-06-26 Thread ayan guha
You can declare the schema with unique names before creation of df. On 27 Jun 2015 13:01, "Axel Dahl" wrote: > > I have the following code: > > from pyspark import SQLContext > > d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', > 'country': 'jpn', 'age': 2}, {'name':'carol', 'co
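
A minimal sketch of the rename-before-join idea suggested above, written in Scala rather than PySpark (d1 and d2 stand for the two DataFrames built from the dictionaries in the original post; the renamed column names are illustrative):

    import org.apache.spark.sql.DataFrame

    // Renaming the join columns on one side avoids ambiguous/duplicate
    // column names in the joined result.
    def leftJoinDistinct(d1: DataFrame, d2: DataFrame): DataFrame = {
      val d2r = d2.withColumnRenamed("name", "name2").withColumnRenamed("country", "country2")
      d1.join(d2r, d1("name") === d2r("name2") && d1("country") === d2r("country2"), "left_outer")
    }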

Re: Executors requested are way less than what i actually got

2015-06-26 Thread ๏̯͡๏
How many nodes do you have: 171.42 TB for a total of 2040 nodes. how much space is allocated to each node for YARN: 14 G max for each container. any thing beyond causes failure how big are the executors you're requesting, *9973* and what else is running on the cluster? There are 1000s of other YARN

Re: Join highly skewed datasets

2015-06-26 Thread ๏̯͡๏
This is nice. Which version of Spark has this support ? Or do I need to build it. I have never built Spark from git, please share instructions for Hadoop 2.4.x YARN. I am struggling a lot to get a join work between 200G and 2TB datasets. I am constantly getting this exception 1000s of executors a

dataframe left joins are not working as expected in pyspark

2015-06-26 Thread Axel Dahl
I have the following code: from pyspark import SQLContext d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}] d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice', 'country': 'ire', 'colou

Re: spark streaming with kafka reset offset

2015-06-26 Thread Cody Koeninger
Read the Spark Streaming guide and the Kafka integration guide for a better understanding of how the receiver-based stream works. Capacity planning is specific to your environment and what the job is actually doing; you'll need to determine it empirically. On Friday, June 26, 2015, Shushant Arora

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread XianXing Zhang
Yes, we deployed Spark on top of YARN. What you suggested is very helpful: I increased the YARN memory overhead option and it helped in most cases. (Sometimes it still has some failures when the amount of data to be shuffled is large, but I guess if I continue increasing the YARN memory overhead opt

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
My workflow was to install RStudio on a cluster launched using the Spark EC2 scripts. However I did a bunch of tweaking after that (like copying the Spark installation over etc.). When I get some time I'll try to write the steps down in the JIRA. Thanks Shivaram On Fri, Jun 26, 2015 at 10:21 AM, wrot

Re: spark streaming with kafka reset offset

2015-06-26 Thread Shushant Arora
In 1.2, how do I handle offset management after the stream application starts, in each job? Should I commit offsets manually after job completion? And what is the recommended number of consumer threads? Say I have 300 partitions in the kafka cluster. Load is ~1 million events per second. Each event is ~500 bytes.

Re: sparkR could not find function "textFile"

2015-06-26 Thread Wei Zhou
Yeah, I noticed all columns are cast into strings. Thanks Alek for pointing out the solution before I even encountered the problem. 2015-06-26 7:01 GMT-07:00 Eskilson,Aleksander : > Yeah, I ask because you might notice that by default the column types > for CSV tables read in by read.df() are on

Re: HOw to concatenate two csv files into one RDD?

2015-06-26 Thread Sujit Pal
Hi Rex, If the CSV files are in the same folder and there are no other files, specifying the directory to sc.textFile() (or equivalent) will pull in all the files. If there are other files, you can pass in a pattern that would capture the two files you care about (if that's possible). If neither o
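
A short sketch of the options described above (paths and file names are placeholders):

    // All files in one folder: point textFile at the directory.
    val all = sc.textFile("hdfs:///data/csvdir")

    // Only specific files: a comma-separated list or a glob pattern also works.
    val two = sc.textFile("hdfs:///data/csvdir/a.csv,hdfs:///data/csvdir/b.csv")
    val byPattern = sc.textFile("hdfs:///data/csvdir/part-*.csv")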

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 3:44 PM, Dave Ariens wrote: > Fair. I will look into an alternative with a generated delegation token. > However the same issue exists. How can I have the executor run some > arbitrary code when it gets a task assignment and before it proceeds to > process it's resour

Re: Executors requested are way less than what i actually got

2015-06-26 Thread Sandy Ryza
The scheduler configurations are helpful as well, but not useful without the information outlined above. -Sandy On Fri, Jun 26, 2015 at 10:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > These are my YARN queue configurations > > Queue State: RUNNING; Used Capacity: 206.7%; Absolute Used Capacity: 3.1%; Absolute > Capaci

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
Fair. I will look into an alternative with a generated delegation token. However the same issue exists. How can I have the executor run some arbitrary code when it gets a task assignment and before it proceeds to process its resources? From: Marcelo Vanzin Sent: Friday, June 26, 2015 6:20

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens wrote: > Would there be any way to have the task instances in the slaves call the > UGI login with a principal/keytab provided to the driver? > That would only work with a very small number of executors. If you have many login requests in a short per

Re: Join highly skewed datasets

2015-06-26 Thread Koert Kuipers
we went through a similar process, switching from scalding (where everything just works on large datasets) to spark (where it does not). spark can be made to work on very large datasets, it just requires a little more effort. pay attention to your storage levels (should be memory-and-disk or disk-
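
For reference, a sketch of the storage-level advice above (paths are placeholders; the point is to spill to disk rather than recompute or OOM on very large datasets):

    import org.apache.spark.storage.StorageLevel

    // Prefer MEMORY_AND_DISK (or DISK_ONLY) over the default MEMORY_ONLY
    // when partitions will not reliably fit in memory.
    val big = sc.textFile("hdfs:///data/big").persist(StorageLevel.MEMORY_AND_DISK)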

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
This would be fantastic to take advantage of once it's available and I agree that YARN's implementation would be ideal to base it off. I'm wondering if there might be an interim workaround anyone could think of in the meantime though. Would there be any way to have the task instances in th

Re: Join highly skewed datasets

2015-06-26 Thread ๏̯͡๏
Not far at all. On large data sets everything simply fails with Spark. Worst is, I am not able to figure out the reason for the failure; the logs run into millions of lines and I do not know the keywords to search for the failure reason On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf wrote: > How far did you g

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Ashish Nigam
Here's code - def createStreamingContext(checkpointDirectory: String) : StreamingContext = { val conf = new SparkConf().setAppName("KafkaConsumer") conf.set("spark.eventLog.enabled", "false") logger.info("Going to init spark context") conf.getOption("spark.master") match {

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Too many values to unpack

2015-06-26 Thread Ayman Farahat
I tried something similar and got oration error I had 10 executors and 10 8 cores >>> ratings = newrdd.map(lambda l: >>> Rating(int(l[1]),int(l[2]),l[4])).partitionBy(50) >>> mypart = ratings.getNumPartitions() >>> mypart 50 >>> numIterations =10 >>> rank = 100 >>> model = ALS.trainImplicit(rati

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 2:08 PM, Tim Chen wrote: > Mesos do support running containers as specific users passed to it. > Thanks for chiming in, what else does YARN do with Kerberos besides keytab > file and user? > The basic things I'd expect from a system to properly support Kerberos would be:

Unable to start Pi (hello world) application on Spark 1.4

2015-06-26 Thread ๏̯͡๏
It used to work with 1.3.1, however with 1.4.0 i get the following exception export SPARK_HOME=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4 export SPARK_JAR=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar export HADOOP_CONF_DIR=/apache/hadoop/co

Spark 1.4 - memory bloat in group by/aggregate???

2015-06-26 Thread Manoj Samel
Hi, - Spark 1.4 on a single node machine. Run spark-shell - Reading from Parquet file with bunch of text columns and couple of amounts in decimal(14,4). On-disk size of the file is 376M. It has ~100 million rows - rdd1 = sqlcontext.read.parquet - rdd1.cache - group_by_df =

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
Mesos does support running containers as specific users passed to it. Thanks for chiming in, what else does YARN do with Kerberos besides keytab file and user? Tim On Fri, Jun 26, 2015 at 1:20 PM, Marcelo Vanzin wrote: > On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen wrote: > >> So correct me if I'm

Re: spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Cody Koeninger
Make sure you're following the docs regarding setting up a streaming checkpoint. Post your code if you can't get it figured out. On Fri, Jun 26, 2015 at 3:45 PM, Ashish Nigam wrote: > I bring up spark streaming job that uses Kafka as input source. > No data to process and then shut it down. And
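
The checkpoint setup the streaming docs describe, sketched in Scala (checkpoint directory and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-stream"   // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("KafkaConsumer")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // Define the DStream graph here, inside the factory, so it can be
      // rebuilt correctly when the application restarts from the checkpoint.
      ssc
    }

    // Recreates the context from the checkpoint if one exists, otherwise calls the factory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()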

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
There are a few security-related issues that I am postponing dealing with. Once I get this working I'll look at the security side. Likely I'll be encouraging users to submit their jobs via docker containers. Regardless, getting the user's keytab and principal name in the working environment o

spark streaming job fails to restart after checkpointing due to DStream initialization errors

2015-06-26 Thread Ashish Nigam
I bring up spark streaming job that uses Kafka as input source. No data to process and then shut it down. And bring it back again. This time job does not start because it complains that DStream is not initialized. 15/06/26 01:10:44 ERROR yarn.ApplicationMaster: User class threw exception: org.apac

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread Eugen Cepoi
Are you using YARN? If yes, increase the YARN memory overhead option. YARN is probably killing your executors. On Jun 26, 2015 8:43 PM, "XianXing Zhang" wrote: > Do we have any update on this thread? Has anyone met and solved similar > problems before? > > Any pointers will be greatly appreciated
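
A sketch of the option being referred to (the value is only an example, in megabytes; for Spark 1.x the executor-side setting is spark.yarn.executor.memoryOverhead):

    // In spark-defaults.conf or via --conf on spark-submit:
    //   spark.yarn.executor.memoryOverhead  2048
    // or programmatically, before the SparkContext is created:
    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "2048")  // example value in MB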

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen wrote: > So correct me if I'm wrong, sounds like all you need is a principal user > name and also a keytab file downloaded right? > I'm not familiar with Mesos so don't know what kinds of features it has, but at the very least it would need to start cont

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
How do I set these partitions? Is this the call to ALS: model = ALS.trainImplicit(ratings, rank, numIterations)? On Jun 26, 2015, at 12:33 PM, Xiangrui Meng wrote: > So you have 100 partitions (blocks). This might be too many for your dataset. > Try setting a smaller number of blocks, e.g.,

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
So correct me if I'm wrong, sounds like all you need is a principal user name and also a keytab file downloaded right? I'm adding support in the Spark framework to download additional files alongside your executor and driver, and one workaround is to specify a user principal and keytab file that ca

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
I set the number of partitions on the input dataset at 50. The number of CPU cores I'm using is 84 (7 executors, 12 cores). I'll look into getting a full stack trace. Any idea what my errors mean, and why increasing memory causes them to go away? Thanks. On Fri, Jun 26, 2015 at 11:26 AM, Xiangrui

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
So you have 100 partitions (blocks). This might be too many for your dataset. Try setting a smaller number of blocks, e.g., 32 or 64. When ALS starts iterations, you can see the shuffle read/write size from the "stages" tab of Spark WebUI. Vary number of blocks and check the numbers there. Kryo ser
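
The number of blocks can be passed directly to the MLlib call; a hedged Scala sketch of the advice above (the lambda and alpha values are purely illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // ratings is assumed to be the implicit-feedback RDD prepared elsewhere.
    // Arguments: ratings, rank, iterations, lambda, blocks, alpha.
    def train(ratings: RDD[Rating]) =
      ALS.trainImplicit(ratings, 100, 12, 0.01, 32, 0.01)   // 32 blocks instead of 100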

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Olivier Girardot
I would pretty much need exactly this kind of feature too. On Fri, Jun 26, 2015 at 9:17 PM, Dave Ariens wrote: > Hi Timothy, > > > > Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I > need to ensure that my tasks running on the slaves perform a Kerberos login > before acc

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
Hi Timothy, Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need to ensure that my tasks running on the slaves perform a Kerberos login before accessing any HDFS resources. To login, they just need the name of the principal (username) and a keytab file. Then they just
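
A minimal sketch of the login step described here, using the standard Hadoop UserGroupInformation API (the principal and keytab path are placeholders, and shipping the keytab to the slave is assumed to be solved separately):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)

    // Log in before touching any HDFS resources on the executor.
    UserGroupInformation.loginUserFromKeytab("someuser@EXAMPLE.COM", "/path/to/someuser.keytab")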

Re: Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Mark Hamstra
Do you want to transform the RDD, or just produce some side effect with its contents? If the latter, you want foreachPartition, not mapPartitions. On Fri, Jun 26, 2015 at 11:52 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > In rdd.mapPartition(…) if I try to iterate through

Cannot iterate items in rdd.mapPartition()

2015-06-26 Thread Wang, Ningjun (LNG-NPV)
In rdd.mapPartition(...) if I try to iterate through the items in the partition, everything screws up. For example val rdd = sc.parallelize(1 to 1000, 3) val count = rdd.mapPartitions(iter => { println(iter.length) iter }).count() The count is 0. This is incorrect. The count should be 1000. If
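
The iterator handed to mapPartitions is single-pass, so iter.length consumes it and the returned (now empty) iterator yields a count of 0. A sketch of one way around it, if the transformation really needs to look at the items first:

    val rdd = sc.parallelize(1 to 1000, 3)
    val count = rdd.mapPartitions { iter =>
      val items = iter.toList      // materialize once; the iterator can only be traversed once
      println(items.length)        // inspect the materialized copy, not the iterator
      items.iterator               // hand back a fresh iterator over the same elements
    }.count()                      // now 1000 as expected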

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread XianXing Zhang
Do we have any update on this thread? Has anyone met and solved similar problems before? Any pointers will be greatly appreciated! Best, XianXing On Mon, Jun 15, 2015 at 11:48 PM, Jia Yu wrote: > Hi Peng, > > I got exactly same error! My shuffle data is also very large. Have you > figured out

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
Thanks for the awesome response, Steve. As you say, it's not ideal, but the clarification greatly helps. Cheers, everyone :) -Ashic. Subject: Re: Recent spark sc.textFile needs hadoop for folders?!? From: ste...@hortonworks.com To: as...@live.com CC: guha.a...@gmail.com; user@spark.apache.org Date

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
Hi, we are on 1.3.1 right now so in case there are differences in the Spark files I'll walk through the logic of what we did and post a couple gists at the end. We haven't committed to forking Spark for our own deployments yet, so right now we shadow some Spark classes in our application code with

HOw to concatenate two csv files into one RDD?

2015-06-26 Thread Rex X
With Python Pandas, it is easy to do concatenation of dataframes by combining pandas.concat and pandas.read_csv pd.concat([pd.read_csv(os.path.join(Path_to_csv_files, f)) for f in csvfiles]) where "csvfiles" is the list o

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
Hello; I checked on my partitions/storage and here is what I have. I have 80 executors, 5 G per executor. Do I need to set additional params, say cores? spark.serializer org.apache.spark.serializer.KryoSerializer # spark.driver.memory 5g # spark.executor.extraJavaOp

YARN worker out of disk memory

2015-06-26 Thread Tarun Garg
Hi, I am running a spark job over yarn; after 2-3 hr execution, workers start dying and I found a lot of files at /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1435184713615_0008/blockmgr-333f0ade-2474-43a6-9960-f08a15bcc7b7/3f named temp_shuffle. My job is kakfastream.map

Re:

2015-06-26 Thread Silvio Fiorito
No worries, glad to help! It also helped me as I had not worked directly with the Hadoop APIs for controlling splits. From: "ÐΞ€ρ@Ҝ (๏̯͡๏)" Date: Friday, June 26, 2015 at 1:31 PM To: Silvio Fiorito Cc: user Subject: Re: Silvio, Thanks for your responses and patience. It worked after i reshuffled

Re: Executors requested are way less than what i actually got

2015-06-26 Thread ๏̯͡๏
These are my YARN queue configurations: Queue State: RUNNING; Used Capacity: 206.7%; Absolute Used Capacity: 3.1%; Absolute Capacity: 1.5%; Absolute Max Capacity: 10.0%; Used Resources: ; Num Schedulable Applications: 7; Num Non-Schedulable Applications: 0; Num Containers: 390; Max Applications: 45; Max Applications Per User: 27

Re:

2015-06-26 Thread ๏̯͡๏
Silvio, Thanks for your responses and patience. It worked after i reshuffled the arguments and removed avro dependencies. On Fri, Jun 26, 2015 at 9:55 AM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > OK, here’s how I did it, using just the built-in Avro libraries with > Spark 1.3: >

Spark SQL - Setting YARN Classpath for primordial class loader

2015-06-26 Thread Kumaran Mani
Hi, The response to the below thread for making yarn-client mode work by adding the JDBC driver JAR to spark.{driver,executor}.extraClassPath works fine. http://mail-archives.us.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAAOnQ7vHeBwDU2_EYeMuQLyVZ77+N_jDGuinxOB=sff2lkc...@mail.gmail.com%3E Bu

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread mark
So you created an EC2 instance with RStudio installed first, then installed Spark under that same username?  That makes sense, I just want to verify your work flow. Thank you again for your willingness to help! On Fri, Jun 26, 2015 at 10:13 AM -0700, "Shivaram Venkataraman" wrote:

Re: Unable to specify multiple directories as input

2015-06-26 Thread ๏̯͡๏
So for each directory you create one RDD and then union them all. On Fri, Jun 26, 2015 at 10:05 AM, Bahubali Jain wrote: > oh..my use case is not very straight forward. > The input can have multiple directories... > > On Fri, Jun 26, 2015 at 9:30 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > >> Yes, only workin

Re: Dependency Injection with Spark Java

2015-06-26 Thread Igor Berman
Asked myself the same question today... It actually depends on what you are trying to do. If you want injection into workers' code I think it will be a bit hard... If only in code that the driver executes, i.e. in main, it's straightforward imho; just create your classes from the injector (e.g. spring's application c

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
I was using RStudio on the master node of the same cluster in the demo. However I had installed Spark under the user `rstudio` (i.e. /home/rstudio) and that will make the permissions work correctly. You will need to copy the config files from /root/spark/conf after installing Spark though and it mi

spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-26 Thread igor.berman
Hi, wanted to get some advice regarding tuning a spark application. I see for some of the tasks many log entries like this: Executor task launch worker-38 ExternalAppendOnlyMap: Thread 239 spilling in-memory map of 5.1 MB to disk (272 times so far) (especially when inputs are considerable) I understan

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Mark Stephenson
Thanks! In your demo video, were you using RStudio to hit a separate EC2 Spark cluster? I noticed that it appeared your browser that you were using EC2 at that time, so I was just curious. It appears that might be one of the possible workarounds - fire up a separate EC2 instance with RStudio

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Also, I get TaskContext.get() null when used in foreach function below ( I get it when I use it in map, but the whole point here is to handle something that is breaking in action ). Please help. :( From: amit assudani mailto:aassud...@impetus.com>> Date: Friday, June 26, 2015 at 11:41 AM To: Cod

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
You can comma separate them or use globbing patterns 2015-06-26 18:54 GMT+02:00 Ted Yu : > See this related thread: > http://search-hadoop.com/m/q3RTtiYm8wgHego1 > > On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain wrote: > >> >> Hi, >> How do we read files from multiple directories using newApiHa

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
Comma separated paths works only with spark 1.4 and up 2015-06-26 18:56 GMT+02:00 Eugen Cepoi : > You can comma separate them or use globbing patterns > > 2015-06-26 18:54 GMT+02:00 Ted Yu : > >> See this related thread: >> http://search-hadoop.com/m/q3RTtiYm8wgHego1 >> >> On Fri, Jun 26, 2015 at
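
A sketch combining the two suggestions in this thread (glob patterns work in earlier releases too; comma-separated paths need Spark 1.4+ per the note above; the paths and input types are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Glob pattern over several directories:
    val byGlob = sc.newAPIHadoopFile(
      "hdfs:///data/2015-06-{25,26}/*",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

    // Comma-separated directories (Spark 1.4 and up):
    val byList = sc.newAPIHadoopFile(
      "hdfs:///data/dir1,hdfs:///data/dir2",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])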

Re:

2015-06-26 Thread Silvio Fiorito
OK, here’s how I did it, using just the built-in Avro libraries with Spark 1.3: import org.apache.avro.generic.{GenericData, GenericRecord} import org.apache.avro.mapred.AvroKey import org.apache.avro.mapreduce.AvroKeyInputFormat import org.apache.hadoop.io.NullWritable import org.apache.hadoop.ma

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Ted Yu
See this related thread: http://search-hadoop.com/m/q3RTtiYm8wgHego1 On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain wrote: > > Hi, > How do we read files from multiple directories using newApiHadoopFile () ? > > Thanks, > Baahu > -- > Twitter:http://twitter.com/Baahu > >

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Timothy Chen
Hi Dave, I don't understand Kerberos much but if you know the exact steps that need to happen I can see how we can make that happen with the Spark framework. Tim > On Jun 26, 2015, at 8:49 AM, Dave Ariens wrote: > > I understand that Kerberos support for accessing Hadoop resources in Spark

Re: Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread Shivaram Venkataraman
We don't have a documented way to use RStudio on EC2 right now. We have a ticket open at https://issues.apache.org/jira/browse/SPARK-8596 to discuss work-arounds and potential solutions for this. Thanks Shivaram On Fri, Jun 26, 2015 at 6:27 AM, RedOakMark wrote: > Good morning, > > I am having

Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newApiHadoopFile () ? Thanks, Baahu -- Twitter:http://twitter.com/Baahu

Re: Problem after enabling Hadoop native libraries

2015-06-26 Thread Marcelo Vanzin
What master are you using? If this is not a "local" master, you'll need to set LD_LIBRARY_PATH on the executors also (using spark.executor.extraLibraryPath). If you are using local, then I don't know what's going on. On Fri, Jun 26, 2015 at 1:39 AM, Arunabha Ghosh wrote: > Hi, > I'm having

Re: GraphX - ConnectedComponents (Pregel) - longer and longer interval between jobs

2015-06-26 Thread Thomas Gerber
Note that this problem is probably NOT caused directly by GraphX, but GraphX reveals it because as you go further down the iterations, you get further and further away of a shuffle you can rely on. On Thu, Jun 25, 2015 at 7:43 PM, Thomas Gerber wrote: > Hello, > > We run GraphX ConnectedComponen

Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
I understand that Kerberos support for accessing Hadoop resources in Spark only works when running Spark on YARN. However, I'd really like to hack something together for Spark on Mesos running alongside a secured Hadoop cluster. My simplified application (gist: https://gist.github.com/ariens

Re: Kryo serialization of classes in additional jars

2015-06-26 Thread patcharee
Hi, I am having this problem on spark 1.4. Do you have any ideas how to solve it? I tried to use spark.executor.extraClassPath, but it did not help BR, Patcharee On 04. mai 2015 23:47, Imran Rashid wrote: Oh, this seems like a real pain. You should file a jira, I didn't see an open issue --

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Hmm, not sure why, but when I run this code, it always keeps on consuming from Kafka and proceeds, ignoring the previous failed batches. Also, now that I get the attempt number from TaskContext and I have information of max retries, I am supposed to handle it in the try/catch block, but does it

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
No, if you have a bad message that you are continually throwing exceptions on, your stream will not progress to future batches. On Fri, Jun 26, 2015 at 10:28 AM, Amit Assudani wrote: > Also, what I understand is, max failures doesn’t stop the entire stream, > it fails the job created for the sp

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Also, what I understand is, max failures doesn't stop the entire stream; it fails the job created for the specific batch, but the subsequent batches still proceed, isn't that right? And the question still remains: how to keep track of those failed batches? From: amit assudani mailto:aassud...@impet

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
TaskContext has an attemptNumber method on it. If you want to know which messages failed, you have access to the offsets, and can do whatever you need to with them. On Fri, Jun 26, 2015 at 10:21 AM, Amit Assudani wrote: > Thanks for quick response, > > My question here is how do I know that t
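
A hedged sketch of using the attempt number inside a batch to decide when to stop rethrowing and record a bad message instead (the stream element type, process, logBadMessage, and the maxFailures value mirroring spark.task.maxFailures are all assumptions for illustration):

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.dstream.DStream

    def process(msg: (String, String)): Unit = println(msg)                  // placeholder for the real work
    def logBadMessage(msg: (String, String), e: Throwable): Unit =
      println(s"giving up on $msg: ${e.getMessage}")                         // placeholder error sink

    def handle(stream: DStream[(String, String)], maxFailures: Int = 4): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { iter =>
          val attempt = TaskContext.get().attemptNumber()
          iter.foreach { msg =>
            try process(msg)
            catch {
              case e: Exception if attempt >= maxFailures - 1 => logBadMessage(msg, e)  // last try
              case e: Exception => throw e                                              // let the task retry
            }
          }
        }
      }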

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
No, they use the same implementation. On Fri, Jun 26, 2015 at 8:05 AM, Ayman Farahat wrote: > I use the mllib not the ML. Does that make a difference ? > > Sent from my iPhone > > On Jun 26, 2015, at 7:19 AM, Ravi Mody wrote: > > Forgot to mention: rank of 100 usually works ok, 120 consistently

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Xiangrui Meng
Please see my comments inline. It would be helpful if you can attach the full stack trace. -Xiangrui On Fri, Jun 26, 2015 at 7:18 AM, Ravi Mody wrote: > 1. These are my settings: > rank = 100 > iterations = 12 > users = ~20M > items = ~2M > training examples = ~500M-1B (I'm running into the issue

Re: How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Thanks for quick response, My question here is how do I know that the max retries are done ( because in my code I never know whether it is failure of first try or the last try ) and I need to handle this message, is there any callback ? Also, I know the limitation of checkpoint in upgrading the

Re: How to recover in case user errors in streaming

2015-06-26 Thread Cody Koeninger
If you're consistently throwing exceptions and thus failing tasks, once you reach max failures the whole stream will stop. It's up to you to either catch those exceptions, or restart your stream appropriately once it stops. Keep in mind that if you're relying on checkpoints, and fixing the error

Re: spark streaming with kafka reset offset

2015-06-26 Thread Cody Koeninger
The receiver-based kafka createStream in spark 1.2 uses zookeeper to store offsets. If you want finer-grained control over offsets, you can update the values in zookeeper yourself before starting the job. createDirectStream in spark 1.3 is still marked as experimental, and subject to change. Tha
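
A hedged Scala sketch of the 1.3 direct stream plus reading the offset ranges of each batch, in case you want to store offsets yourself rather than rely on checkpointing (broker list and topic are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    def directStream(ssc: StreamingContext): Unit = {
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("myTopic"))                               // placeholder topic

      stream.foreachRDD { rdd =>
        // Offsets covered by this batch; persist them wherever you like (e.g. ZooKeeper)
        // if you need finer-grained control than checkpoints give you.
        val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        offsets.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
      }
    }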

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ayman Farahat
I use the mllib not the ML. Does that make a difference ? Sent from my iPhone > On Jun 26, 2015, at 7:19 AM, Ravi Mody wrote: > > Forgot to mention: rank of 100 usually works ok, 120 consistently cannot > finish. > >> On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody wrote: >> 1. These are my set

Re:

2015-06-26 Thread ๏̯͡๏
org.apache.avro:avro:1.7.7 (provided); com.databricks:spark-avro_2.10:1.0.0; org.apache.avro:avro-mapred:1.7.7 (classifier hadoop2, provided) On Fri, Jun 26, 2015 at 8:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Same code of yours works for me as well > > On Fri, Jun 26, 2015 at 8:02 AM,

Re:

2015-06-26 Thread ๏̯͡๏
Same code of yours works for me as well On Fri, Jun 26, 2015 at 8:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Is that its not supported with Avro. Unlikely. > > On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) > wrote: > >> My imports: >> >> import org.apache.avro.generic.GenericData >> >> import org.apache.av

Re:

2015-06-26 Thread ๏̯͡๏
Is that its not supported with Avro. Unlikely. On Fri, Jun 26, 2015 at 8:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > My imports: > > import org.apache.avro.generic.GenericData > > import org.apache.avro.generic.GenericRecord > > import org.apache.avro.mapred.AvroKey > > import org.apache.avro.Schema > > impor

Re:

2015-06-26 Thread ๏̯͡๏
My imports: import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import org.apache.avro.mapred.AvroKey import org.apache.avro.Schema import org.apache.hadoop.io.NullWritable import org.apache.avro.mapreduce.AvroKeyInputFormat import org.apache.hadoop.conf.C

Re: Spark driver hangs on start of job

2015-06-26 Thread Richard Marscher
We've seen this issue as well in production. We also aren't sure what causes it, but have just recently shaded some of the Spark code in TaskSchedulerImpl that we use to effectively bubble up an exception from Spark instead of zombie in this situation. If you are interested I can go into more detai

Re:

2015-06-26 Thread Silvio Fiorito
Make sure you’re importing the right namespace for Hadoop v2.0. This is what I tried: import org.apache.hadoop.io.{LongWritable, Text} import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat} val hadoopConf = new org.apache.hadoop.conf.Configuration() hadoopConf.setLong(Fi
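
A self-contained sketch of configuring the maximum split size for newAPIHadoopFile along the same lines (the 64 MB value and the input path are just examples):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    // SPLIT_MAXSIZE is the constant for "mapreduce.input.fileinputformat.split.maxsize"
    hadoopConf.setLong(FileInputFormat.SPLIT_MAXSIZE, 67108864L)   // 64 MB, example value

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/input",            // placeholder path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      hadoopConf)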

Re: hadoop input/output format advanced control

2015-06-26 Thread ๏̯͡๏
I am trying the very same thing to configure min split size with Spark 1.3.1 and I get a compilation error. Code: val hadoopConfiguration = new Configuration(sc.hadoopConfiguration) hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.newAPIHadoopFile

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Benjamin Fradet
There is one for the key of your Kafka message and one for its value. On 26 Jun 2015 4:21 pm, "Ashish Soni" wrote: > my question is why there are similar two parameter String.Class and > StringDecoder.class what is the difference each of them ? > > Ashish > > On Fri, Jun 26, 2015 at 8:53 AM, Akhi

Re: Master dies after program finishes normally

2015-06-26 Thread Yifan LI
Hi, I just encountered the same problem, when I run a PageRank program which has lots of stages (iterations)… The master was lost after my program finished. And the issue still remains even though I increased driver memory. Any idea, e.g. how to increase the master memory? Thanks. Best, Yifan LI

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
My question is why there are two similar parameters, String.class and StringDecoder.class; what is the difference between them? Ashish On Fri, Jun 26, 2015 at 8:53 AM, Akhil Das wrote: > JavaPairInputDStream messages = > KafkaUtils.createDirectStream( > jssc, > String.class, >

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
Forgot to mention: rank of 100 usually works ok, 120 consistently cannot finish. On Fri, Jun 26, 2015 at 10:18 AM, Ravi Mody wrote: > 1. These are my settings: > rank = 100 > iterations = 12 > users = ~20M > items = ~2M > training examples = ~500M-1B (I'm running into the issue even with 500M >

Re:

2015-06-26 Thread ๏̯͡๏
All these throw compilation error at newAPIHadoopFile 1) val hadoopConfiguration = new Configuration() hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.newAPIHadoopFile[AvroKey, NullWritable, AvroKeyInputFormat](path + "/*.avro", classOf[AvroKey],

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS

2015-06-26 Thread Ravi Mody
1. These are my settings: rank = 100 iterations = 12 users = ~20M items = ~2M training examples = ~500M-1B (I'm running into the issue even with 500M training examples) 2. The memory storage never seems to go too high. The user blocks may go up to ~10Gb, and each executor will have a few GB used o

How to recover in case user errors in streaming

2015-06-26 Thread Amit Assudani
Problem: how do we recover from user errors (connectivity issues / storage service down / etc.)? Environment: Spark streaming using Kafka Direct Streams Code Snippet: HashSet topicsSet = new HashSet(Arrays.asList("kafkaTopic1")); HashMap kafkaParams = new HashMap(); kafkaParams.put("metadata.brok

Re: sparkR could not find function "textFile"

2015-06-26 Thread Eskilson,Aleksander
Yeah, I ask because you might notice that by default the column types for CSV tables read in by read.df() are only strings (due to limitations in type inferencing in the DataBricks package). There was a separate discussion about schema inferencing, and Shivaram recently merged support for specif

Spark 1.4.0 - Using SparkR on EC2 Instance

2015-06-26 Thread RedOakMark
Good morning, I am having a bit of trouble finalizing the installation and usage of the newest Spark version 1.4.0, deploying to an Amazon EC2 instance and using RStudio to run on top of it. Using these instructions ( http://spark.apache.org/docs/latest/ec2-scripts.html

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Roberto Coluccio
I got a similar issue. Might your as well be related to this https://issues.apache.org/jira/browse/SPARK-8368 ? On Fri, Jun 26, 2015 at 2:00 PM, Akhil Das wrote: > Those provided spark libraries are compatible with scala 2.11? > > Thanks > Best Regards > > On Fri, Jun 26, 2015 at 4:48 PM, Srikan

Time series data

2015-06-26 Thread Caio Cesar Trucolo
Hi everyone! I am working with multiple time series data and in summary I have to adjust each time series (like inserting average values in data gaps) and then training regression models with mllib for each time series. The adjustment step I did with the adjustment function being mapped for each

spark streaming - checkpoint

2015-06-26 Thread ram kumar
Hi, - JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1)); ssc.checkpoint(checkPointDir); JavaStreamingContextFactory factory = new JavaStreamingContextFactory() { public JavaStreamingContext create() {

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Sea
Yes, I make it. -- Original message -- From: "Gerard Maas"; Date: Friday, June 26, 2015, 5:40; To: "Sea" <261810...@qq.com>; Cc: "user"; "dev"; Subject: Re: Time is ugly in Spark Streaming Are you sharing the SimpleDateFormat instance? This looks a lo

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Akhil Das
​JavaPairInputDStream messages = KafkaUtils.createDirectStream( jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet ); Here: jssc => JavaStreamingContext String.class => Key , Value classes

Dependency Injection with Spark Java

2015-06-26 Thread Michal Čizmazia
How to use Dependency Injection with Spark Java? Please could you point me to any articles/frameworks? Thanks!

Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
Hi, if I have the below data format, how can I use the Kafka direct stream to de-serialize it? I am not able to understand all the parameters I need to pass. Can someone explain what the arguments will be, as I am not clear about this? JavaPairInputDStream<K, V> org.apache.spark.streaming.kafk
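
One way to plug a custom value type into the direct stream is a kafka.serializer.Decoder implementation; a hedged Scala sketch (the Event type and its "id,payload" wire format are made up for illustration):

    import kafka.serializer.Decoder
    import kafka.utils.VerifiableProperties

    case class Event(id: Int, payload: String)            // placeholder value type

    // The direct stream typically instantiates decoders reflectively via a
    // constructor taking VerifiableProperties, hence this constructor signature.
    class EventDecoder(props: VerifiableProperties = null) extends Decoder[Event] {
      override def fromBytes(bytes: Array[Byte]): Event = {
        val Array(id, payload) = new String(bytes, "UTF-8").split(",", 2)   // assumes "id,payload"
        Event(id.toInt, payload)
      }
    }

    // Then, roughly:
    //   KafkaUtils.createDirectStream[String, Event, StringDecoder, EventDecoder](ssc, kafkaParams, topics)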

RE: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Huang, Jie
Thanks. In general, we can see a stable trend in the Spark master branch and the latest release. And we are also considering adding more benchmarks/workloads to this automated perf tool. Any comments and feedback are warmly welcomed. Thank you && Best Regards, Grace (Huang Jie) From: Nan Zhu [mailto:
