Re: AM creation in yarn client mode

2016-02-10 Thread Steve Loughran
On 10 Feb 2016, at 13:20, Manoj Awasthi wrote: On Wed, Feb 10, 2016 at 5:20 PM, Steve Loughran wrote: On 10 Feb 2016, at 04:42, praveen S

broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Hi there, I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold. According to the programming guide: *Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.* My question is that is
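
As a reference point, a minimal Scala sketch of the setting and command being discussed, run against a HiveContext in the Spark shell; the table name and threshold value are placeholders:

```scala
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc: the SparkContext from the Spark shell

// The broadcast threshold is in bytes; tables below it are broadcast in joins.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// Collect table-level statistics (sizeInBytes) without scanning the data files.
sqlContext.sql("ANALYZE TABLE small_dim_table COMPUTE STATISTICS NOSCAN")
```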

Re: AM creation in yarn client mode

2016-02-10 Thread Manoj Awasthi
My apologies for writing that "there is no AM". I realize it! :-) :-) On Wed, Feb 10, 2016 at 7:14 PM, Steve Loughran wrote: > > On 10 Feb 2016, at 13:20, Manoj Awasthi wrote: > > > > On Wed, Feb 10, 2016 at 5:20 PM, Steve Loughran

Re: [Spark Streaming] Spark Streaming dropping last lines

2016-02-10 Thread Nipun Arora
Hi All, I apologize for reposting, but I wonder if anyone can explain this behavior, and what would be the best way to resolve it without introducing something like Kafka in the middle. I basically have a logstash instance, and would like to stream the output of logstash to spark_streaming without

Re: AM creation in yarn client mode

2016-02-10 Thread Steve Loughran
On 10 Feb 2016, at 14:18, Manoj Awasthi wrote: My apologies for writing that "there is no AM". I realize it! :-) :-) There is the unmanaged AM option, which was originally written for debugging, but has been used in various apps. Spark

Re: [Spark Streaming] Spark Streaming dropping last lines

2016-02-10 Thread Dean Wampler
Here's a wild guess; it might be the fact that your first command uses tail -f, so it doesn't close the input file handle when it hits the end of the available bytes, while your second use of nc does this. If so, the last few lines might be stuck in a buffer waiting to be forwarded. If so, Spark

Re: broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Michael Armbrust
> > My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE <tableName> compute statistics" command in Hive shell, is the statistics going to be used by SparkSQL to decide broadcast join? Yes, Spark SQL will only accept the simple no scan version. However, as long as the sizeInBytes

Introducing spark-sklearn, a scikit-learn integration package for Spark

2016-02-10 Thread Tim Hunter
Hello community, I would like to introduce a new Spark package that should be useful for python users who depend on scikit-learn. Among other tools: - train and evaluate multiple scikit-learn models in parallel. - convert Spark's Dataframes seamlessly into numpy arrays - (experimental)

Kafka + Spark 1.3 Integration

2016-02-10 Thread Nipun Arora
Hi, I am trying some basic integration and was going through the manual. I would like to read from a topic, and get a JavaReceiverInputDStream for messages in that topic. However the example is of JavaPairReceiverInputDStream<>. How do I get a stream for only a single topic in Java? Reference

Re: Too many open files, why changing ulimit not effecting?

2016-02-10 Thread Michael Diamant
If you are using systemd, you will need to specify the limit in the service file. I had run into this problem and discovered the solution from the following references: * https://bugzilla.redhat.com/show_bug.cgi?id=754285#c1 * http://serverfault.com/a/678861 On Fri, Feb 5, 2016 at 1:18 PM, Nirav

Re: Pyspark - how to use UDFs with dataframe groupby

2016-02-10 Thread Davies Liu
short answer: PySpark does not support UDAF (user defined aggregate function) for now. On Tue, Feb 9, 2016 at 11:44 PM, Viktor ARDELEAN wrote: > Hello, > > I am using following transformations on RDD: > > rddAgg = df.map(lambda l: (Row(a = l.a, b= l.b, c = l.c), l))\ >

Re: broadcast join in SparkSQL requires analyze table noscan

2016-02-10 Thread Lan Jiang
Michael, Thanks for the reply. On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust wrote: > My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE <tableName> compute statistics" command in Hive shell, is the statistics going to be used by SparkSQL to

newbie how to access S3 cluster created using spark-ec2

2016-02-10 Thread Andy Davidson
I am using spark-1.6.0 and java. I created a cluster using spark-ec2. I am having a heck of a time figuring out how to write from my streaming app to AWS s3. I should mention I have never used s3 before and am not sure it is set up correctly. org.apache.hadoop.fs.s3.S3Exception:

retrieving all the rows with collect()

2016-02-10 Thread mich . talebzadeh
Hi, I have a bunch of files stored in the hdfs /unix_files directory, 319 files in total: scala> val errlog = sc.textFile("/unix_files/*.ksh") scala> errlog.filter(line => line.contains("sed"))count() res104: Long = 1113 So it returns 1113 instances of the word "sed". If I want to see the collection

Re: Kafka + Spark 1.3 Integration

2016-02-10 Thread Cody Koeninger
It's a pair because there's a key and value for each message. If you just want a single topic, put a single topic in the map of topic -> number of partitions. See https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java On
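
A minimal Scala sketch of the single-topic case against the Spark 1.3-era receiver API; the topic, ZooKeeper quorum and group id below are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

// One entry in the map = one topic; the value is the number of receiver threads for it.
val topics = Map("my-topic" -> 1)
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", topics)

// Each record is a (key, message) pair; keep only the message payload.
val messages = stream.map(_._2)
```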

supporting adoc files in spark-packages.org

2016-02-10 Thread Kiran Chitturi
Hi, We want to add the spark-solr repo (https://github.com/LucidWorks/spark-solr) to spark-packages.org, but it is currently failing due to "Cannot find README.md" (http://spark-packages.org/staging?id=882). We use adoc for our internal and external documentation and we are wondering if

Re: Spark Streaming : Limiting number of receivers per executor

2016-02-10 Thread Shixiong(Ryan) Zhu
You can't. The number of cores must be greater than the number of receivers. On Wed, Feb 10, 2016 at 2:34 AM, ajay garg wrote: > Hi All, > I am running 3 executors in my spark streaming application with 3 > cores per executor. I have written my custom receiver

reading ORC format on Spark-SQL

2016-02-10 Thread Philip Lee
What steps are involved when reading the ORC format in Spark-SQL? I mean, reading a csv file is usually just reading the dataset directly into memory, but I feel like Spark-SQL has some extra steps when reading the ORC format. For example, do they have to create a table to insert the dataset? And then they insert the

Re: retrieving all the rows with collect()

2016-02-10 Thread Chandeep Singh
Hi Mich, If you would like to print everything to the console you could - errlog.filter(line => line.contains("sed"))collect()foreach(println) or you could always save to a file using any of the saveAs methods. Thanks, Chandeep On Wed, Feb 10, 2016 at 8:14 PM, <

legal column names

2016-02-10 Thread Richard Cobbe
I'm working with Spark 1.5.0, and I'm using the Scala API to construct DataFrames and perform operations on them. My application requires that I synthesize column names for intermediate results under some circumstances, and I don't know what the rules are for legal column names. In particular,
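
For what it's worth, a small sketch (Spark 1.5-era API, invented column name) suggesting that non-identifier names generally work as long as they are backtick-quoted inside SQL or expression strings:

```scala
import org.apache.spark.sql.functions.col

// sqlContext is the SQLContext provided by the Spark shell.
val df = sqlContext.range(0, 3).withColumnRenamed("id", "intermediate result #1")

// Programmatic access by the exact name works directly...
df.select(col("intermediate result #1")).show()

// ...while SQL strings need backticks around the non-identifier name.
df.registerTempTable("t")
sqlContext.sql("SELECT `intermediate result #1` FROM t").show()
```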

Re: retrieving all the rows with collect()

2016-02-10 Thread Mich Talebzadeh
Hi Chandeep, Many thanks for your help. In the line below errlog.filter(line => line.contains("sed"))collect()foreach(println) Can you please clarify the components with the correct naming, as I am new to Scala: * errlog --> is the RDD? * filter(line =>

SparkListener - why is org.apache.spark.scheduler.JobFailed in scala private?

2016-02-10 Thread Sumona Routh
Hi there, I am trying to create a listener for my Spark job to do some additional notifications for failures using this Scala API: https://spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.scheduler.JobResult . My idea was to write something like this: override def onJobEnd(jobEnd:
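
Since JobFailed is package-private, one common workaround is to match on the public JobSucceeded case object and treat everything else as a failure. A sketch (the listener class name and the notification logic are made up):

```scala
import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}

class JobOutcomeListener extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = jobEnd.jobResult match {
    case JobSucceeded => // nothing to do on success
    case other        => println(s"Job ${jobEnd.jobId} did not succeed: $other")
  }
}

// Register on an existing SparkContext:
// sc.addSparkListener(new JobOutcomeListener)
```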

Re: Dataset joinWith condition

2016-02-10 Thread Ted Yu
bq. I followed something similar $"a.x" Please use expr("...") e.g. if your DataSet has two columns, you can write: ds.select(expr("_2 / _1").as[Int]) where _1 refers to first column and _2 refers to second. On Tue, Feb 9, 2016 at 3:31 PM, Raghava Mutharaju wrote:

Re: RDD distribution

2016-02-10 Thread Ted Yu
What Partitioner do you use ? Have you tried using RangePartitioner ? Cheers On Wed, Feb 10, 2016 at 3:54 PM, daze5112 wrote: > Hi im trying to improve the performance of some code im running but have > noticed that my distribution of my RDD across executors isn't
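
A minimal sketch of what RangePartitioner usage could look like for a key-value RDD; the data, key type and partition count are made up:

```scala
import org.apache.spark.RangePartitioner

val pairs = sc.parallelize(1 to 1000000).map(i => (i, i.toString))

// Sample the keys to build range boundaries, then shuffle into 11 partitions.
val partitioner = new RangePartitioner(11, pairs)
val balanced = pairs.partitionBy(partitioner)

// Inspect how many records landed in each partition.
balanced.mapPartitions(it => Iterator(it.size)).collect().foreach(println)
```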

saveToCassandra doesn't overwrite column

2016-02-10 Thread Hudong Wang
Hi, This is really weird. I checked my code and I only have a List[Boolean] of 7 items, and the default behavior should be overwrite. I even added overwrite after the column name in the SomeColumns definition but the result still shows List<<77>>, etc. It seems that in some way it ignores overwrite and just

Passing a dataframe to where clause + Spark SQL

2016-02-10 Thread Divya Gehlot
Hi, //Loading all the DB Properties val options1 = Map("url" -> "jdbc:oracle:thin:@xx.xxx.xxx.xx:1521:dbname","user"->"username","password"->"password","dbtable" -> "TESTCONDITIONS") val testCond = sqlContext.load("jdbc",options1 ) val condval = testCond.select("Cond") testCond.show() val

Is this Task Scheduler Error normal?

2016-02-10 Thread SLiZn Liu
Hi Spark Users, I’m running Spark jobs on Mesos, and sometimes I get a vast number of Task Scheduler errors: ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1161 because its task set is gone (this is likely the result of receiving duplicate task finished status updates). It

Re: Spark executor Memory profiling

2016-02-10 Thread Kuchekar
Hi Nirav, I faced similar issue with Yarn, EMR 1.5.2 and following Spark Conf helped me. You can set the values accordingly conf= (SparkConf().set("spark.master","yarn-client").setAppName("HalfWay" ).set("spark.driver.memory", "15G").set("spark.yarn.am.memory","15G"))

Spark Certification

2016-02-10 Thread naga sharathrayapati
Hello All, I am planning on taking the Spark Certification and I was wondering if one has to be well equipped with MLlib & GraphX as well or not? Please advise. Thanks

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-10 Thread Shiva Ramagopal
How are you submitting/running the job - via spark-submit or as a plain old Java program? If you are using spark-submit, you can control the memory setting via the configuration parameter spark.executor.memory in spark-defaults.conf. If you are running it as a Java program, use -Xmx to set the
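
For the spark-submit route, the same sizing can also be applied programmatically when the SparkConf is built. A sketch with placeholder values; note that spark.driver.memory set this way only takes effect if the driver JVM has not already started:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-sizing-example")
  .set("spark.executor.memory", "4g")   // per-executor JVM heap
  .set("spark.driver.memory", "4g")     // ignored if the driver JVM is already running
val sc = new SparkContext(conf)
```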

RDD uses another RDD in pyspark with SPARK-5063 issue

2016-02-10 Thread vince plum
Hi, pyspark experts, I'm trying to implement a naive Bayes lib with the same interface as pyspark.mllib.classification.NaiveBayes. train() and predict() will be the interfaces. I finished train(LabeledPoint), but ran into trouble in predict() due to the SPARK-5063 issue. *Exception: It appears that

Pyspark - How to add new column to dataframe based on existing column value

2016-02-10 Thread Viktor ARDELEAN
Hello, I want to add a new String column to the dataframe based on an existing column values: from pyspark.sql.functions import lit df.withColumn('strReplaced', lit(df.str.replace("a", "b").replace("c", "d"))) So basically I want to add a new column named "strReplaced", that is the same as the

Spark : Unable to connect to Oracle

2016-02-10 Thread Divya Gehlot
Hi, I am a newbie to Spark, using Spark version 1.5.2. I am trying to connect to an Oracle DB using the Spark API and getting errors. Steps I followed: Step 1- I placed the ojdbc6.jar in /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar Step 2- Registered the jar file

Re: Spark : Unable to connect to Oracle

2016-02-10 Thread Jorge Machado
Hi Divya, You need to install the Oracle JDBC driver into the lib folder on the cluster. > On 10/02/2016, at 09:37, Divya Gehlot wrote: > > oracle.jdbc.driver.OracleDrive

Re: Spark Streaming : Limiting number of receivers per executor

2016-02-10 Thread Yogesh Mahajan
Hi Ajay, Have you overridden the Receiver#preferredLocation method in your custom Receiver? You can specify a hostname for your Receiver. Check ReceiverSchedulingPolicy#scheduleReceivers; it should honor your preferredLocation value for Receiver scheduling. On Wed, Feb 10, 2016 at 4:04 PM, ajay
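
A skeletal sketch of a custom receiver pinned to a host via preferredLocation; the host name and the actual receive loop are placeholders:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class PinnedReceiver(host: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Hint to the scheduler about where this receiver should run.
  override def preferredLocation: Option[String] = Some(host)

  override def onStart(): Unit = {
    // start a background thread here that reads from the network and calls store(...)
  }

  override def onStop(): Unit = {
    // release any resources opened in onStart
  }
}
```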

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Ted Yu
Have you tried adding hbase client jars to spark.executor.extraClassPath ? Cheers On Wed, Feb 10, 2016 at 12:17 AM, Prabhu Joseph wrote: > + Spark-Dev > > For a Spark job on YARN accessing hbase table, added all hbase client jars > into spark.yarn.dist.files,
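
For reference, a sketch of the classpath approach being suggested; the jar locations are placeholders and must exist on every NodeManager host:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraClassPath", "/opt/hbase/lib/*")
  // The driver side may need spark-submit --driver-class-path instead,
  // since the driver JVM is already running when this conf is applied.
  .set("spark.driver.extraClassPath", "/opt/hbase/lib/*")
```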

Spark Streaming : Limiting number of receivers per executor

2016-02-10 Thread ajay garg
Hi All, I am running 3 executors in my spark streaming application with 3 cores per executor. I have written my custom receiver for receiving network data. In my current configuration I am launching 3 receivers, one receiver per executor. At runtime, if 2 of my executors die, I am left

Re: Pyspark - How to add new column to dataframe based on existing column value

2016-02-10 Thread ndjido
Hi Viktor, Try to create a UDF. It's quite simple! Ardo. > On 10 Feb 2016, at 10:34, Viktor ARDELEAN wrote: > > Hello, > > I want to add a new String column to the dataframe based on an existing > column values: > > from pyspark.sql.functions import lit >

Re: Spark : Unable to connect to Oracle

2016-02-10 Thread Rishi Mishra
AFAIK sc.addJar() will add the jars to the executors' classpath. The datasource resolution (createRelation) happens on the driver side, and the driver classpath should contain the ojdbc6.jar. You can use the "spark.driver.extraClassPath" config parameter to set the same. On Wed, Feb 10, 2016 at 3:08 PM, Jorge
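
A sketch of the JDBC read once the driver jar is on the driver classpath; the connection details below are placeholders, and sqlContext is the SQLContext from the Spark shell:

```scala
val options = Map(
  "url"      -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
  "driver"   -> "oracle.jdbc.OracleDriver",
  "dbtable"  -> "TESTCONDITIONS",
  "user"     -> "username",
  "password" -> "password")

// Fails with ClassNotFoundException unless the driver jar is on the *driver* classpath,
// e.g. via spark.driver.extraClassPath or spark-submit --driver-class-path.
val testCond = sqlContext.read.format("jdbc").options(options).load()
testCond.show()
```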

Is there a way to save csv file fast ?

2016-02-10 Thread Eli Super
Hi, I work with pyspark & spark 1.5.2. Currently saving an rdd into a csv file is very, very slow and uses only 2% CPU. I use: my_dd.write.format("com.databricks.spark.csv").option("header", "false").save('file:///my_folder') Is there a way to save csv faster? Many thanks

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Prabhu Joseph
Yes Ted, spark.executor.extraClassPath will work if hbase client jars is present in all Spark Worker / NodeManager machines. spark.yarn.dist.files is the easier way, as hbase client jars can be copied from driver machine or hdfs into container / spark-executor classpath automatically. No need to

Re: Spark Job on YARN accessing Hbase Table

2016-02-10 Thread Prabhu Joseph
+ Spark-Dev For a Spark job on YARN accessing hbase table, added all hbase client jars into spark.yarn.dist.files, NodeManager when launching container i.e executor, does localization and brings all hbase-client jars into executor CWD, but still the executor tasks fail with ClassNotFoundException

Re: AM creation in yarn client mode

2016-02-10 Thread Steve Loughran
On 10 Feb 2016, at 04:42, praveen S wrote: Hi, I have 2 questions when running the spark jobs on yarn in client mode : 1) Where is the AM(application master) created : in the cluster A) is it created on the client where the job was

Re: Is there a way to save csv file fast ?

2016-02-10 Thread Steve Loughran
> On 10 Feb 2016, at 10:56, Eli Super wrote: > > Hi > > I work with pyspark & spark 1.5.2 > > Currently saving rdd into csv file is very very slow , uses 2% CPU only > > I use : > my_dd.write.format("com.databricks.spark.csv").option("header", >

Re: Is there a way to save csv file fast ?

2016-02-10 Thread Gourav Sengupta
Hi, The writes, in terms of the number of records written simultaneously, can be increased if you increase the number of partitions. You can try increasing the number of partitions and check how it works. There is, though, an upper cap (the one that I faced in Ubuntu) on the number of parallel
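
A minimal Scala sketch of the idea, assuming the spark-csv package is on the classpath; the data, partition count and output path are made up:

```scala
import sqlContext.implicits._

val df = sc.parallelize(1 to 100000).map(i => (i, s"row$i")).toDF("id", "value")

df.repartition(32)                         // one writing task (and output file) per partition
  .write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .save("file:///tmp/my_folder")
```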

add kafka streaming jars when initialising the sparkcontext in python

2016-02-10 Thread David Kennedy
I have no problems when submitting the task using spark-submit. The --jars option with the list of jars required is successful and I see in the output the jars being added: 16/02/10 11:14:24 INFO spark.SparkContext: Added JAR file:/usr/lib/spark/extras/lib/spark-streaming-kafka.jar at

Re: Pyspark - How to add new column to dataframe based on existing column value

2016-02-10 Thread Viktor ARDELEAN
I figured it out. Here is how it's done: from pyspark.sql.functions import udf replaceFunction = udf(lambda columnValue : columnValue.replace("\n", " ").replace('\r', " ")) df.withColumn('strReplaced', replaceFunction(df["str"])) On 10 February 2016 at 13:04, wrote: > Hi

Re: AM creation in yarn client mode

2016-02-10 Thread Manoj Awasthi
On Wed, Feb 10, 2016 at 5:20 PM, Steve Loughran wrote: > > On 10 Feb 2016, at 04:42, praveen S wrote: > > Hi, > > I have 2 questions when running the spark jobs on yarn in client mode : > > 1) Where is the AM(application master) created : > > > in

RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Mohammed Guller
Why not use the save method from the RandomForestModel class to save a model at a specified path? Mohammed Author: Big Data Analytics with Spark -Original Message- From: jluan [mailto:jaylu...@gmail.com] Sent: Wednesday, February 10, 2016 5:57 PM To: user@spark.apache.org

RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Jay Luan
Thanks for the reply. I'd like to export the decision splits for each tree out to an external file which is read elsewhere, not using Spark. As far as I know, saving a model to a path will save a bunch of binary files which can be loaded back into spark. Is this correct? On Feb 10, 2016 7:21 PM,

Re: Dataset joinWith condition

2016-02-10 Thread Raghava Mutharaju
Thanks a lot Ted. If the two columns are of different types, say Int and Long, then will it be ds.select(expr("_2 / _1").as[(Int, Long)])? Regards, Raghava. On Wed, Feb 10, 2016 at 5:19 PM, Ted Yu wrote: > bq. I followed something similar $"a.x" > > Please use expr("...") >
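
A small sketch against the Spark 1.6 Dataset API: since division in Spark SQL produces a double, the type annotation describes the result of the expression rather than a tuple of the operand types (the data here is made up):

```scala
import org.apache.spark.sql.functions.expr
import sqlContext.implicits._

// Tuples become columns named _1 and _2.
val ds = Seq((2, 10L), (4, 20L)).toDS()

// "_2 / _1" evaluates to a double, so ask for Double, not (Int, Long).
val ratios = ds.select(expr("_2 / _1").as[Double])
ratios.show()
```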

Computing hamming distance over large data set

2016-02-10 Thread rokclimb15
Hi everyone, new to this list and Spark, so I'm hoping someone can point me in the right direction. I'm trying to perform this same sort of task: http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql and I'm running into the same problem - it doesn't
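
One way this is often attacked in Spark is a brute-force scan with bit counting; a sketch with made-up 64-bit hashes, query value and threshold:

```scala
val hashes = sc.parallelize(Seq(0x0FL, 0xFFL, 0xF0F0F0F0L))  // stand-in for the real data

val query = 0x0EL
val maxDistance = 3

// XOR isolates the differing bits; bitCount gives the Hamming distance.
val near = hashes.filter(h => java.lang.Long.bitCount(h ^ query) <= maxDistance)
near.collect().foreach(println)
```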

RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Mohammed Guller
Yes, a model saved with the save method can be read back only by the load method in the RandomForestModel object. Unfortunately, I don’t know any better mechanism for what you are trying to do. There was a discussion on this topic a few days ago, so if you search the mailing list archives, you

Re: retrieving all the rows with collect()

2016-02-10 Thread Ted Yu
Mich: When you execute the statements in Spark shell, you would see the types of the intermediate results. scala> val errlog = sc.textFile("/home/john/s.out") errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out MapPartitionsRDD[1] at textFile at <console>:24 scala> val sed = errlog.filter(line =>

RDD distribution

2016-02-10 Thread daze5112
Hi, I'm trying to improve the performance of some code I'm running but have noticed that the distribution of my RDD across executors isn't exactly even (see pic below). I'm using yarn and kicking it off with 11 executors. Not sure how to get a more even spread or if this is normal. Thanks

Re: Rest API for spark

2016-02-10 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/q3RTtvxWU21wl78x1=Re+Spark+job+submission+REST+API On Wed, Feb 10, 2016 at 3:37 PM, Tracy Li wrote: > Hi Spark Experts, > > I am new for spark and we have requirements to support spark job, jar, sql > etc(submit, manage).

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Hi Mich, your assumptions 1 to 3 are all correct (nitpick: they're method *calls*, the methods being the part before the parentheses, but I assume that's what you meant). The last one is also a method call but uses syntactic sugar on top: `foreach(println)` boils down to `foreach(line =>

Rest API for spark

2016-02-10 Thread Tracy Li
Hi Spark Experts, I am new to Spark and we have requirements to support spark jobs, jars, sql etc. (submit, manage). So far I did not find any REST API bundled with Spark. There are some third-party libs already supporting this: https://github.com/spark-jobserver/spark-jobserver I want to confirm, is

Re: retrieving all the rows with collect()

2016-02-10 Thread Mich Talebzadeh
Many thanks Jakob. So it basically boils down to this demarcation as suggested which looks clearer val errlog = sc.textFile("/unix_files/*.ksh") errlog.filter(line => line.contains("sed")).collect().foreach(line => println(line)) Regards, Mich On 10/02/2016 23:21, Jakob Odersky wrote:

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Exactly! As a final note, `foreach` is also defined on RDDs. This means that you don't need to `collect()` the results into an array (which could give you an OutOfMemoryError in case the RDD is really, really large) before printing them. Personally, when I'm learning a new library, I like to look
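
Putting the thread together, a small sketch contrasting the options discussed (the path is the one from the original post):

```scala
val errlog = sc.textFile("/unix_files/*.ksh")
val sedLines = errlog.filter(line => line.contains("sed"))

// Count without bringing any data to the driver.
println(sedLines.count())

// Bring everything to the driver and print it (risks OOM for very large results)...
sedLines.collect().foreach(println)

// ...or print a bounded sample instead.
sedLines.take(20).foreach(println)
```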

Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-10 Thread Nirav Patel
In Yarn we have the following settings enabled so that a job can, of course, use virtual memory to get capacity beyond physical memory: yarn.nodemanager.vmem-check-enabled = false, yarn.nodemanager.pmem-check-enabled = false. The vmem to pmem ratio is 2:1. However spark

Re: Rest API for spark

2016-02-10 Thread Tracy Li
Thanks Ted. That means we have to use either Livy or job-server if we want to go with a REST API. Thanks, Tracy On Wed, Feb 10, 2016 at 4:13 PM, Ted Yu wrote: > Please see this thread: > > > http://search-hadoop.com/m/q3RTtvxWU21wl78x1=Re+Spark+job+submission+REST+API > >

Re: How to collect/take arbitrary number of records in the driver?

2016-02-10 Thread Jakob Odersky
Another alternative: rdd.take(1000).drop(100) //this also preserves ordering Note however that this can lead to an OOM if the data you're taking is too large. If you want to perform some operation sequentially on your driver and don't care about performance, you could do something similar as
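
A sketch of one such alternative using zipWithIndex, which keeps the slicing distributed instead of pulling the first 1000 records to the driver (the bounds are illustrative):

```scala
val rdd = sc.parallelize(1 to 10000)

// Pair each element with its position, keep the desired range, then drop the index.
val slice = rdd.zipWithIndex()
  .filter { case (_, idx) => idx >= 100 && idx < 1000 }
  .map { case (value, _) => value }

slice.collect().foreach(println)
```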

Spark executor Memory profiling

2016-02-10 Thread Nirav Patel
We have been trying to solve a memory issue with a spark job that processes 150GB of data (on disk). It does a groupBy operation; some of the executors will receive somewhere around 2-4M Scala case objects to work with. We are using the following spark config: "executorInstances": "15",

[MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread jluan
I've trained a RandomForest classifier where I can print my model's decisions using model.toDebugString. However, I was wondering if there's a way to extract the trees programmatically by traversing the nodes, or in some other way, such that I can write my own decision file rather than just a debug