Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
By the way, this happens when I stopped the Driver process ... On Tue, May 19, 2015 at 12:29 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: You mean to say within Runtime.getRuntime().addShutdownHook I call ssc.stop(stopSparkContext = true, stopGracefully = true) ? This

Re: TwitterUtils on Windows

2015-05-19 Thread Akhil Das
Hi Justin, Can you try with sbt, maybe that will help. - Install sbt for windows http://www.scala-sbt.org/0.13/tutorial/Installing-sbt-on-Windows.html - Create a lib directory in your project directory - Place these jars in it: - spark-streaming-twitter_2.10-1.3.1.jar -

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Sean Owen
I don't think you should rely on a shutdown hook. Ideally you try to stop it in the main exit path of your program, even in case of an exception. On Tue, May 19, 2015 at 7:59 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: You mean to say within
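A minimal sketch of the pattern Sean describes (the app name, batch interval and socket source are placeholders, not from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object GracefulStopExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("graceful-stop-example")
        val ssc = new StreamingContext(conf, Seconds(1))
        // Placeholder source and output operation; the real pipeline goes here.
        ssc.socketTextStream("localhost", 9999).print()
        ssc.start()
        try {
          ssc.awaitTermination()
        } finally {
          // Stop in the main exit path, even if an exception is thrown above.
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      }
    }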

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
Thanks Sean, you are right. If the driver program is running then I can handle shutdown in the main exit path. But if the Driver machine crashes (or if you just stop the application, for example by killing the driver process), then a shutdown hook is the only option, isn't it? What I am trying to say is, just doing

Re: group by and distinct performance issue

2015-05-19 Thread Akhil Das
Hi Peer, If you open the driver UI (running on port 4040) you can see the stages and the tasks happening inside it. The best way to identify the bottleneck for a stage is to see if there's any time spent on GC, and how many tasks there are per stage (it should be a number >= the total # of cores to achieve

Re: --jars works in yarn-client but not yarn-cluster mode, why?

2015-05-19 Thread Fengyun RAO
Thanks, Marcelo! Below is the full log, SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Tathagata Das
If you wanted to stop it gracefully, then why are you not calling ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't matter whether the shutdown hook was called or not. TD On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi,

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-19 Thread Tathagata Das
If you want the fileStream to start only after a certain event has happened, why not start the streamingContext after that event? TD On Sun, May 17, 2015 at 7:51 PM, Haopu Wang hw...@qilinsoft.com wrote: I want to use file stream as input. And I look at the SparkStreaming document again, it's

Re: Broadcast variables can be rebroadcast?

2015-05-19 Thread N B
Hi Imran, If I understood you correctly, you are suggesting to simply call broadcast again from the driver program. This is exactly what I am hoping will work as I have the Broadcast data wrapped up and I am indeed (re)broadcasting the wrapper over again when the underlying data changes. However,

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Dibyendu Bhattacharya
You mean to say within Runtime.getRuntime().addShutdownHook I call ssc.stop(stopSparkContext = true, stopGracefully = true) ? This won't work anymore in 1.4. The SparkContext got stopped before Receiver processed all received blocks and I see below exception in logs. But if I add the

Re: org.apache.spark.shuffle.FetchFailedException :: Migration from Spark 1.2 to 1.3

2015-05-19 Thread Akhil Das
There was some similar discussion on JIRA https://issues.apache.org/jira/browse/SPARK-3633; maybe that will give you some insights. Thanks Best Regards On Mon, May 18, 2015 at 10:49 PM, zia_kayani zia.kay...@platalytics.com wrote: Hi, I'm getting this exception after shifting my code

group by and distinct performance issue

2015-05-19 Thread Peer, Oded
I am running Spark over Cassandra to process a single table. My task reads a single day's worth of data from the table and performs 50 group by and distinct operations, counting distinct userIds by different grouping keys. My code looks like this: JavaRDD<Row> rdd =
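For reference, a rough Scala sketch of the distinct-count-per-key part (the tiny in-memory dataset stands in for the day's rows read from Cassandra; field names are invented):

    import org.apache.spark.{SparkConf, SparkContext}

    object DistinctCountSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("distinct-count-sketch"))
        // Placeholder data standing in for one day's rows: (groupingKey, userId)
        val pairs = sc.parallelize(Seq(("us", 1L), ("us", 1L), ("us", 2L), ("uk", 3L)))
        // Distinct userIds per grouping key; distinct() before counting avoids
        // collecting all userIds for a key in one place.
        val distinctUsersPerKey = pairs.distinct().mapValues(_ => 1L).reduceByKey(_ + _)
        distinctUsersPerKey.collect().foreach(println)
        sc.stop()
      }
    }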

spark streaming doubt

2015-05-19 Thread Shushant Arora
What happens in a streaming application if one job is not yet finished when the stream interval is reached? Does it start the next job, or wait for the first to finish while the remaining jobs keep accumulating in a queue? Say I have a streaming application with a stream interval of 1 sec, but my job takes 2 min to

Re: Spark and Flink

2015-05-19 Thread Pa Rö
that sounds good, maybe you can send me a pseudo structure; this is my first maven project. best regards, paul 2015-05-18 14:05 GMT+02:00 Robert Metzger rmetz...@apache.org: Hi, I would really recommend you put your Flink and Spark dependencies into different maven modules. Having them both

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
It will be a single job running at a time by default (you can also configure spark.streaming.concurrentJobs to run jobs in parallel, which is not recommended in production). Now, your batch duration being 1 sec and processing time being 2 minutes, if you are using a receiver based

Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Night Wolf
Hi all, I have a job that, for every row, creates about 20 new objects (i.e. RDD of 100 rows in = RDD 2000 rows out). The reason for this is each row is tagged with a list of the 'buckets' or 'windows' it belongs to. The actual data is about 10 billion rows. Each executor has 60GB of memory.

Re: py-files (and others?) not properly set up in cluster-mode Spark Yarn job?

2015-05-19 Thread Shay Rojansky
Thanks for the quick response and confirmation, Marcelo, I just opened https://issues.apache.org/jira/browse/SPARK-7725. On Mon, May 18, 2015 at 9:02 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Shay, Yeah, that seems to be a bug; it doesn't seem to be related to the default FS nor

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
spark.streaming.concurrentJobs takes an integer value, not boolean. If you set it as 2 then 2 jobs will run parallel. Default value is 1 and the next job will start once it completes the current one. Actually, in the current implementation of Spark Streaming and under default configuration,
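If you still want to experiment with it outside production, it is just a SparkConf entry; a hedged sketch (app name and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Allow up to 2 streaming jobs to run at the same time (default is 1).
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-example")
      .set("spark.streaming.concurrentJobs", "2")
    val ssc = new StreamingContext(conf, Seconds(1))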

Re: TwitterUtils on Windows

2015-05-19 Thread Steve Loughran
On 19 May 2015, at 03:08, Justin Pihony justin.pih...@gmail.com wrote: 15/05/18 22:03:14 INFO Executor: Fetching http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with timestamp 1432000973058 15/05/18 22:03:14 INFO Utils: Fetching

Re: Working with slides. How do I know how many times a RDD has been processed?

2015-05-19 Thread Guillermo Ortiz
I tried to insert a flag in the RDD, so I could set a counter in the last position; when the counter reaches X, I could do something. But each slide brings the original RDD even though I modified it. I wrote this code to check whether this is possible but it doesn't work. val rdd1WithFlag = rdd1.map {

AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Hi all, I might be missing something, but does the new Spark 1.3 sqlContext save interface support using Avro as the schema structure when writing Parquet files, in a similar way to AvroParquetWriter (which I've got working)? I've seen how you can load an avro file and save it as parquet from

How to use spark to access HBase with Security enabled

2015-05-19 Thread donhoff_h
Hi, experts. I ran the HBaseTest program which is an example from the Apache Spark source code to learn how to use spark to access HBase. But I met the following exception: Exception in thread main org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:

Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Cheng Lian
That's right. Also, Spark SQL can automatically infer schema from JSON datasets. You don't need to specify an Avro schema: sqlContext.jsonFile("json/path").saveAsParquetFile("parquet/path") or with the new reader/writer API introduced in 1.4-SNAPSHOT:
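The preview cuts off here; the 1.4 reader/writer call is roughly the following, assuming the same placeholder paths:

    // Spark 1.4+ DataFrameReader / DataFrameWriter API
    sqlContext.read.json("json/path").write.parquet("parquet/path")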

RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that's brilliant, you've saved me a headache. Ewan From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: 19 May 2015 11:58 To: Ewan Leith; user@spark.apache.org Subject: Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces? That's right.

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I am using spark 1.3.1 Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:34 PM, Wangfei (X) wangf...@huawei.com wrote: And which version are you using? Sent from my iPhone. On 19 May 2015, at 18:29, ayan guha guha.a...@gmail.com wrote: can you kindly share your code? On

Re: Spark SQL on large number of columns

2015-05-19 Thread Wangfei (X)
And which version are you using? Sent from my iPhone. On 19 May 2015, at 18:29, ayan guha guha.a...@gmail.com wrote: can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to run spark sql

RE: Spark 1.3.1 Performance Tuning/Patterns for Data Generation Heavy/Throughput Jobs

2015-05-19 Thread Evo Eftimov
Is that a Spark or a Spark Streaming application? Re the map transformation which is required, you can also try flatMap. Finally, an Executor is essentially a JVM spawned by a Spark Worker Node or YARN – giving 60GB RAM to a single JVM will certainly result in “off the charts” GC. I would
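A small sketch of the flatMap suggestion (the bucket-tagging logic and SparkContext `sc` are assumed for illustration):

    // Assuming an existing SparkContext `sc` (e.g. in spark-shell).
    case class Tagged(rowId: Long, bucket: Int)

    val rows = sc.parallelize(1L to 100L)
    // flatMap: each input row expands into one output row per bucket it belongs to.
    val tagged = rows.flatMap { id =>
      (0 until 20).map(b => Tagged(id, b))   // ~20 buckets per row, as in the original post
    }
    println(tagged.count())                  // 100 rows in -> 2000 rows out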

Re: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Cheng Lian
Hi Ewan, Different from AvroParquetWriter, in Spark SQL we use StructType as the intermediate schema format. So when converting Avro files to Parquet files, we internally convert the Avro schema to a Spark SQL StructType first, and then convert the StructType to a Parquet schema. Cheng On 5/19/15

Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I am trying to run a spark sql aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where spark takes a huge amount of time to parse the sql and create a logical plan. Even if I have just one row, it's taking more than 1 hour just to get past the parsing. Any

RE: AvroParquetWriter equivalent in Spark 1.3 sqlContext Save or createDataFrame Interfaces?

2015-05-19 Thread Ewan Leith
Thanks Cheng, that makes sense. So for new dataframe creation (not conversion from Avro but from JSON or CSV inputs) in Spark we shouldn't worry about using Avro at all, just use the Spark SQL StructType when building new Dataframes? If so, that will be a lot simpler! Thanks, Ewan From: Cheng
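For reference, a minimal sketch of building a DataFrame directly from a StructType, assuming an existing SQLContext `sqlContext` and SparkContext `sc` (column names and data are made up):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val rowRDD = sc.parallelize(Seq(Row("Alice", 10), Row("Bob", 5)))
    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.saveAsParquetFile("parquet/path")   // df.write.parquet(...) from 1.4 onwards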

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, I have fields from field_0 to field_26000. The query selects max( cast($columnName as double)), |min(cast($columnName as double)), avg(cast($columnName as double)), count(*) for all those 26000 fields in one query. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19,

Reading Binary files in Spark program

2015-05-19 Thread Tapan Sharma
Hi Team, I am new to Spark and learning. I am trying to read image files into spark job. This is how I am doing: Step 1. Created sequence files with FileName as Key and Binary image as value. i.e. Text and BytesWritable. I am able to read these sequence files into Map Reduce programs. Step 2. I

Re: Spark SQL on large number of columns

2015-05-19 Thread ayan guha
can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to run a spark sql aggregation on a file with 26k columns. The number of rows is very small. I am running into an issue where spark takes a huge amount of time to parse the sql and

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Additional information: the table is backed by a CSV file which is read using spark-csv from Databricks. Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:05 PM, madhu phatak phatak@gmail.com wrote: Hi, I have fields from field_0 to field_26000. The query

Re: How to use spark to access HBase with Security enabled

2015-05-19 Thread Ted Yu
Which user did you run your program as ? Have you granted proper permission on hbase side ? You should also check master log to see if there was some clue. Cheers On May 19, 2015, at 2:41 AM, donhoff_h 165612...@qq.com wrote: Hi, experts. I ran the HBaseTest program which is an

Re: Spark Job not using all nodes in cluster

2015-05-19 Thread ayan guha
What does your spark-env file say? Are you setting the number of executors in the spark context? On 20 May 2015 13:16, Shailesh Birari sbirar...@gmail.com wrote: Hi, I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB of RAM. I have around 600,000+ Json files on HDFS. Each file

Re: Spark logo license

2015-05-19 Thread Matei Zaharia
Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote: What is the license on using the spark logo. Is it free to be used for displaying

Re: Spark logo license

2015-05-19 Thread Justin Pihony
Thanks! On Wed, May 20, 2015 at 12:41 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote: What is the license on using the

Hive on Spark VS Spark SQL

2015-05-19 Thread guoqing0...@yahoo.com.hk
Between Hive on Spark and Spark SQL, which is better, and what are the key characteristics, advantages, and disadvantages of each? guoqing0...@yahoo.com.hk

Re: Spark Streaming to Kafka

2015-05-19 Thread Saisai Shao
I think here is the PR https://github.com/apache/spark/pull/2994 you could refer to. 2015-05-20 13:41 GMT+08:00 twinkle sachdeva twinkle.sachd...@gmail.com: Hi, As Spark streaming is being nicely integrated with consuming messages from Kafka, so I thought of asking the forum, that is there

Re: Reading Binary files in Spark program

2015-05-19 Thread Tapan Sharma
Thanks. I will try and let you know. But what exactly is the issue? Any pointers? Regards Tapan On Tue, May 19, 2015 at 6:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try something like: JavaPairRDD<IntWritable, Text> output = sc.newAPIHadoopFile(inputDir,

Spark Streaming to Kafka

2015-05-19 Thread twinkle sachdeva
Hi, As Spark streaming is nicely integrated with consuming messages from Kafka, I thought of asking the forum: is there any implementation available for pushing data to Kafka from Spark Streaming too? Any link(s) will be helpful. Thanks and Regards, Twinkle
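There was no built-in writer at the time; one common hand-rolled approach, sketched here with the plain Kafka producer client (broker address, topic and the String DStream `dstream` are assumptions), is to create a producer inside foreachPartition:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Create the producer on the executor, once per partition, not on the driver.
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
        producer.close()
      }
    }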

sparkSQL - Hive metastore connection hangs with MS SQL server

2015-05-19 Thread jamborta
Hi all, I am trying to setup an external metastore using Microsoft SQL on Azure, it works ok initially but after about 5 mins inactivity it hangs, then times out after 15 mins with this error: 15/05/20 00:02:49 ERROR ConnectionHandle: Database access problem. Killing off this connection and all

Spark users

2015-05-19 Thread Ricardo Goncalves da Silva
Hi, I'm learning spark focused on data and machine learning, migrating from SAS. Is there a group for it? My questions are basic for now and I am getting very few answers. Tal Rick. Sent from my Samsung Galaxy smartphone. This message and its attachments

Re: How to use spark to access HBase with Security enabled

2015-05-19 Thread donhoff_h
Sorry, this ref does not help me. I have set up the configuration in hbase-site.xml. But it seems there are still some extra configurations to be set or APIs to be called to make my spark program able to pass authentication with HBase. Does anybody know how to set authentication to

RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Chandra Mohan, Ananda Vel Murugan
Hi, Thanks for the response. I was looking for a java solution. I will check the scala and python ones. Regards, Anand.C From: Todd Nist [mailto:tsind...@gmail.com] Sent: Tuesday, May 19, 2015 6:17 PM To: Chandra Mohan, Ananda Vel Murugan Cc: ayan guha; user Subject: Re: Spark sql error while

Spark Job not using all nodes in cluster

2015-05-19 Thread Shailesh Birari
Hi, I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB of RAM. I have around 600,000+ Json files on HDFS. Each file is small around 1KB in size. Total data is around 16GB. Hadoop block size is 256MB. My application reads these files with sc.textFile() (or sc.jsonFile()

Re: Reading Binary files in Spark program

2015-05-19 Thread Tapan Sharma
The problem is still there. The exception does not come at the time of reading. Also the count of the JavaPairRDD is as expected. It is when we call the collect() or toArray() methods that the exception comes. It is something to do with the Text class even though I haven't used it in the program. Regards Tapan On

Re: EOFException using KryoSerializer

2015-05-19 Thread Imran Rashid
Hi Jim, this is definitely strange. It sure sounds like a bug, but it is also a very commonly used code path, so at the very least you must be hitting a corner case. Could you share a little more info with us? What version of spark are you using? How big is the object you are trying to

spark 1.3.1 jars in repo1.maven.org

2015-05-19 Thread Edward Sargisson
Hi, I'd like to confirm an observation I've just made. Specifically that spark is only available in repo1.maven.org for one Hadoop variant. The Spark source can be compiled against a number of different Hadoops using profiles. Yay. However, the spark jars in repo1.maven.org appear to be compiled

Re: spark 1.3.1 jars in repo1.maven.org

2015-05-19 Thread Ted Yu
I think your observation is correct. e.g. http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.3.1 shows that it depends on hadoop-client http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client from hadoop 2.2 Cheers On Tue, May 19, 2015 at 6:17 PM, Edward Sargisson

Spark logo license

2015-05-19 Thread Justin Pihony
What is the license on using the Spark logo? Is it free to be used for display commercially? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logo-license-tp22952.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Find KNN in Spark SQL

2015-05-19 Thread Debasish Das
The batch version of this is part of the rowSimilarities JIRA 4823 ... if your query points can fit in memory there is a broadcast version which we are experimenting with internally ... we are using brute force KNN right now in the PR ... based on the FLANN paper, LSH did not work well, but before you go to

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Tested for calculating values for 300 columns. The analyser takes around 4 minutes to generate the plan. Is this normal? Regards, Madhukara Phatak http://datamantra.io/ On Tue, May 19, 2015 at 4:35 PM, madhu phatak phatak@gmail.com wrote: Hi, I am using spark 1.3.1 Regards,

Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Todd Nist
I believe you're looking for df.na.fill in Scala; in the pySpark module it is fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html). From the docs:

df4.fillna({'age': 50, 'name': 'unknown'}).show()
age  height  name
10   80      Alice
5    null    Bob
50   null    Tom
50   null    unknown

On

Re: Broadcast variables can be rebroadcast?

2015-05-19 Thread Imran Rashid
hmm, I guess it depends on the way you look at it. In a way, I'm saying that spark does *not* have any built in auto-re-broadcast if you try to mutate a broadcast variable. Instead, you should create something new, and just broadcast it separately. Then just have all the code you have operating
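A bare-bones sketch of the pattern Imran describes, assuming an existing SparkContext `sc` (the loader and refresh trigger are invented):

    // Placeholder loader standing in for wherever the reference data really comes from.
    def loadReferenceData(): Map[String, Int] = Map("a" -> 1, "b" -> 2)

    var refData = sc.broadcast(loadReferenceData())

    // When the underlying data changes, broadcast a *new* variable instead of
    // mutating the old one; code that reads refData.value picks up the new copy
    // on the next job.
    def refresh(): Unit = {
      val fresh = sc.broadcast(loadReferenceData())
      refData.unpersist(blocking = false)   // optionally release the old blocks
      refData = fresh
    }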

Re: How to use spark to access HBase with Security enabled

2015-05-19 Thread donhoff_h
The principal is sp...@bgdt.dev.hrb. It is the user that I used to run my spark programs. I am sure I have run the kinit command to make it take effect. And I also used the HBase Shell to verify that this user has the right to scan and put the tables in HBase. Now I still have no idea how to

RE: Decision tree: categorical variables

2015-05-19 Thread Keerthi
Hi, can you please share how you resolved the parsing issue? It would be of great help... Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-categorical-variables-tp12433p22943.html Sent from the Apache Spark User List mailing list

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Another update: when run on more than 1000 columns I am getting Could not write class __wrapper$1$40255d281a0d4eacab06bcad6cf89b0d/__wrapper$1$40255d281a0d4eacab06bcad6cf89b0d$$anonfun$wrapper$1$$anon$1 because it exceeds JVM code size limits. Method apply's code too large! Regards,

Re: Reading Real Time Data only from Kafka

2015-05-19 Thread Akhil Das
Cool. Thanks for the detailed response Cody. Thanks Best Regards On Tue, May 19, 2015 at 6:43 PM, Cody Koeninger c...@koeninger.org wrote: If those questions aren't answered by https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md please let me know so I can update it.

Re: Spark and Flink

2015-05-19 Thread Till Rohrmann
I guess it's a typo: eu.stratosphere should be replaced by org.apache.flink On Tue, May 19, 2015 at 1:13 PM, Alexander Alexandrov alexander.s.alexand...@gmail.com wrote: We managed to do this with the following config: // properties <!-- Hadoop -->

Hive in IntelliJ

2015-05-19 Thread Heisenberg Bb
I was trying to implement this example: http://spark.apache.org/docs/1.3.1/sql-programming-guide.html#hive-tables It worked well when I built spark in terminal using command specified: http://spark.apache.org/docs/1.3.1/building-spark.html#building-with-hive-and-jdbc-support But when I try to

Re: Spark SQL on large number of columns

2015-05-19 Thread madhu phatak
Hi, Tested with HiveContext also. It also takes a similar amount of time. To make things clear, the following is the select clause for a given column *aggregateStats( $columnName , max( cast($columnName as double)), |min(cast($columnName as double)), avg(cast($columnName as double)), count(*) )*

Re: Reading Real Time Data only from Kafka

2015-05-19 Thread Cody Koeninger
If those questions aren't answered by https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md please let me know so I can update it. If you set auto.offset.reset to largest, it will start at the largest offset. Any messages before that will be skipped, so if prior runs of the

Re: spark streaming doubt

2015-05-19 Thread Shushant Arora
So for Kafka+spark streaming, receiver-based streaming uses the high-level API and non-receiver-based streaming uses the low-level API. 1. In high-level receiver-based streaming, does it register consumers at each job start (whenever a new job is launched by the streaming application, say at each second)? 2. No

Re: PySpark Job throwing IOError

2015-05-19 Thread Muralidhar, Nikhil
Hello all, I have an error in pyspark for which I have not the faintest idea of the cause. All I can tell from the stack trace is that it can't find a pyspark file on the path /mnt/spark-*/pyspark-*. Apart from that I need someone more experienced than me with Spark to look into it and help

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
On Tue, May 19, 2015 at 8:10 PM, Shushant Arora shushantaror...@gmail.com wrote: So for Kafka+spark streaming, receiver-based streaming uses the high-level API and non-receiver-based streaming uses the low-level API. 1. In high-level receiver-based streaming, does it register consumers at each job

Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-19 Thread Tomasz Fruboes
Dear Experts, we have a spark cluster (standalone mode) in which master and workers are started from root account. Everything runs correctly to the point when we try doing operations such as dataFrame.select(name, age).save(ofile, parquet) or rdd.saveAsPickleFile(ofile) , where

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config setting that will limit the total # of cores it grabs. This was there in earlier versions too. Matei On May 19, 2015, at 12:39 PM, Thomas Dudziak
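For reference, the coarse-grained cap Matei mentions is just configuration; a hedged example (the value 48 is arbitrary):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("mesos-cap-example")
      .set("spark.mesos.coarse", "true")   // coarse-grained Mesos mode
      .set("spark.cores.max", "48")        // cap on the total # of cores the app grabs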

PanTera Big Data Visualization built with Spark

2015-05-19 Thread Cyrus Handy
Hi, Can you please add us to the list of Spark Users Org: PanTera URL: http://pantera.io Components we are using: - PanTera uses direct access to the Spark Scala API - Spark Core – SparkContext, JavaSparkContext, SparkConf, RDD, JavaRDD, - Accumulable, AccumulableParam, Accumulator,

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
I'm using fine-grained for a multi-tenant environment which is why I would welcome the limit of tasks per job :) cheers, Tom On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the

Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Thomas Dudziak
I read the other day that there will be a fair number of improvements in 1.4 for Mesos. Could I ask for one more (if it isn't already in there): a configurable limit for the number of tasks for jobs run on Mesos ? This would be a very simple yet effective way to prevent a job dominating the

Code error

2015-05-19 Thread Ricardo Goncalves da Silva
Hi, Can anybody see what's wrong in this piece of code: ./bin/spark-shell --num-executors 2 --executor-memory 512m --master yarn-client import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors val data = sc.textFile("/user/p_loadbd/fraude5.csv").map(x =>
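The snippet is cut off at the map, so the actual error isn't visible; for comparison, a minimal working KMeans pipeline looks roughly like this (the CSV layout, delimiter and parameters are assumptions):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assuming the file is comma-separated with only numeric columns.
    val data = sc.textFile("/user/p_loadbd/fraude5.csv")
    val parsed = data.map(line => Vectors.dense(line.split(',').map(_.toDouble))).cache()
    val model = KMeans.train(parsed, k = 5, maxIterations = 20)
    parsed.take(3).foreach(v => println(model.predict(v)))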

Re: spark streaming doubt

2015-05-19 Thread Dibyendu Bhattacharya
Just to add, there is a Receiver based Kafka consumer which uses Kafka Low Level Consumer API. http://spark-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com wrote: On Tue, May 19, 2015 at 8:10 PM,

Re: Decision tree: categorical variables

2015-05-19 Thread Ram Sriharsha
Hi Keerthi As Xiangrui mentioned in the reply, the categorical variables are assumed to be encoded as integers between 0 and k - 1, if k is the parameter you are passing as the category info map. So you will need to handle this during parsing (your columns 3 and 6 need to be converted into ints
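A rough sketch of what Ram describes, assuming for illustration that columns 3 and 6 have 4 and 3 categories and are already re-encoded as 0..k-1 during parsing (data and parameters are made up):

    // Assuming an existing SparkContext `sc`.
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    // Columns 3 and 6 are categorical, already mapped to 0.0 .. k-1 during parsing.
    val labeledData = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 2.0, 0.5, 0.0, 3.0, 1.5, 2.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.5, 3.0, 2.0, 0.5, 1.0))))

    val model = DecisionTree.trainClassifier(
      labeledData,
      numClasses = 2,
      categoricalFeaturesInfo = Map(3 -> 4, 6 -> 3),   // feature index -> number of categories
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32)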

Re: Broadcast variables can be rebroadcast?

2015-05-19 Thread N B
Thanks Imran. It does help clarify. I believe I had it right all along then but was confused by documentation talking about never changing the broadcasted variables. I've tried it in a local mode process till now and it does seem to work as intended. When (and if !) we start running on a real

Re: Reading Binary files in Spark program

2015-05-19 Thread Akhil Das
Try something like: JavaPairRDD<IntWritable, Text> output = sc.newAPIHadoopFile(inputDir, org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class, IntWritable.class, Text.class, new Job().getConfiguration()); With the type of input format that you require. Thanks Best

Re: group by and distinct performance issue

2015-05-19 Thread Todd Nist
You may want to look at this tooling for helping identify performance issues and bottlenecks: https://github.com/kayousterhout/trace-analysis I believe this is slated to become part of the web ui in the 1.4 release, in fact based on the status of the JIRA,

Re: How to use spark to access HBase with Security enabled

2015-05-19 Thread Ted Yu
Please take a look at: http://hbase.apache.org/book.html#_client_side_configuration_for_secure_operation Cheers On Tue, May 19, 2015 at 5:23 AM, donhoff_h 165612...@qq.com wrote: The principal is sp...@bgdt.dev.hrb. It is the user that I used to run my spark programs. I am sure I have run

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Yeah, this definitely seems useful there. There might also be some ways to cap the application in Mesos, but I'm not sure. Matei On May 19, 2015, at 1:11 PM, Thomas Dudziak tom...@gmail.com wrote: I'm using fine-grained for a multi-tenant environment which is why I would welcome the limit

Re: spark streaming doubt

2015-05-19 Thread Shushant Arora
Thanks Akhil and Dibyendu. In high-level receiver-based streaming, do executors run on the receiver nodes themselves to have data locality? Or is data always transferred to executor nodes, with the executor nodes differing in each run of the job but the receiver nodes remaining the same (same machines) throughout the life of

Re: Getting the best parameter set back from CrossValidatorModel

2015-05-19 Thread Joseph Bradley
Hi Justin and Ram, To clarify, PipelineModel.stages is not private[ml]; only the PipelineModel constructor is private[ml]. So it's safe to use pipelineModel.stages as a Spark user. Ram's example looks good. Btw, in Spark 1.4 (and the current master build), we've made a number of improvements to

Re: How to implement an Evaluator for a ML pipeline?

2015-05-19 Thread Xiangrui Meng
The documentation needs to be updated to state that higher metric values are better (https://issues.apache.org/jira/browse/SPARK-7740). I don't know why if you negate the return value of the Evaluator you still get the highest regularization parameter candidate. Maybe you should check the log

Add to Powered by Spark page

2015-05-19 Thread Michal Klos
Hi, We would like to be added to the Powered by Spark list: organization name: Localytics URL: http://eng.localytics.com/ a list of which Spark components you are using: Spark, Spark Streaming, MLLib a short description of your use case: Batch, real-time, and predictive analytics driving our

Re: rdd.sample() methods very slow

2015-05-19 Thread Sean Owen
The way these files are accessed is inherently sequential-access. There isn't a way, in general, to know where record N is in a file like this and jump to it. So they must be read to be sampled. On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Hi

Exception when using CLUSTER BY or ORDER BY

2015-05-19 Thread Thomas Dudziak
Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing a HQL query using HiveContext (Spark 1.3.1 on Mesos, fine-grained mode). Is this a known problem or should I file a JIRA for it ? org.apache.spark.SparkException: Can only zip RDDs with same

Re: Naming an DF aggregated column

2015-05-19 Thread Michael Armbrust
customerDF.groupBy("state").agg(max($"discount").alias("newName")) (or .as(...), both functions can take a String or a Symbol) On Tue, May 19, 2015 at 2:11 PM, Cesar Flores ces...@gmail.com wrote: I would like to ask if there is a way of specifying the column name of a data frame aggregation. For

Re: Word2Vec with billion-word corpora

2015-05-19 Thread Xiangrui Meng
With a vocabulary size of 4M and vector size 400, you need 400 * 4M = 1.6B floats to store the model. That is about 6.4GB. We store the model on the driver node in the current implementation. So I don't think it would work. You might try increasing the minCount to decrease the vocabulary size and reduce the

Hive 1.0 support in Spark

2015-05-19 Thread Kannan Rajah
Does Spark 1.3.1 support Hive 1.0? If not, which version of Spark will start supporting Hive 1.0? -- Kannan

Re: Discretization

2015-05-19 Thread Xiangrui Meng
Thanks for asking! We should improve the documentation. The sample dataset is actually mimicking the MNIST digits dataset, where the values are gray levels (0-255). So by dividing by 16, we want to map it to 16 coarse bins for the gray levels. Actually, there is a bug in the doc, we should convert

Naming an DF aggregated column

2015-05-19 Thread Cesar Flores
I would like to ask if there is a way of specifying the column name of a data frame aggregation. For example, if I do: customerDF.groupBy("state").agg(max($"discount")) the name of my aggregated column will be: MAX('discount) Is there a way of changing the name of that column to something else on

Re: Does Python 2.7 have to be installed on every cluster node?

2015-05-19 Thread Davies Liu
PySpark works with CPython by default, and you can specify which version of Python to use by: PYSPARK_PYTHON=path/to/path bin/spark-submit xxx.py When you do the upgrade, you could install python 2.7 on every machine in the cluster and test it with PYSPARK_PYTHON=python2.7 bin/spark-submit xxx.py

Re: Stratified sampling with DataFrames

2015-05-19 Thread Xiangrui Meng
You need to convert DataFrame to RDD, call sampleByKey, and then apply the schema back to create DataFrame.

val df: DataFrame = ...
val schema = df.schema
val sampledRDD = df.rdd.keyBy(r => r.getAs[Int](0)).sampleByKey(...).values
val sampled = sqlContext.createDataFrame(sampledRDD, schema)

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Cody Koeninger
If you checkpoint, the job will start from the successfully consumed offsets. If you don't checkpoint, by default it will start from the highest available offset, and you will potentially lose data. Is the link I posted, or for that matter the scaladoc, really not clear on that point? The
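The usual pattern for the checkpointing case is StreamingContext.getOrCreate; a trimmed sketch (checkpoint path, batch interval and the placeholder source are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("kafka-recovery-example"), Seconds(10))
      ssc.checkpoint(checkpointDir)
      // Placeholder source; in the Kafka case this is where createDirectStream would go.
      ssc.socketTextStream("localhost", 9999).print()
      ssc
    }

    // On a clean start this builds a new context; after a failure it is
    // reconstructed from the checkpoint, resuming from the stored offsets.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()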

How to set the file size for parquet Part

2015-05-19 Thread Richard Grossman
Hi, I'm using spark 1.3.1 and now I can't set the size of the generated part files for parquet. The size is only 512KB, which is really too small; I need to make them bigger. How can I set this? Thanks

Re: RandomSplit with Spark-ML and Dataframe

2015-05-19 Thread Olivier Girardot
Thank you! On Tue, May 19, 2015 at 21:08, Xiangrui Meng men...@gmail.com wrote: In 1.4, we added RAND as a DataFrame expression, which can be used for random split. Please check the example here: https://github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214.

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-19 Thread Xiangrui Meng
Hey Jaonary, I saw this line in the error message: org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(dataTypes.scala:163) CaseClassStringParser is only used in older versions of Spark to parse schema from JSON. So I suspect that the cluster was running on a old version of Spark

Re: Increase maximum amount of columns for covariance matrix for principal components

2015-05-19 Thread Xiangrui Meng
We use a dense array to store the covariance matrix on the driver node. So its length is limited by the integer range, which is 65536 * 65536 (actually half). -Xiangrui On Wed, May 13, 2015 at 1:57 AM, Sebastian Alfers sebastian.alf...@googlemail.com wrote: Hello, in order to compute a huge

Re: k-means core function for temporal geo data

2015-05-19 Thread Xiangrui Meng
I'm not sure whether k-means would converge with this customized distance measure. You can list (weighted) time as a feature along with coordinates, and then use Euclidean distance. For other supported distance measures, you can check Derrick's package:

Re: spark mllib kmeans

2015-05-19 Thread Xiangrui Meng
Just curious, what distance measure do you need? -Xiangrui On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: take a look at this https://github.com/derrickburns/generalized-kmeans-clustering Best, Jao On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko
