Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread Sourav Mazumder
Hi Ayan, Thanks for your response. In my case the constraint is that I have to use Hive 0.14 for some other use cases. I believe the incompatibility is at the thrift server level (the hiveserver2 which comes with Hive). If I use the Hive 0.13 hiveserver2 on the same node as the Spark master, should that

Re: Spark sql and csv data processing question

2015-05-16 Thread Don Drake
Your parentheses don't look right, as you're embedding the filter in the Row.fromSeq(). Try this: val trainRDD = rawTrainData .filter(!_.isEmpty) .map(rawRow => Row.fromSeq(rawRow.split(","))) .filter(_.length == 15) .map(_.toString).map(_.trim) -Don On Fri,
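A minimal, self-contained sketch of the corrected pipeline (the input path, the variable names, and the 15-column width are assumptions for illustration); it validates the field count before building each Row, which is one reasonable ordering:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.Row

    val sc = new SparkContext(new SparkConf().setAppName("csv-to-rows"))
    val rawTrainData = sc.textFile("hdfs:///path/to/train.csv")  // hypothetical path

    val trainRDD = rawTrainData
      .filter(!_.isEmpty)                        // drop blank lines first
      .map(line => line.split(",").map(_.trim))  // split, then trim each field
      .filter(_.length == 15)                    // keep only complete records
      .map(fields => Row.fromSeq(fields))        // build the Row last

    println(trainRDD.count())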

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
ah, that explains it, many thanks! On Sat, May 16, 2015 at 7:41 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: oh...metastore_db location is not controlled by hive.metastore.warehouse.dir -- one is the location of your metastore DB, the other is the physical location of your stored data.

Problem building master on 2.11

2015-05-16 Thread Fernando O.
Is anyone else having issues when building spark from git? I created a jira ticket with a Docker file that reproduces the issue. The error: /spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type protected Type type() {

Spark SQL is not able to connect to hive metastore

2015-05-16 Thread smazumder
Hi, I'm trying to execute a simple SQL statement from spark-shell: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) - This one executes properly. Next I'm trying - sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") This keeps on trying to connect to the metastore but

IF in SQL statement

2015-05-16 Thread Antony Mayi
Hi, is it expected that I can't reference a column inside an IF statement like this: sctx.sql("SELECT name, IF(ts>0, price, 0) FROM table").collect() I get an error: org.apache.spark.sql.AnalysisException: unresolved operator 'Project [name#0,if ((CAST(ts#1, DoubleType) > CAST(0, DoubleType))) price#2

RE: Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2015-05-16 Thread jaredtims
Any resolution to this? I'm having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-YARN-on-AWS-EMR-Issues-finding-file-on-hdfs-tp10214p22918.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
Gave it another try - it seems that it picks up the variable and prints out the correct value, but it still puts the metastore_db folder in the current directory, regardless. On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor jambo...@gmail.com wrote: Thank you for the reply. I have tried your

Re: store hive metastore on persistent store

2015-05-16 Thread Yana Kadiyska
oh...metastore_db location is not controlled by hive.metastore.warehouse.dir -- one is the location of your metastore DB, the other is the physical location of your stored data. Check out this SO thread: http://stackoverflow.com/questions/13624893/metastore-db-created-wherever-i-run-hive On Sat,
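A hedged sketch of the two settings involved, for a spark-shell session with a HiveContext (the paths are placeholders): hive.metastore.warehouse.dir controls where table data files are written, while the location of the metastore database itself (the metastore_db folder created by the default embedded Derby) is governed by javax.jdo.option.ConnectionURL, which is most reliably set in a hive-site.xml on the classpath rather than at runtime.

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)  // sc is provided by spark-shell

    // Physical location of stored table data
    hiveCtx.setConf("hive.metastore.warehouse.dir", "hdfs:///user/hive/warehouse")

    // Location of the metastore DB itself (embedded Derby). Whether setting this
    // after the metastore has already initialized takes effect can vary, which
    // matches the behaviour Tamas reports above.
    hiveCtx.setConf("javax.jdo.option.ConnectionURL",
      "jdbc:derby:;databaseName=/persistent/path/metastore_db;create=true")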

Re: zip files submitted with --py-files disappear from hdfs after a while on EMR

2015-05-16 Thread jaredtims
Any resolution to this? I am having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-files-submitted-with-py-files-disappear-from-hdfs-after-a-while-on-EMR-tp22342p22919.html Sent from the Apache Spark User List mailing list archive at

Re: IF in SQL statement

2015-05-16 Thread ayan guha
Try like this: select name, case when ts > 0 then price else 0 end from table On Sun, May 17, 2015 at 12:21 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, is it expected that I can't reference a column inside an IF statement like this: sctx.sql("SELECT name, IF(ts>0, price, 0) FROM
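A minimal sketch of this suggestion as it would be run from spark-shell (sctx is the poster's SQL context; the table and column names come from the original question, and the AS alias is added only for readability):

    val result = sctx.sql(
      "SELECT name, CASE WHEN ts > 0 THEN price ELSE 0 END AS price FROM table"
    ).collect()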

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
Hi Try with Hive 0.13. If I am not wrong, Hive 0.14 is not supported yet, definitely not with 1.2.1 :) On Sun, May 17, 2015 at 1:14 AM, smazumder sourav.mazumde...@gmail.com wrote: Hi, I'm trying to execute a simple SQL statement from spark-shell val sqlContext = new

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
Here is from documentation: Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently Spark SQL is based on Hive 0.12.0 and 0.13.1. On Sun, May 17, 2015 at 1:48 AM, ayan guha guha.a...@gmail.com wrote: Hi Try with Hive 0.13. If I am not wrong, Hive 0.14 is

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Nisrina Luthfiyati
Hi Ayan and Helena, I've considered using Cassandra/HBase but ended up opting to save to worker hdfs because I want to take advantage of the data locality since the data will often be loaded to Spark for further processing. I was also under the impression that saving to filesystem (instead of db)

Re: Custom Aggregate Function for DataFrame

2015-05-16 Thread ayan guha
Hi If you asked any DB developer, s/he would tell you the construct: select * from (select userid, time, state, rank() over (partition by userId order by time desc) r from event) where r=1 I am not sure if Dataframe supports it, though I am sure we can extend functions to implement it. But here is one not
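A hedged sketch of that ranking query as it might be issued from Spark SQL (window functions such as rank() need a HiveContext and a sufficiently recent Spark release; the event table with userid, time and state columns is taken from ayan's example):

    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    val latestPerUser = hiveCtx.sql("""
      SELECT userid, time, state FROM (
        SELECT userid, time, state,
               rank() OVER (PARTITION BY userid ORDER BY time DESC) AS r
        FROM event
      ) ranked
      WHERE r = 1
    """)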

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
Thank you for the reply. I have tried your experiment, it seems that it does not print the settings out in spark-shell (I'm using 1.3 by the way). Strangely I have been experimenting with an SQL connection instead, which works after all (still if I go to spark-shell and try to print out the SQL

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Helena Edelson
Consider using Cassandra with Spark Streaming and time series; Cassandra has been doing time series for years. Here are some snippets with Kafka streaming and writing/reading the data back:
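Since the snippets themselves are cut off in this archive, here is a hedged sketch of the pattern with the spark-cassandra-connector (the keyspace, table and sensor-reading schema are assumptions, not taken from Helena's code; stream is a DStream[Reading] obtained from Kafka, and sc is the SparkContext):

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._

    case class Reading(sensorId: String, ts: Long, value: Double)

    // Write each micro-batch into a Cassandra time-series table...
    stream.saveToCassandra("ks", "readings", SomeColumns("sensor_id", "ts", "value"))

    // ...and read one sensor's series back for further processing in Spark.
    val series = sc.cassandraTable[Reading]("ks", "readings")
      .where("sensor_id = ?", "sensor-1")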

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-16 Thread Tathagata Das
For the Spark Streaming app, if you want a particular action inside a foreachRDD to go to a particular pool, then make sure you set the pool within the foreachRDD function. E.g. dstream.foreachRDD { rdd => rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1") // set the pool
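A minimal sketch of that advice, assuming spark.scheduler.mode=FAIR and two pools (pool1, pool2) defined in the fair scheduler allocation file; the dstream names are placeholders:

    dstream1.foreachRDD { rdd =>
      rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
      rdd.count()   // this Spark job is scheduled in pool1
    }

    dstream2.foreachRDD { rdd =>
      rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
      rdd.count()   // this one runs in pool2
    }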

Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-16 Thread Ilya Ganelin
All - this issue showed up when I was tearing down a spark context and creating a new one. Often, I was unable to then write to HDFS due to this error. I subsequently switched to a different implementation where, instead of tearing down and re-initializing the spark context, I'd instead submit a

Re: number of executors

2015-05-16 Thread Ted Yu
What Spark release are you using? Can you check the driver log to see if there is some clue there? Thanks On Sat, May 16, 2015 at 12:01 AM, xiaohe lan zombiexco...@gmail.com wrote: Hi, I have a 5-node YARN cluster. I used spark-submit to submit a simple app. spark-submit --master yarn

Getting the best parameter set back from CrossValidatorModel

2015-05-16 Thread Justin Yip
Hello, I am using MLPipeline. I would like to extract the best parameters found by CrossValidator, but I cannot find much documentation about how to do it. Can anyone give me some pointers? Thanks. Justin -- View this message in context:
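A hedged sketch of one way to get at this (API details vary across early spark.ml releases, and cv and training are placeholders for a configured CrossValidator and a training DataFrame): the fitted CrossValidatorModel exposes the winning model as bestModel, whose parameters can then be inspected.

    val cvModel = cv.fit(training)        // cv: CrossValidator with an estimator,
                                          // evaluator and param grid already set
    val best = cvModel.bestModel          // the model trained with the best params
    println(best.extractParamMap())       // print the parameter values it used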

How can I do pair-wise computation between RDD feature columns?

2015-05-16 Thread yaochunnan
Hi all, Recently I've run into a scenario where I need to conduct two-sample tests between all paired combinations of columns of an RDD, but the networking load and generation of the pair-wise computation is too time-consuming. That has puzzled me for a long time. I want to conduct a Wilcoxon rank-sum test
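A hedged sketch of one way to set up the pairwise work in spark-shell (the input path and column layout are assumptions): turn a row-oriented RDD[Array[Double]] into one record per column, pair columns with cartesian(), and apply a test statistic to each pair. The placeholder statistic below (difference of means) stands in for the actual Wilcoxon rank-sum test, and for many columns the cartesian product is exactly the cost being asked about, so this is only a starting point.

    import org.apache.spark.rdd.RDD

    // placeholder statistic standing in for the real rank-sum test
    def testStat(a: Seq[Double], b: Seq[Double]): Double =
      a.sum / a.size - b.sum / b.size

    val data: RDD[Array[Double]] = sc.textFile("hdfs:///path/features.csv")
      .map(_.split(",").map(_.toDouble))

    // one record per column: (columnIndex, all values in that column)
    val columns: RDD[(Int, Seq[Double])] = data
      .flatMap(row => row.zipWithIndex.map { case (v, j) => (j, v) })
      .groupByKey()
      .mapValues(_.toSeq)

    // every unordered column pair exactly once, scored with the test statistic
    val pairScores = columns.cartesian(columns)
      .filter { case ((i, _), (j, _)) => i < j }
      .map { case ((i, a), (j, b)) => ((i, j), testStat(a, b)) }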