Spark schema evolution

2016-03-22 Thread gtinside
Hi, I have a table sourced from 2 parquet files, with a few extra columns in one of the parquet files. Simple queries work fine, but queries with a predicate on an extra column don't work and I get "column not found". Column resp_party_type exists in just one parquet file. a) Query that works:
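
A minimal sketch of the usual fix, assuming Spark 1.5+ and an illustrative path and predicate value: enable parquet schema merging at read time so a column present in only one file is still part of the table's schema and usable in a predicate.

    // Hedged sketch: merge schemas across parquet files at read time.
    // Path and predicate value are illustrative.
    val df = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/data/my_table")
    // resp_party_type exists in only one file; rows from the other
    // file simply carry null for it.
    df.filter("resp_party_type = 'BROKER'").show()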

Re: Spark SQL Optimization

2016-03-21 Thread gtinside
More details: execution plan for the original query: select distinct pge.portfolio_code from table1 pge join table2 p on p.perm_group = pge.anc_port_group join table3 uge on p.user_group=uge.anc_user_group where uge.user_name = 'user' and p.perm_type = 'TEST' == Physical Plan ==
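
A small sketch of how such a plan is produced (query text taken from the thread); explain(true) prints both the logical and physical plans so the two orderings can be compared:

    // Print logical + physical plans for the original query.
    val original = sqlContext.sql(
      """select distinct pge.portfolio_code
        |from table1 pge
        |join table2 p on p.perm_group = pge.anc_port_group
        |join table3 uge on p.user_group = uge.anc_user_group
        |where uge.user_name = 'user' and p.perm_type = 'TEST'""".stripMargin)
    original.explain(true)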

Spark SQL Optimization

2016-03-21 Thread gtinside
Hi, I am trying to execute a simple query with a join on 3 tables. When I look at the execution plan, it varies with the position of the tables in the "from" clause. The plan looks more optimized when the table with predicates is placed before any other table. Original query:
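
A sketch of the repositioning described above, reusing the join keys from the query quoted in the reply: the predicate-bearing tables (table3 and table2) are listed first, so the filters are applied before the join against table1.

    // Same query with the predicate-bearing tables moved to the
    // front of the "from" clause (table/column names from the thread).
    val reordered = sqlContext.sql(
      """select distinct pge.portfolio_code
        |from table3 uge
        |join table2 p on p.user_group = uge.anc_user_group
        |join table1 pge on p.perm_group = pge.anc_port_group
        |where uge.user_name = 'user' and p.perm_type = 'TEST'""".stripMargin)
    reordered.explain(true)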

Error while saving parquet

2015-09-22 Thread gtinside
Please refer to the code snippet below. I get the following error: /tmp/temp/trade.parquet/part-r-00036.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [20, -28, -93, 93] at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418) at
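
The error means the file's trailing bytes are not the parquet magic "PAR1" ([80, 65, 82, 49]), i.e. the footer is missing because the file was truncated or overwritten mid-write. A standalone sketch (not from the thread) that checks a local file for the magic:

    import java.io.RandomAccessFile

    // Sketch (not from the thread): check whether a local file ends
    // with the parquet magic "PAR1" ([80, 65, 82, 49]).
    def hasParquetMagic(path: String): Boolean = {
      val raf = new RandomAccessFile(path, "r")
      try {
        val tail = new Array[Byte](4)
        raf.length >= 4 && {
          raf.seek(raf.length - 4)
          raf.readFully(tail)
          tail.sameElements("PAR1".getBytes("US-ASCII"))
        }
      } finally raf.close()
    }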

Group by specific key and save as parquet

2015-08-31 Thread gtinside
Hi, I have a set of data that I need to group by a specific key and then save as parquet. Refer to the code snippet below. I am querying trade and then grouping by date: val df = sqlContext.sql("SELECT * FROM trade") val dfSchema = df.schema val partitionKeyIndex =
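
One way to express group-by-a-key-then-save directly, assuming Spark 1.4+ and that the key is a column named date (illustrative), is write.partitionBy, which produces one parquet directory per key value:

    // Hedged sketch: one output directory per distinct "date" value,
    // e.g. /tmp/trade_by_date/date=2015-08-31/part-*.parquet
    val df = sqlContext.sql("SELECT * FROM trade")
    df.write.partitionBy("date").parquet("/tmp/trade_by_date")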

Spark SQL 1.3 max operation giving wrong results

2015-03-13 Thread gtinside
Hi, I am playing around with Spark SQL 1.3 and noticed that the max function does not give the correct result, i.e. it doesn't return the maximum value. The same query works fine in Spark SQL 1.2. Is anyone aware of this issue? Regards, Gaurav
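
One possible cause, given the integer-as-string thread below: if the column was inferred as a string, MAX compares lexicographically ("9" > "10"). A hedged workaround sketch, with illustrative table and column names:

    // Force numeric comparison by casting before aggregating.
    // "trade" and "quantity" are illustrative names.
    val result = sqlContext.sql("SELECT MAX(CAST(quantity AS INT)) FROM trade")
    result.show()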

Re: Integer column in schema RDD from parquet being considered as string

2015-03-06 Thread gtinside
Hi tsingfu, thanks for your reply. I tried with other columns, but the problem is the same with other Integer columns. Regards, Gaurav

Integer column in schema RDD from parquet being considered as string

2015-03-04 Thread gtinside
Hi, I am converting a jsonRDD to parquet by saving it as a parquet file (saveAsParquetFile): cacheContext.jsonFile("file:///u1/sample.json").saveAsParquetFile("sample.parquet") I am reading the parquet file and registering it as a table: val parquet = cacheContext.parquetFile("sample_trades.parquet")
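
A sketch of one way around inference trouble, assuming Spark 1.3-era APIs and illustrative field names: pass an explicit schema to jsonFile so numeric fields cannot come back as strings, then verify with printSchema before writing the parquet file.

    import org.apache.spark.sql.types._

    // Explicit schema; field names are illustrative.
    val schema = StructType(Seq(
      StructField("trade_id", StringType),
      StructField("quantity", IntegerType)))
    val json = cacheContext.jsonFile("file:///u1/sample.json", schema)
    json.printSchema()  // confirm quantity is integer, not string
    json.saveAsParquetFile("sample.parquet")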

NullPointerException in TaskSetManager

2015-02-26 Thread gtinside
Hi, I am trying to run a simple Hadoop job (that uses CassandraHadoopInputOutputWriter) on Spark (v1.2, Hadoop v1.x) but am getting a NullPointerException in TaskSetManager: WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost task 14.2 in stage 0.0 (TID 29,

Thriftserver Beeline

2015-02-18 Thread gtinside
Hi, I created some Hive tables and am trying to list them through Beeline, but I am not getting any results. I can list the tables through spark-sql. When I connect Beeline, it starts up with the following message: Connecting to jdbc:hive2://tst001:10001 Enter username for jdbc:hive2://tst001:10001:

SPARK_LOCAL_DIRS and SPARK_WORKER_DIR

2015-02-11 Thread gtinside
Hi, what is the difference between SPARK_LOCAL_DIRS and SPARK_WORKER_DIR? Also, does Spark clean these up after execution? Regards, Gaurav

Re: Web Service + Spark

2015-01-09 Thread gtinside
You can also look at Spark Job Server: https://github.com/spark-jobserver/spark-jobserver - Gaurav On Jan 9, 2015, at 10:25 PM, Corey Nolet cjno...@gmail.com wrote: Cui Lin, the solution largely depends on how you want your services deployed (Java web container, Spray framework, etc.)

Data Locality

2015-01-06 Thread gtinside
Does Spark guarantee to push the processing to the data? Before creating tasks, does Spark always check for data location? For example, if I have 3 Spark nodes (Node1, Node2, Node3) and the data is local to just 2 nodes (Node1 and Node2), will Spark always schedule tasks on the node for which the
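
Short answer, hedged into a sketch: Spark prefers locality but does not guarantee it. The scheduler waits spark.locality.wait (default 3s) for a slot at each locality level before falling back, so a task can run on Node3 with the data fetched remotely. The wait is tunable:

    import org.apache.spark.SparkConf

    // Make the scheduler wait longer for a node-local slot before
    // degrading to rack-local / any (value is illustrative).
    val conf = new SparkConf()
      .setAppName("locality-demo")
      .set("spark.locality.wait", "10s")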

unread block data when reading from NFS

2014-12-13 Thread gtinside
Hi, I am trying to read a CSV file in the following way: val csvData = sc.textFile("file:///tmp/sample.csv") csvData.collect().length This works fine in spark-shell, but when I do spark-submit of the jar, I get the following exception: java.lang.IllegalStateException: unread block data

Integrating Spark with other applications

2014-11-07 Thread gtinside
Hi, I have been working with Spark SQL and want to expose this functionality to other applications. The idea is to let other applications send SQL to be executed on the Spark cluster and get the result back. I looked at Spark Job Server (https://github.com/ooyala/spark-jobserver) but it provides a

Spark SQL on XML files

2014-10-18 Thread gtinside
Hi, I have a bunch of XML files and I want to run Spark SQL on them; is there a recommended approach? I am thinking of converting the XML to JSON and then using jsonRDD. Please let me know your thoughts. Regards, Gaurav
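
A minimal sketch of that XML-to-JSON route, assuming one small XML document per file and illustrative element names; the JSON strings are handed to jsonRDD for schema inference:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // One XML document per file; element names are illustrative.
    val jsonStrings = sc.wholeTextFiles("file:///tmp/xml/*.xml").map {
      case (_, content) =>
        val doc = scala.xml.XML.loadString(content)
        s"""{"id": "${(doc \ "id").text}", "name": "${(doc \ "name").text}"}"""
    }
    val table = sqlContext.jsonRDD(jsonStrings)
    table.registerTempTable("xml_data")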

Spark SQL CLI

2014-09-22 Thread gtinside
Hi, I have been using spark-shell to execute all SQL. I am connecting to Cassandra, converting the data to JSON, and then running queries on it. I am using HiveContext (and not SQLContext) because of the explode functionality in it. I want to see how I can use the Spark SQL CLI for directly running

Re: flattening a list in spark sql

2014-09-10 Thread gtinside
Hi, thanks, it worked; I really appreciate your help. I have also been trying to do multiple lateral views, but it doesn't seem to be working. Query: hiveContext.sql("Select t2 from fav LATERAL VIEW explode(TABS) tabs1 as t1 LATERAL VIEW explode(t1) tabs2 as t2") Exception
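
For reference, the working single-level query next to the failing chained form, both reconstructed from the thread (quotes restored):

    // Single explode: worked.
    hiveContext.sql(
      "SELECT t1 FROM fav LATERAL VIEW explode(TABS) tabs1 AS t1")
    // Chained explode over the nested array: threw an exception
    // at the time of the thread.
    hiveContext.sql(
      "SELECT t2 FROM fav LATERAL VIEW explode(TABS) tabs1 AS t1 " +
      "LATERAL VIEW explode(t1) tabs2 AS t2")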

Re: Cassandra connector

2014-09-10 Thread gtinside
Are you using Spark 1.1? If yes, you would have to update the DataStax Cassandra connector code and remove references to log methods from CassandraConnector.scala. Regards, Gaurav

Re: flattening a list in spark sql

2014-09-10 Thread gtinside
My bad, please ignore, it works!!!

Spark SQL on Cassandra

2014-09-08 Thread gtinside
Hi, I am reading data from Cassandra through the DataStax spark-cassandra connector, converting it into JSON, and then running Spark SQL on it. Refer to the code snippet below: step 1: val o_rdd = sc.cassandraTable[CassandraRDDWrapper]("keyspace", "column_family") step 2: val tempObjectRDD =
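
The preview cuts off at step 2; a hedged sketch of how such a pipeline typically continues (toJson is a hypothetical serializer, not from the thread):

    // step 1 (from the thread): read rows via the connector.
    val o_rdd = sc.cassandraTable[CassandraRDDWrapper]("keyspace", "column_family")
    // step 2 (assumed continuation): serialize each row to a JSON string.
    val tempObjectRDD = o_rdd.map(row => toJson(row))  // toJson: hypothetical
    // step 3 (assumed): infer a schema and register for Spark SQL.
    val table = sqlContext.jsonRDD(tempObjectRDD)
    table.registerTempTable("column_family")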

Re: flattening a list in spark sql

2014-09-02 Thread gtinside
Thanks. I am not using a HiveContext. I am loading data from Cassandra, converting it into JSON, and then querying it through a SQLContext. Can I use a HiveContext to query a jsonRDD? Gaurav
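
On the question itself: HiveContext extends SQLContext, so jsonRDD is available on it too. A minimal sketch, assuming an RDD[String] of JSON documents named jsonStrings:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // jsonStrings: RDD[String] of JSON documents (assumed to exist).
    val table = hiveContext.jsonRDD(jsonStrings)
    table.registerTempTable("fav")
    // Hive-only syntax such as LATERAL VIEW explode(...) now works.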