Official Docker container for Spark

2015-05-21 Thread tridib
Tridib

nested collection object query

2015-09-28 Thread tridib
Hi friends, what is the right syntax to query a collection of nested objects? I have the following schema and SQL, but the query returns nothing. Is the syntax correct?
root
 |-- id: string (nullable = false)
 |-- employee: array (nullable = false)
 |    |-- element: struct (containsNull = true)

Writing UDF with variable number of arguments

2015-10-05 Thread tridib
Hi friends, I want to write a UDF which takes a variable number of arguments of varying types: myudf(String key1, String value1, String key2, int value2, ...). What is the best way to do it in Spark? Thanks Tridib
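A minimal sketch of one common workaround, since Scala UDF registration is fixed-arity: pass the key/value pairs as a single array column and let the UDF receive a Seq, casting the mixed types to string at the call site. Assumes a Spark 1.3-era sqlContext; the UDF name, column names, and table t are hypothetical.

    // Sketch only: variable arguments smuggled in via an array column.
    sqlContext.udf.register("concatKv", (args: Seq[String]) =>
      args.grouped(2).map(p => p(0) + "=" + p(1)).mkString(","))

    sqlContext.sql(
      "SELECT concatKv(array(key1, value1, key2, cast(value2 AS string))) FROM t")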

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-21 Thread tridib
Did you get any solution to this? I am hitting the same issue.

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread tridib
By skewed, did you mean the data is not distributed uniformly across partitions? All of my columns are strings of almost the same size, i.e.:
id1,field11,field12
id2,field21,field22

How to control spark.sql.shuffle.partitions per query

2015-09-23 Thread tridib
Tridib
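For the record, the pattern that resolves this (assuming a 1.x SQLContext, where the setting is read at query-planning time) is to issue a SET statement immediately before the query that needs it. Table names below are hypothetical.

    sqlContext.sql("SET spark.sql.shuffle.partitions=2000")
    val joined = sqlContext.sql("SELECT a.id FROM a JOIN b ON a.id = b.id") // planned with 2000 partitions
    sqlContext.sql("SET spark.sql.shuffle.partitions=200")                  // restore the default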

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-23 Thread tridib
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to join two 1-billion-row tables in 3 minutes.

RE: spark sql: join sql fails after sqlCtx.cacheTable()

2015-02-25 Thread tridib
Using HiveContext solved it.

group by order by fails

2015-02-25 Thread tridib
Hi, I need to find the top 10 best-selling samples, so the query looks like:
select s.name, count(s.name) from sample s group by s.name order by count(s.name)
This query fails with the following error:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree: Sort [COUNT(name#0) ASC], true

Running spark function on parquet without sql

2015-02-26 Thread tridib
Thanks & Regards Tridib

Re: Running spark function on parquet without sql

2015-02-27 Thread tridib
Somehow my posts are not getting accepted, and replies are not visible here, but I got the following reply from Zhan. From Zhan Zhang's reply: yes, I still get Parquet's advantage. My next question is, if I operate on the SchemaRDD directly, will I get the advantage of Spark SQL's in-memory columnar store when I cache it?
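A hedged note for later readers: in the 1.x line the compressed columnar format is what cacheTable uses, so caching by registered table name (rather than a bare rdd.cache() in the earliest releases) is how you get it. A sketch, assuming a table registered as "events":

    sqlContext.cacheTable("events")     // stored in the in-memory columnar format
    val n = sqlContext.sql("SELECT count(*) FROM events").collect()
    sqlContext.uncacheTable("events")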

spark sql median and standard deviation

2015-03-04 Thread tridib
Hello, is there a built-in function for getting the median and standard deviation in Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling doubleRDD.stats(), but that still does not give the median. What is the most efficient way to get the median? Thanks & Regards Tridib
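There was no built-in median in that era; stddev falls out of DoubleRDDFunctions.stats(), while median needs a sort. A sketch under those assumptions (schemaRdd and the double column at index 0 are hypothetical; Spark 1.x implicit imports):

    import org.apache.spark.SparkContext._    // pair/double RDD implicits in 1.x

    val nums  = schemaRdd.map(_.getDouble(0)) // schemaRdd assumed to exist
    val stdev = nums.stats().stdev            // count/mean/stdev/min/max in one pass

    // Median via a full sort; expensive on large data, shown only as a sketch.
    val sorted = nums.sortBy(identity).zipWithIndex().map(_.swap)
    val n = sorted.count()
    val median =
      if (n % 2 == 1) sorted.lookup(n / 2).head
      else (sorted.lookup(n / 2 - 1).head + sorted.lookup(n / 2).head) / 2.0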

Re: HBase HTable constructor hangs

2015-04-28 Thread tridib
I am having exactly the same issue. I am running HBase and Spark in Docker containers.

spark sql: timestamp in json - fails

2014-10-20 Thread tridib
...JavaSchemaRDD test = sqlCtx.jsonFile(path, createStructType());
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1); }
The input file has a single record: {"timestamp":"2014-10-10T01:01:01"} Thanks
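For reference, the Scala shape of pinning the schema explicitly (1.3-era types package; on 1.1, TimestampType in the JSON reader is exactly what the MatchError in the reply below is about):

    import org.apache.spark.sql.types._   // 1.3+; earlier releases kept these under catalyst

    val schema = StructType(Seq(StructField("timestamp", TimestampType, nullable = true)))
    val test = sqlContext.jsonFile(path, schema)   // path assumed defined
    test.registerTempTable("test")
    sqlContext.sql("SELECT * FROM test").collect()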

Spark SQL: sqlContext.jsonFile date type detection and performance

2014-10-20 Thread tridib
...Function and creating a SchemaRDD from the parsed JavaRDD. Is there any performance impact in not using the built-in jsonFile()? Thanks Tridib

Re: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Stack trace for my second case:
2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0] executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$)
at org.ap...

RE: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Spark 1.1.0

spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-20 Thread tridib
Hello experts, I have two tables built using jsonFile(). I can successfully run join queries on these tables, but once I cacheTable() them, every join query fails. Here is the stack trace:
java.lang.NullPointerException
at org.apache.spark.sql.columnar.InMemoryRelation.statistics$lzycompute(InMemoryCol...

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val personPath = "/hdd/spark/person.json"
val person = sqlContext.jsonFile(personPath)
person.printSchema()
person.registerTempTable("person")
val addressPath = "/hdd/spark/address.json"
val address = sqlContext.jsonFile(addressPath)
address...

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Hmm... I thought HiveContext would only work if Hive is present. I am curious to know when to use HiveContext and when to use SQLContext. Thanks & Regards Tridib
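A hedged note for anyone else wondering: HiveContext does not need a Hive installation. It embeds HiveQL support and creates a local Derby metastore (metastore_db) on first use, which is why the 1.x docs generally recommend it over the basic SQLContext.

    // No external Hive needed: a local metastore_db is created on first use.
    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveCtx.sql("SHOW TABLES").collect()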

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread tridib
Thanks for pointing that out.

spark sql: sqlContext.jsonFile date type detection and performance

2014-10-21 Thread tridib
Any help or comments?

Re: spark sql: sqlContext.jsonFile date type detection and performance

2014-10-21 Thread tridib
Yes, I am unable to get jsonFile() to detect the date type automatically from the JSON data.

hive timestamp column always returns null

2014-10-22 Thread tridib
...following data:
2014-12-11 00:00:00
2013-11-11T00:00:00
2012-11-11T00:00:00Z
When I query using "select * from date_test" it returns:
NULL
NULL
NULL
Could you please help me resolve this issue? Thanks Tridib
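A likely cause, offered as a guess: Hive returns NULL rather than erroring when a value cannot be read as the declared TIMESTAMP type, and its parser expects the 'yyyy-MM-dd HH:mm:ss' layout, so ISO-8601 variants with 'T'/'Z' will not read back. One workaround sketch, keeping the raw value as a STRING column and normalizing at query time (hiveContext, table, and column names are hypothetical):

    hiveContext.sql("""
      SELECT cast(regexp_replace(regexp_replace(raw_ts, 'Z$', ''), 'T', ' ') AS timestamp)
      FROM date_test_raw""")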

submit query to spark cluster using spark-sql

2014-10-23 Thread tridib
I want to submit queries to the Spark cluster using spark-sql. I am using the Hive metastore, which is in HDFS. But when I run a query, it does not look like it gets submitted to the Spark cluster; I don't see any entry in the master web UI. How can I confirm the behavior?

Re: submit query to spark cluster using spark-sql

2014-10-23 Thread tridib
Figured it out: spark-sql --master spark://sparkmaster:7077

spark sql create nested schema

2014-11-04 Thread tridib
... (nullable = true)
 |    |-- State: string (nullable = true)
 |    |-- Hobby: string (nullable = true)
 |    |-- Zip: string (nullable = true)
How do I create a StructField of StructType? I think that's what the "root" is. Thanks & Regards Tridib

StructField of StructType

2014-11-04 Thread tridib
How do I create a StructField of StructType? I need to create a nested schema.
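A nested field is just a StructField whose dataType is itself a StructType; "root" is the outer StructType. A sketch with the field names from the previous post (1.3-era imports; the wrapper name Address is hypothetical):

    import org.apache.spark.sql.types._

    val address = StructType(Seq(
      StructField("City",  StringType, nullable = true),
      StructField("State", StringType, nullable = true),
      StructField("Hobby", StringType, nullable = true),
      StructField("Zip",   StringType, nullable = true)))

    val root = StructType(Seq(
      StructField("id",      StringType, nullable = false),
      StructField("Address", address,    nullable = true)))  // a StructField of StructType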

Unable to use HiveContext in spark-shell

2014-11-05 Thread tridib
A signature in HiveContext.class refers to term conf in value org.apache.hadoop.hive which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveContext.class. That entry seems to have slain the compiler.

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Help please!

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
Yes, I have the org.apache.hadoop.hive package in the Spark assembly.

Re: Unable to use HiveContext in spark-shell

2014-11-06 Thread tridib
I built spark-1.1.0 on a fresh machine and this issue is gone! Thank you all for your help. Thanks & Regards Tridib

spark sql - save to Parquet file - Unsupported datatype TimestampType

2014-11-11 Thread tridib
...SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
at org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)
Thanks & Regards Tridib
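A workaround sketch for the 1.1-era Parquet writer, which had no TimestampType mapping: project the timestamp to a supported type before saving and cast back on read. Table, column names, and path are hypothetical.

    // Persist the timestamp as a string (an epoch long also works), then save.
    val flattened = sqlContext.sql("SELECT id, cast(ts AS string) AS ts FROM events")
    flattened.saveAsParquetFile("/hdd/spark/events.parquet")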

sum/avg group by specified ranges

2014-11-18 Thread tridib
...of Spark SQL on top of a Parquet file. Any suggestion? Thanks & Regards Tridib
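In Spark SQL of that vintage, range bucketing is usually a CASE expression repeated in the GROUP BY (aliases were not reliably usable there). A sketch; column and table names are hypothetical:

    sqlContext.sql("""
      SELECT CASE WHEN amount < 100  THEN '0-99'
                  WHEN amount < 1000 THEN '100-999'
                  ELSE '1000+' END AS bucket,
             SUM(amount) AS total, AVG(amount) AS mean
      FROM sales
      GROUP BY CASE WHEN amount < 100  THEN '0-99'
                    WHEN amount < 1000 THEN '100-999'
                    ELSE '1000+' END""")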

allocating different memory to different executor for same application

2014-11-21 Thread tridib
Hello experts, I have 5 worker machines with different amounts of RAM. Is there a way to configure them with different executor memory? Currently I see that every worker spins up one executor with the same amount of memory. Thanks & Regards Tridib
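A hedged note, assuming standalone mode: executor memory (spark.executor.memory) is uniform per application, so heterogeneous boxes are usually handled by capping each worker in its own conf/spark-env.sh and keeping the application's executor size no larger than the smallest worker. Values below are illustrative.

    # conf/spark-env.sh on a 32 GB machine
    SPARK_WORKER_MEMORY=28g
    # conf/spark-env.sh on an 8 GB machine
    SPARK_WORKER_MEMORY=6g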

spark-sql broken

2014-11-21 Thread tridib
...-Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests
Is there anything I am missing? Thanks Tridib

Control number of parquet generated from JavaSchemaRDD

2014-11-24 Thread tridib
...sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
No luck. Is there a way to control the size/number of Parquet files generated? Thanks Tridib

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
I am experimenting with two files and trying to generate one Parquet file.
public class CompactParquetGenerator implements Serializable {
  public void generateParquet(JavaSparkContext sc, String jsonFilePath, String parquetPath) {
    //int MB_128 = 128*1024*1024;
    //sc.hadoopConfigur...

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
public void generateParquet(JavaSparkContext sc, String jsonFilePath, String parquetPath) {
  //int MB_128 = 128*1024*1024;
  //sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
  //sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
  JavaSQLContext s...

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Ohh... how could I miss that. :( Thanks!

Re: Control number of parquet generated from JavaSchemaRDD

2014-11-25 Thread tridib
Thanks Michael, it worked like a charm! I have a few more questions:
1. Is there a way to control the size of a Parquet file?
2. Which method do you recommend: coalesce(n, true), coalesce(n, false), or repartition(n)?
Thanks & Regards Tridib
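For later readers, the shape of the fix, plus what the core API says about question 2: repartition(n) is simply coalesce(n, shuffle = true), so plain coalesce(n) is the cheaper choice when only shrinking the partition count. A sketch with hypothetical paths:

    val rdd = sqlContext.jsonFile(jsonFilePath)      // jsonFilePath assumed defined
    rdd.coalesce(1).saveAsParquetFile(parquetPath)   // one partition => one part-file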

RE: Official Docker container for Spark

2015-05-29 Thread Tridib Samanta
Thanks all for your replies. I was evaluating which one fits best for me. I picked epahomov/docker-spark from the Docker registry, and it suffices my need. Thanks Tridib

RE: nested collection object query

2015-09-28 Thread Tridib Samanta
Thanks for your response, Yong! The array syntax works fine, but I am not sure how to use explode. Should I use it as follows?
select id from department where explode(employee).name = 'employee0'
This query gives me java.lang.UnsupportedOperationException. I am using HiveContext.

RE: nested collection object query

2015-09-28 Thread Tridib Samanta
Well, I figured out a way to use explode, but it returns two rows if there are two matches in the nested array of objects:
select id from department LATERAL VIEW explode(employee) dummy_table as emp where emp.name = 'employee0'
I was looking for an operator that loops through the array and returns true if...
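One way to collapse those duplicates back to one row per id, as a sketch in the same syntax (hiveContext assumed to be a HiveContext):

    hiveContext.sql("""
      SELECT DISTINCT id
      FROM department
      LATERAL VIEW explode(employee) e AS emp
      WHERE emp.name = 'employee0'""")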

RE: group by order by fails

2015-02-25 Thread Tridib Samanta
Actually I just realized, I am using 1.2.0. Thanks Tridib
On Thu, 26 Feb 2015, ak...@sigmoidanalytics.com wrote: Which version of Spark are you using? It seems there was a similar Jira...

RE: group by order by fails

2015-02-27 Thread Tridib Samanta
...an alias to the count in the select clause and use that alias in the order by clause. On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta wrote: Actually I just realized, I am using 1.2.0...
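Spelled out against the query from this thread, the suggested workaround looks like this (a sketch; DESC and LIMIT added for the stated top-10 goal):

    sqlContext.sql("""
      SELECT s.name, COUNT(s.name) AS cnt
      FROM sample s
      GROUP BY s.name
      ORDER BY cnt DESC
      LIMIT 10""")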

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
...thread with this subject. Cheers. On Tue, Apr 28, 2015 at 7:12 PM, tridib wrote: I am having exactly the same issue. I am running HBase and Spark in Docker containers.

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
When I run the spark-job jar standalone and execute the HBase client from a main method, it works fine. The same client is unable to connect (hangs) when the jar is distributed in Spark. Thanks Tridib

RE: HBase HTable constructor hangs

2015-04-29 Thread Tridib Samanta
...(HConnectionManager.java:1054)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
Thanks Tridib

RE: HBase HTable constructor hangs

2015-04-30 Thread Tridib Samanta
...the HBase release you're using has the following fix? HBASE-8 ("non environment variable solution for IllegalAccessError"). Cheers. On Tue, Apr 28, 2015 at 10:47 PM, Tridib Samanta wrote: I turned on TRACE and I see a lot of the following exception: java.lang.IllegalAccessEr...

RE: spark sql: join sql fails after sqlCtx.cacheTable()

2014-11-06 Thread Tridib Samanta
...org.apache.hadoop.hive which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveContext.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run...

RE: Unable to use HiveContext in spark-shell

2014-11-06 Thread Tridib Samanta
...available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveContext.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last...

sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
...at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks & Regards Tridib