Hi Friends,
I want to write a UDF which takes a variable number of arguments with varying
types, e.g.:
myudf(String key1, String value1, String key2, int value2)
What is the best way to do it in Spark?
Thanks
Tridib
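One possible approach, sketched as an assumption on my part rather than from this thread: Spark UDFs have a fixed arity, so the usual workaround is to fold the variable key/value pairs into a single array (or map) and register a one-argument UDF. The sketch below assumes Spark 1.3+ (sqlContext.udf.register) with a HiveContext for the array() function; the table "events" and its columns are made up:

// hypothetical example: pair up the keys and values; replace with real logic
sqlContext.udf.register("myudf", (args: Seq[String]) =>
  args.grouped(2).map { case Seq(k, v) => s"$k=$v" }.mkString(","))

sqlContext.sql(
  "SELECT myudf(array('key1', value1, 'key2', CAST(value2 AS STRING))) FROM events").show()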
Well, I figured out a way to use explode. But it returns two rows if there are two
matches in the nested array objects.
select id from department LATERAL VIEW explode(employee) dummy_table as emp
where emp.name = 'employee0'
I was looking for an operator that loops through the array and returns true if any element matches.
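One way to collapse those duplicate rows (my own suggestion, not from this thread) is DISTINCT on top of the same LATERAL VIEW, assuming a HiveContext named sqlContext and the registered table department:

sqlContext.sql("""
  SELECT DISTINCT id
  FROM department LATERAL VIEW explode(employee) dummy_table AS emp
  WHERE emp.name = 'employee0'
""").collect().foreach(println)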
Thanks for your response, Yong! The array syntax works fine, but I am not sure how to
use explode. Should I use it as follows?
select id from department where explode(employee).name = 'employee0'
This query gives me java.lang.UnsupportedOperationException. I am using
HiveContext.
From:
Hi Friends,
What is the right syntax to query on a collection of nested objects? I have the
following schema and SQL, but it does not return anything. Is the syntax
correct?
root
|-- id: string (nullable = false)
|-- employee: array (nullable = false)
|    |-- element: struct (containsNull = true)
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to
join two 1-billion-row tables in 3 minutes.
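A minimal sketch of how that setting can be applied per SQLContext before running the join (my own illustration; the table names are placeholders):

sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
sqlContext.sql("SELECT a.id FROM big_a a JOIN big_b b ON a.id = b.id").count()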
Tridib
By skewed, did you mean it's not distributed uniformly across partitions?
All of my columns are strings and almost of the same size, i.e.
id1,field11,fields12
id2,field21,field22
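One way to check the per-partition distribution (my own sketch, not from the thread; assumes Spark 1.3 and a registered table, here called "bigTable" as a placeholder) is to count rows per partition:

val counts = sqlContext.table("bigTable").rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
counts.sortBy(_._2).foreach { case (idx, n) => println(s"partition $idx: $n rows") }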
Did you get any solution to this? I am getting the same issue.
Thanks all for your replies. I was evaluating which one fits best for me. I
picked epahomov/docker-spark from the Docker registry and it sufficed my needs.
Thanks
Tridib
Date: Fri, 22 May 2015 14:15:42 +0530
Subject: Re: Official Docker container for Spark
From: riteshoneinamill...@gmail.com
To: 917361
Tridib
Does the HBase release you're using have the following fix?
HBASE-8 non environment variable solution for IllegalAccessError
Cheers
On Tue, Apr 28, 2015 at 10:47 PM, Tridib Samanta tridib.sama...@live.com
wrote:
I turned on TRACE and I see a lot of the following exception:
(HConnectionManager.java:1054)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
Thanks
Tridib
From: d
I am having exactly the same issue. I am running HBase and Spark in Docker
containers.
On Tue, Apr 28, 2015 at 7:12 PM, tridib tridib.sama...@live.com wrote:
I am having exactly the same issue. I am running HBase and Spark in Docker
containers.
If I run the spark-job jar standalone and
execute the HBase client from a main method, it works fine. The same client is unable
to connect (it hangs) when the jar is distributed in Spark.
Thanks
Tridib
Date: Tue, 28 Apr 2015 21:25:41 -0700
Subject: Re: HBase HTable constructor hangs
From: yuzhih...@gmail.com
Hello,
Is there an in-built function for getting the median and standard deviation in
Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling
doubleRDD.stats(), but it still does not have the median.
What is the most efficient way to get the median?
Thanks Regards
Tridib
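A rough sketch of one way to do it (my own illustration, not from the thread): take the standard deviation from stats() and compute an exact median by sorting the values and looking up the middle element. The toy data stands in for the real column RDD:

import org.apache.spark.SparkContext._

// e.g. values could come from schemaRdd.map(_.getDouble(0))
val values = sc.parallelize(Seq(1.0, 3.0, 2.0, 8.0, 5.0))
val stdev = values.stats().stdev

val n = values.count()
val indexed = values.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
val median =
  if (n % 2 == 1) indexed.lookup(n / 2).head
  else (indexed.lookup(n / 2 - 1).head + indexed.lookup(n / 2).head) / 2.0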
Somehow my posts are not getting accepted, and replies are not visible here.
But I got the following reply from Zhan.
From Zhan Zhang's reply, yes, I still get Parquet's advantages.
My next question is: if I operate on a SchemaRDD, will I get the advantage of
Spark SQL's in-memory columnar store?
Add an alias to the count in the select clause and use that alias in the
order by clause.
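A minimal sketch of that workaround (my own illustration) applied to the failing query quoted below, assuming the table "sample" is registered:

sqlContext.sql("""
  SELECT s.name, COUNT(s.name) AS cnt
  FROM sample s
  GROUP BY s.name
  ORDER BY cnt DESC
  LIMIT 10
""").collect().foreach(println)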
On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta tridib.sama...@live.com
wrote:
Actually, I just realized I am using 1.2.0.
Thanks
Tridib
Date: Thu, 26 Feb 2015 12:37:06 +0530
Subject: Re: group by order
Regards
Tridib
Using HiveContext solved it.
Actually, I just realized I am using 1.2.0.
Thanks
Tridib
Date: Thu, 26 Feb 2015 12:37:06 +0530
Subject: Re: group by order by fails
From: ak...@sigmoidanalytics.com
To: tridib.sama...@live.com
CC: user@spark.apache.org
Which version of Spark are you running? It seems there was a similar JIRA
Hi,
I need to find the top 10 best-selling samples, so the query looks like:
select s.name, count(s.name) from sample s group by s.name order by
count(s.name)
This query fails with the following error:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [COUNT(name#0) ASC], true
I am experimenting with two files and trying to generate one Parquet file.
import java.io.Serializable;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

public class CompactParquetGenerator implements Serializable {
    public void generateParquet(JavaSparkContext sc, String jsonFilePath, String parquetPath) {
        // int MB_128 = 128 * 1024 * 1024;
        // sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
        // sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
        JavaSQLContext sqlCtx = new JavaSQLContext(sc);
        // rest of the method reconstructed from context: read the JSON, write it back as Parquet
        JavaSchemaRDD json = sqlCtx.jsonFile(jsonFilePath);
        json.saveAsParquetFile(parquetPath);
    }
}
Ohh... how did I miss that? :( Thanks!
Thanks Michael,
It worked like a charm! I have a few more queries:
1. Is there a way to control the size of the Parquet files?
2. Which method do you recommend: coalesce(n, true), coalesce(n, false) or
repartition(n)?
Thanks Regards
Tridib
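For question 1 above, one knob (my own note, not from the reply) is the number of partitions at save time, since each partition becomes one Parquet file. A minimal sketch, assuming Spark 1.1's SchemaRDD API and placeholder paths:

val people = sqlContext.jsonFile("/tmp/people.json")
// coalesce(n, shuffle = false) avoids a shuffle; repartition(n) is coalesce(n, shuffle = true)
// and rebalances the data at the cost of a shuffle.
people.coalesce(1).saveAsParquetFile("/tmp/people.parquet")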
sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
No luck.
Is there a way to control the size/number of parquet files generated?
Thanks
Tridib
Hello Experts,
I have 5 worker machines with different amounts of RAM. Is there a way to
configure each with a different executor memory?
Currently I see that every worker spins up one executor with the same amount of
memory.
Thanks Regards
Tridib
=2.4.0 -Phive -Phive-thriftserver -DskipTests
Is there anything I am missing?
Thanks
Tridib
of
Spark SQL on top of a Parquet file. Any suggestions?
Thanks Regards
Tridib
$class.saveAsParquetFile(SchemaRDDLike.scala:76)
at
org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)
Thanks Regards
Tridib
$.launch(SparkSubmit.scala:353)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks Regards
Tridib
Help please!
the compiler. Shall I replay
your session? I can re-run each line except the last one.
Thanks
Tridib
Date: Tue, 21 Oct 2014 09:39:49 -0700
Subject: Re: spark sql: join sql fails after sqlCtx.cacheTable()
From: ri...@infoobjects.com
To: tridib.sama...@live.com
CC: u...@spark.incubator.apache.org
? I can re-run each line except the last one.
[y/n]
Thanks
Tridib
From: terry@smartfocus.com
To: tridib.sama...@live.com; u...@spark.incubator.apache.org
Subject: Re: Unable to use HiveContext in spark-shell
Date: Thu, 6 Nov 2014 17:38:51 +
What version of Spark are you using?
Yes. I have the org.apache.hadoop.hive package in the Spark assembly.
I built spark-1.1.0 on a fresh new machine. This issue is gone! Thank you all
for your help.
Thanks Regards
Tridib
compiling
HiveContext.class.
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run each line except the last one.
[y/n]
Thanks
Tridib
= true)
|    |-- State: string (nullable = true)
|    |-- Hobby: string (nullable = true)
|    |-- Zip: string (nullable = true)
How do I create a StructField of StructType? I think that's what the root
is.
Thanks Regards
Tridib
How do I create a StructField of StructType? I need to create a nested
schema.
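A minimal sketch of nesting one StructType inside another (my own illustration, assuming Spark 1.1 where the types are imported from org.apache.spark.sql; the field names are placeholders):

import org.apache.spark.sql._

val addressType = StructType(Seq(
  StructField("City",  StringType, nullable = true),
  StructField("State", StringType, nullable = true),
  StructField("Zip",   StringType, nullable = true)))

// the nested struct is just a StructField whose dataType is another StructType
val schema = StructType(Seq(
  StructField("Name",    StringType,  nullable = true),
  StructField("Address", addressType, nullable = true)))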
Figured it out. spark-sql --master spark://sparkmaster:7077
:00
2013-11-11T00:00:00
2012-11-11T00:00:00Z
When I query using select * from date_test, it returns:
NULL
NULL
NULL
Could you please help me to resolve this issue?
Thanks
Tridib
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val personPath = "/hdd/spark/person.json"
val person = sqlContext.jsonFile(personPath)
person.printSchema()
person.registerTempTable("person")
val addressPath = "/hdd/spark/address.json"
val address = sqlContext.jsonFile(addressPath)
Hmm... I thought HiveContext would only work if Hive is present. I am curious
to know when to use HiveContext and when to use SQLContext.
Thanks Regards
Tridib
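For what it's worth, a minimal sketch of switching to a HiveContext (my own illustration; a HiveContext does not require an existing Hive installation, it just brings the more complete HiveQL parser, and the path here is a placeholder):

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val person = hiveCtx.jsonFile("/hdd/spark/person.json")
person.registerTempTable("person")
hiveCtx.cacheTable("person")
// the join-after-cacheTable queries from this thread would then run through hiveCtx.sql(...)
hiveCtx.sql("SELECT count(*) FROM person").collect().foreach(println)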
Thanks for pointing that out.
Any help or comments?
Yes, I am unable to get jsonFile() to detect the date type
automatically from the JSON data.
JavaSchemaRDD test = sqlCtx.jsonFile(path, createStructType());
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);
}
The input file has a single record:
{"timestamp":"2014-10-10T01:01:01"}
Thanks
Tridib
Function and creating a schema RDD from the parsed JavaRDD. Is there any performance
impact from not using the built-in jsonFile()?
Thanks
Tridib
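A rough sketch of that manual route (my own illustration, not the author's code), assuming Spark 1.1's applySchema API; the path is a placeholder and the string handling is deliberately crude:

import java.sql.Timestamp
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val schema = StructType(Seq(StructField("timestamp", TimestampType, nullable = true)))

val rows = sc.textFile("/hdd/spark/test.json").map { line =>
  // extract the value from a record like {"timestamp":"2014-10-10T01:01:01"};
  // a real job would use a proper JSON parser here
  val iso = line.split("\"timestamp\"\\s*:\\s*\"")(1).takeWhile(_ != '"')
  Row(Timestamp.valueOf(iso.replace("T", " ")))
}

val table = sqlContext.applySchema(rows, schema)
table.registerTempTable("test")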
Stack trace for my second case:
2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0]
executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in
stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
at
Spark 1.1.0
Hello Experts,
I have two tables built using jsonFile(). I can successfully run join queries
on these tables, but once I cacheTable() them, all join queries fail.
Here is the stack trace:
java.lang.NullPointerException
at