Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Official-Docker-container-for-Spark-tp22977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Friends,
What is the right syntax to query on a collection of nested objects? I have the
following schema and SQL, but it does not return anything. Is the syntax
correct?
root
 |-- id: string (nullable = false)
 |-- employee: array (nullable = false)
 |    |-- element: struct (containsNull = true)
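For reference, the approach that eventually worked for me was flattening the array with HiveContext's LATERAL VIEW explode (a sketch; it assumes each employee struct has a name field and that 'employee0' is a value in my data):

```sql
-- Sketch: flatten the employee array, then filter on a struct field
SELECT id
FROM department LATERAL VIEW explode(employee) e AS emp
WHERE emp.name = 'employee0';
```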
Hi Friends,
I want to write a UDF which takes a variable number of arguments of varying
types, e.g.:
myudf(String key1, String value1, String key2, int value2, ...)
What is the best way to do it in Spark?
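Lacking direct support for this in the UDF registration API, one workaround is the plain-Java varargs pattern sketched below (this is not Spark API; the class name, myudf, and the key=value encoding are all hypothetical):

```java
// Sketch of a varargs workaround (plain Java, no Spark API): accept
// alternating key/value pairs as Object... and inspect each value's
// runtime type inside the function body.
public class VarArgsUdfSketch {
    public static String myudf(Object... kvPairs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < kvPairs.length; i += 2) {
            String key = (String) kvPairs[i];   // keys are assumed to be strings
            Object value = kvPairs[i + 1];      // value may be String, Integer, ...
            sb.append(key).append('=').append(value).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mixed value types, variable arity:
        System.out.println(myudf("key1", "v1", "key2", 42));   // key1=v1;key2=42;
    }
}
```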
Thanks
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3
Did you get any solution to this? I am getting the same issue.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24759.html
By skewed, did you mean it's not distributed uniformly across partitions?
All of my columns are strings of almost the same size, i.e.
id1,field11,fields12
id2,field21,field22
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-spark-sql-shuffle-partitions-per-query-tp24781.html
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am able to
join two 1-billion-row tables in 3 minutes.
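For anyone else hitting this, the setting can also be applied per-session from the SQL side (2000 is just what worked for my data sizes):

```sql
-- Applies to the current session only
SET spark.sql.shuffle.partitions=2000;
```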
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24782.html
Using HiveContext solved it.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p21807.html
Hi,
I need to find the top 10 best-selling samples, so the query looks like:
select s.name, count(s.name) from sample s group by s.name order by
count(s.name)
This query fails with following error:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [COUNT(name#0) ASC], true
Thanks & Regards
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-spark-function-on-parquet-without-sql-tp21833.html
Somehow my posts are not getting accepted, and replies are not visible here.
But I got the following reply from Zhan.
From Zhan Zhang's reply, yes, I still get Parquet's advantage.
My next question is: if I operate on a SchemaRDD, will I get the advantage of
Spark SQL's in-memory columnar store whe
Hello,
Is there a built-in function for getting the median and standard deviation in
Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling
doubleRDD.stats(), but stats() does not include the median.
What is the most efficient way to get the median?
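For now I compute the two statistics outside Spark; a plain-Java sketch of what I mean (illustration only, not Spark API; stats() reports the population standard deviation, which is what stddev below computes):

```java
import java.util.Arrays;

// Plain-Java sketch: median via sorting, population standard deviation
// via the mean of squared deviations from the mean.
public class MedianSketch {
    public static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return (n % 2 == 1)
                ? sorted[n / 2]                              // odd count: middle element
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0; // even count: mean of middle two
    }

    public static double stddev(double[] xs) {
        double mean = Arrays.stream(xs).average().orElse(0.0);
        double var = Arrays.stream(xs)
                           .map(x -> (x - mean) * (x - mean))
                           .average().orElse(0.0);
        return Math.sqrt(var);
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 3.0, 2.0, 4.0};
        System.out.println("median = " + median(xs));   // median = 2.5
        System.out.println("stddev = " + stddev(xs));
    }
}
```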
Thanks & Regards
Tridib
--
I am having exactly the same issue. I am running HBase and Spark in Docker
containers.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-HTable-constructor-hangs-tp4926p22696.html
JavaSchemaRDD test = sqlCtx.jsonFile(path, createStructType());
sqlCtx.registerRDDAsTable(test, "test");
execSql(sqlCtx, "select * from test", 1);
}
Input file has a single record:
{"timestamp":"2014-10-10T01:01:01"}
Thanks
Function and creating a
schema RDD from the parsed JavaRDD. Is there any performance impact from not
using the built-in jsonFile()?
Thanks
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881
Stack trace for my second case:
2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0]
executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in
stage 0.0 (TID 0)
scala.MatchError: TimestampType (of class
org.apache.spark.sql.catalyst.types.TimestampType$)
at
org.ap
Spark 1.1.0
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-timestamp-in-json-fails-tp16864p16888.html
Hello Experts,
I have two tables built using jsonFile(). I can successfully run join queries
on these tables, but once I cacheTable(), all join queries fail.
Here is the stack trace:
java.lang.NullPointerException
at
org.apache.spark.sql.columnar.InMemoryRelation.statistics$lzycompute(InMemoryCol
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val personPath = "/hdd/spark/person.json"
val person = sqlContext.jsonFile(personPath)
person.printSchema()
person.registerTempTable("person")
val addressPath = "/hdd/spark/address.json"
val address = sqlContext.jsonFile(addressPath)
address.
Hmm... I thought HiveContext will only work if Hive is present. I am curious
to know when to use HiveContext and when to use SQLContext.
Thanks & Regards
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sq
Thanks for pointing that out.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p16933.html
Any help? or comments?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16939.html
Yes, I am unable to get jsonFile() to detect the date type automatically from
the JSON data.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-tp16881p16974.html
llowing data:
2014-12-11 00:00:00
2013-11-11T00:00:00
2012-11-11T00:00:00Z
when I query using "select * from date_test" it returns:
NULL
NULL
NULL
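A workaround I am considering (a sketch; it assumes the field is kept as a string and that the column is named ts, which is hypothetical):

```sql
-- Keep the field as string in the schema, then cast explicitly
SELECT CAST(ts AS timestamp) FROM date_test;
```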
Could you please help me to resolve this issue?
Thanks
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3
I want to submit queries to the Spark cluster using spark-sql. I am using the
Hive metastore; it's in HDFS.
But when I query, it does not look like the query gets submitted to the Spark
cluster. I don't see any entry in the master web UI. How can I confirm the
behavior?
--
View this message in context:
http://apache-spark-
Figured it out. spark-sql --master spark://sparkmaster:7077
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/submit-query-to-spark-cluster-using-spark-sql-tp17182p17183.html
= true)
 |    |-- State: string (nullable = true)
 |    |-- Hobby: string (nullable = true)
 |    |-- Zip: string (nullable = true)
How do I create a StructField of StructType? I think that's what the "root"
is.
Thanks & Regards
Tridib
--
View this message in context:
http:
How do I create a StructField of StructType? I need to create a nested
schema.
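In DDL terms, the nested shape I am after looks like this (a HiveQL-style sketch; State/Hobby/Zip are from my schema dump, the outer field names are hypothetical):

```sql
-- Address is itself a struct: a StructField whose data type is a StructType
CREATE TABLE person (
  Name    STRING,
  Address STRUCT<State: STRING, Hobby: STRING, Zip: STRING>
);
```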
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/StructField-of-StructType-tp18091.html
A signature
in HiveContext.class refers to term conf
in value org.apache.hadoop.hive which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
That entry seems to have
Help please!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18280.html
Yes. I have the org.apache.hadoop.hive package in the Spark assembly.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18322.html
I built spark-1.1.0 on a fresh machine and the issue is gone! Thank you all
for your help.
Thanks & Regards
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-use-HiveContext-in-spark-shell-tp18261p18324.html
emaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:76)
at
org.apache.spark.sql.api.java.JavaSchemaRDD.saveAsParquetFile(JavaSchemaRDD.scala:42)
Thanks & Regards
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-save-to-Parquet-fi
of
spark sql on top of parquet file. Any suggestion?
Thanks & Regards
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/sum-avg-group-by-specified-ranges-tp19187.html
Hello Experts,
I have 5 worker machines with different amounts of RAM. Is there a way to
configure them with different executor memory?
Currently I see that all workers spin up one executor with the same amount of
memory.
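In standalone mode, the only per-machine knob I know of is an override in each worker's conf/spark-env.sh (a sketch; the value is just an example):

```shell
# On each worker machine, in conf/spark-env.sh (standalone mode):
# total memory this worker offers to executors
SPARK_WORKER_MEMORY=16g
```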
Thanks & Regards
Tridib
--
View this message in context:
http://apache-spark-
rsion=2.4.0 -Phive -Phive-thriftserver -DskipTests
Is there anything I am missing?
Thanks
Tridib
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-broken-tp19536.html
;
sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
No luck.
Is there a way to control the size/number of parquet files generated?
Thanks
Tridib
--
View this message in context:
http://apache-spark-
I am experimenting with two files and trying to generate one Parquet file.
public class CompactParquetGenerator implements Serializable {
public void generateParquet(JavaSparkContext sc, String jsonFilePath,
String parquetPath) {
//int MB_128 = 128*1024*1024;
//sc.hadoopConfigur
public void generateParquet(JavaSparkContext sc, String jsonFilePath,
String parquetPath) {
//int MB_128 = 128*1024*1024;
//sc.hadoopConfiguration().setInt("dfs.blocksize", MB_128);
//sc.hadoopConfiguration().setInt("parquet.block.size", MB_128);
JavaSQLContext s
Ohh... how could I miss that. :( Thanks!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Control-number-of-parquet-generated-from-JavaSchemaRDD-tp19717p19788.html
Thanks Michael,
It worked like a charm! I have a few more queries:
1. Is there a way to control the size of the Parquet files?
2. Which method do you recommend: coalesce(n, true), coalesce(n, false), or
repartition(n)?
Thanks & Regards
Tridib
--
View this message in context:
http://apache-spark-
Thanks all for your replies. I was evaluating which one fits me best. I
picked epahomov/docker-spark from the Docker registry, and it meets my needs.
Thanks
Tridib
Date: Fri, 22 May 2015 14:15:42 +0530
Subject: Re: Official Docker container for Spark
From: riteshoneinamill...@gmail.com
To: 917361
Thanks for your response, Yong! The array syntax works fine, but I am not sure
how to use explode. Should I use it as follows?
select id from department where explode(employee).name = 'employee0'
This query gives me java.lang.UnsupportedOperationException. I am using
HiveContext.
From: java8...@hotmai
Well, I figured out a way to use explode. But it returns two rows if there are
two matches in the nested array objects.
select id from department LATERAL VIEW explode(employee) dummy_table as emp
where emp.name = 'employee0'
I was looking for an operator that loops through the array and returns true if
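One way to collapse the duplicate rows from the LATERAL VIEW query above (a sketch):

```sql
-- DISTINCT de-duplicates ids that match more than one array element
SELECT DISTINCT id
FROM department LATERAL VIEW explode(employee) dummy_table AS emp
WHERE emp.name = 'employee0';
```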
Actually, I just realized I am using 1.2.0.
Thanks
Tridib
Date: Thu, 26 Feb 2015 12:37:06 +0530
Subject: Re: group by order by fails
From: ak...@sigmoidanalytics.com
To: tridib.sama...@live.com
CC: user@spark.apache.org
Which version of Spark are you running? It seems there was a similar Jira
an alias to the count in the select clause and use that alias in the
order by clause.
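Spelled out, the suggested workaround looks like this (a sketch; DESC and LIMIT added for the top-10 requirement):

```sql
-- Alias the aggregate in SELECT, then order by the alias
SELECT s.name, COUNT(s.name) AS cnt
FROM sample s
GROUP BY s.name
ORDER BY cnt DESC
LIMIT 10;
```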
hread with this subject.
Cheers
On Tue, Apr 28, 2015 at 7:12 PM, tridib wrote:
I am exactly having same issue. I am running hbase and spark in docker
container.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-HTable-constructor-hangs-tp4926p22696
When I run the spark-job jar standalone and execute the HBase client from a
main method, it works fine. The same client is unable to connect (hangs) when
the jar is distributed in Spark.
Thanks
Tridib
Date: Tue, 28 Apr 2015 21:25:41 -0700
Subject: Re: HBase HTable constructor hangs
From: yuzhih...@gmai
(HConnectionManager.java:1054)
at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
at org.apache.hadoop.hbase.client.HTable.&lt;init&gt;(HTable.java:192)
Thanks
Tridib
From: d
the HBase release you're using has the following fix?
HBASE-8 non environment variable solution for "IllegalAccessError"
Cheers
On Tue, Apr 28, 2015 at 10:47 PM, Tridib Samanta
wrote:
I turned on TRACE and I see a lot of the following exception:
java.lang.IllegalAccessEr
org.apache.hadoop.hive which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run
ilable.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
HiveContext.class.
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run each line except the last
)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Thanks & Regards
Tridib