And if you want to use the SQL CLI (based on Catalyst) as it works in Shark,
you can also check out https://github.com/amplab/shark/pull/337 :)
This preview version doesn't require Hive to be set up in the cluster.
(Don't forget to also put hive-site.xml under SHARK_HOME/conf.)
Cheng Hao
))
sparkContext.makeRDD(rows).registerAsTable("foo")
sql("select k, count(*) from foo group by k").collect
res1: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])
Cheng Hao
From: Pei-Lun Lee [mailto:pl...@appier.com]
Sent: Wednesday, June 11, 2014 6:01 PM
To: user@spark.apache.org
Subject
Subject: RE: Hive From Spark
Hi Cheng Hao,
Thank you very much for your reply.
Basically, the program runs on Spark 1.0.0 and Hive 0.12.0.
Some of the environment setup was done by running SPARK_HIVE=true sbt/sbt
assembly/assembly, including the jar in all the workers, and copying the
hive
This is a very interesting problem. SparkSQL supports non-equi joins, but they
are very inefficient with large tables.
One possible solution is to partition both tables, with the partition key being
(cast(ds as bigint) / 240); then, with each partition in dataset1, you
probably can
Actually it's just a pseudo algorithm I described; you can do it with the Spark
API. Hope the algorithm is helpful.
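For illustration, a minimal spark-shell sketch of that bucketing idea (the Event
class, the sample data, and the 240-second window are assumptions for this example,
not your actual schema):

  // Runs in spark-shell, where sc is already defined.
  case class Event(ds: Long, payload: String)

  val dataset1 = sc.parallelize(Seq(Event(100L, "a"), Event(500L, "b")))
  val dataset2 = sc.parallelize(Seq(Event(110L, "x"), Event(900L, "y")))

  // Key both sides by a coarse time bucket, equi-join on the bucket,
  // then apply the exact non-equi timestamp predicate inside each small group.
  val joined = dataset1.keyBy(_.ds / 240)
    .join(dataset2.keyBy(_.ds / 240))
    .values
    .filter { case (a, b) => math.abs(a.ds - b.ds) <= 240 }

Note that pairs straddling a bucket boundary can be missed; a fuller version would
also emit each record into the adjacent bucket.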
-Original Message-
From: durga [mailto:durgak...@gmail.com]
Sent: Tuesday, July 22, 2014 11:56 AM
To: u...@spark.incubator.apache.org
Subject: RE: Joining by timestamp.
Hi Chen,
Durga, you can start from the documents
http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/programming-guide.html
-Original Message-
From: durga [mailto:durgak...@gmail.com]
Sent: Tuesday, July 22, 2014 12:45 PM
To:
In your code snippet, sample is actually a SchemaRDD, and a SchemaRDD is bound to
a certain SQLContext at runtime; I don't think we can manipulate/share a
SchemaRDD across SQLContext instances.
-Original Message-
From: Kevin Jung [mailto:itsjb.j...@samsung.com]
Sent: Tuesday, July
From the log, I noticed that substr was added on July 15th; the 1.0.1 release
should be earlier than that. The community is now working on releasing 1.1.0,
and some performance improvements were also added. You can probably try that
for your benchmark.
Cheng Hao
-Original Message
I couldn't reproduce the exception; it's probably solved in the latest code.
From: Vishal Vibhandik [mailto:vishal.vibhan...@gmail.com]
Sent: Thursday, August 14, 2014 11:17 AM
To: user@spark.apache.org
Subject: Spark SQL Stackoverflow error
Hi,
I tried running the sample sql code JavaSparkSQL
Currently SparkSQL doesn't support the row format/SerDe in CTAS. The workaround
is to create the table first.
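A minimal sketch of that workaround through a HiveContext (the table and column
names are made up):

  // 1. Create the table with the desired row format / SerDe first ...
  hiveContext.sql("""
    CREATE TABLE logs_csv (id BIGINT, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE""")

  // 2. ... then populate it with a plain INSERT ... SELECT instead of CTAS.
  hiveContext.sql("INSERT OVERWRITE TABLE logs_csv SELECT id, msg FROM raw_logs")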
-Original Message-
From: centerqi hu [mailto:cente...@gmail.com]
Sent: Tuesday, September 02, 2014 3:35 PM
To: user@spark.apache.org
Subject: Unsupported language features in
[mailto:cente...@gmail.com]
Sent: Tuesday, September 02, 2014 3:46 PM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Unsupported language features in query
Thanks Cheng Hao
Is there a way to obtain the list of Hive statements that Spark supports?
Thanks
2014-09-02 15:39 GMT+08:00 Cheng, Hao hao.ch
Hive can launch another job with a strategy to merge the small files; probably
we can also do that in a future release.
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, September 05, 2014 8:59 AM
To: DanteSama
Cc: u...@spark.incubator.apache.org
Subject: Re: SchemaRDD -
I copied the 3 datanucleus jars (datanucleus-api-jdo-3.2.1.jar,
datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar) to the folder lib/
manually, and it works for me.
From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Friday, September 12, 2014 11:28 AM
To: alexandria1101
Cc:
What's your Spark / Hadoop version? And also the hive-site.xml? Most cases like
that are caused by an incompatibility between the Hadoop client jar and the Hadoop cluster.
-Original Message-
From: linkpatrickliu [mailto:linkpatrick...@live.com]
Sent: Monday, September 15, 2014 2:35 PM
To:
The Hadoop client jar should be assembled into the uber-jar, but (I suspect)
it's probably not compatible with your Hadoop cluster.
Can you also paste the Spark uber-jar name? It will usually be under the path
lib/spark-assembly-1.1.0-xxx-hadoopxxx.jar.
-Original Message-
From:
Sorry, I am not able to reproduce that.
Can you try adding the following entries to the hive-site.xml? I know they have
default values, but let's make them explicit.
hive.server2.thrift.port
hive.server2.thrift.bind.host
hive.server2.authentication (NONE, KERBEROS, LDAP, PAM or CUSTOM)
Thank you for pasting the steps. I will look at this and hopefully come up with a
solution soon.
-Original Message-
From: linkpatrickliu [mailto:linkpatrick...@live.com]
Sent: Tuesday, September 16, 2014 3:17 PM
To: u...@spark.incubator.apache.org
Subject: RE: SparkSQL 1.1 hang when DROP
is working on upgrading Hive to 0.13 for SparkSQL
(https://github.com/apache/spark/pull/2241); not sure if you can wait for this.
☺
From: Yin Huai [mailto:huaiyin@gmail.com]
Sent: Wednesday, September 17, 2014 1:50 AM
To: Cheng, Hao
Cc: linkpatrickliu; u...@spark.incubator.apache.org
Subject
the null value
when retrieving HiveConf.
Cheng Hao
From: Du Li [mailto:l...@yahoo-inc.com.INVALID]
Sent: Thursday, September 18, 2014 7:51 AM
To: user@spark.apache.org; d...@spark.apache.org
Subject: problem with HiveContext inside Actor
Hi,
Wondering if anybody has had a similar experience or has any suggestions here
(1 :: 2 :: Nil).map(i => T(i.toString, new
java.sql.Timestamp(i)))
data.registerTempTable("x")
val s = sqlContext.sql("select a from x where ts >= '1970-01-01 00:00:00';")
s.collect
output:
res1: Array[org.apache.spark.sql.Row] = Array([1], [2])
Cheng Hao
From: Mohammed Guller [mailto:moham
Seems like a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of
the data types supported by Catalyst.
From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 11:44 PM
To: Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp
You needn't do anything, the implicit conversion should do this for you.
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L103
Not sure how the k-d tree is used in MLlib, but keep in mind a
SchemaRDD is just a normal RDD.
Cheng Hao
From: sanath kumar [mailto:sanath1...@gmail.com]
Sent: Wednesday, October 22, 2014 12:58 PM
To: user@spark.apache.org
Subject: spark sql query optimization , and decision tree building
Hi all
Can you paste the hive-site.xml? Most of the time I see this exception because
the JDBC driver for the Hive metastore is not set correctly, or the wrong driver
classes are included in the assembly jar.
By default, the assembly jar contains derby.jar, which is the embedded
Derby JDBC driver.
From:
.
Sorry if I missed some discussion of Hive upgrading.
Cheng Hao
Hi all, I noticed that when compiling SparkSQL with the profile hive-0.13.1,
it fetches Hive version 0.13.1a under the groupId
org.spark-project.hive. What's the difference from the one under
org.apache.hive? And where can I get the source code for re-compiling?
Thanks,
Cheng Hao
Which version are you using? I can reproduce that in the latest code, but with a
different exception.
I've filed a bug, https://issues.apache.org/jira/browse/SPARK-4263; can you
also add some information there?
Thanks,
Cheng Hao
-Original Message-
From: Kevin Paul [mailto:kevinpaulap
Can you try a query like “SELECT timestamp, CAST(timestamp as string) FROM logs
LIMIT 5”? I guess you probably ran into a timestamp precision or timezone
shifting problem.
(And it's not mandatory, but you'd better change the field name from
“timestamp” to something else, as “timestamp” is
Are all of your join keys the same? And I guess the join types are all “left”
joins; https://github.com/apache/spark/pull/3362 is probably what you need.
Also, SparkSQL doesn't currently support multiway joins (or multiway broadcast
joins); https://github.com/apache/spark/pull/3270 should be
Spark SQL doesn't currently handle DISTINCT well; particularly in the case you
described, it will cause all of the data to fall onto a single node and be kept
only in memory.
The dev community actually has solutions for this; it will probably be solved
after the release of Spark 1.2.
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Thursday, November 27, 2014 10:24 PM
To: Cheng, Hao
Cc: user
Subject: Re: Auto BroadcastJoin optimization failed in latest Spark
Hi Hao,
I'm using inner join as Broadcast join didn't work for left joins (thanks for
the links
/pull/3595 )
b. It expects the function return type to be immutable.Seq[XX] for List,
immutable.Map[X, X] for Map, scala.Product for Struct, and only Array[Byte] for
binary. The Array[_] is not supported.
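For example, a small sketch against the 1.2-era registerFunction API (the function
and table names are made up):

  import scala.collection.immutable

  // OK: an immutable.Seq result maps to a Catalyst ARRAY type.
  sqlContext.registerFunction("splitWords", (s: String) => immutable.Seq(s.split(" "): _*))

  // Not OK: an Array[String] result would hit the unsupported Array[_] case.
  // sqlContext.registerFunction("splitWordsBad", (s: String) => s.split(" "))

  sqlContext.sql("SELECT splitWords(line) FROM docs").collect()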
Cheng Hao
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Thursday, December 4
You can try writing your own Relation with filter push-down, or use
ParquetRelation2 as a workaround.
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala)
Cheng Hao
-Original Message-
From: Jerry Raj [mailto:jerry
It works exactly like Create Table As Select (CTAS) in Hive.
Cheng Hao
From: Anas Mosaad [mailto:anas.mos...@incorta.com]
Sent: Wednesday, December 10, 2014 11:59 AM
To: Michael Armbrust
Cc: Manoj Samel; user@spark.apache.org
Subject: Re: Can HiveContext be used without using Hive
As the error log shows, you may need to register it as:
sqlContext.registerFunction(“toHour”, toHour _)
The “_” means you are passing the function as a parameter, not invoking it.
Cheng Hao
From: Xuelin Cao [mailto:xuelin...@yahoo.com.INVALID]
Sent: Monday, December 15, 2014 5:28 PM
To: User
Hi, Lam, I can confirm this is a bug with the latest master, and I filed a jira
issue for this:
https://issues.apache.org/jira/browse/SPARK-4944
Hopefully I'll come up with a solution soon.
Cheng Hao
From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, December 24, 2014 4:26 AM
To: user
I am wondering if we can provide a more friendly API, rather than a configuration,
for this purpose. What do you think, Patrick?
Cheng Hao
-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Thursday, December 25, 2014 3:22 PM
To: Shao, Saisai
Cc: user@spark.apache.org
multiple parquet files
For the API sqlContext.parquetFile, we need to think about how to support multiple
paths in some other way.
Cheng Hao
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, December 25, 2014 1:01 PM
To: Daniel Siegmann
Cc: user@spark.apache.org
Subject: Re: Escape
The log showed it failed in parsing, so the typo stuff shouldn't be the root
cause. BUT I couldn't reproduce that with the master branch.
I did the test as follows:
sbt/sbt -Phadoop-2.3.0 -Phadoop-2.3 -Phive -Phive-0.13.1 hive/console
scala> sql(“SELECT user_id FROM actions where
It seems the netty jar has an incompatible method signature. Can you
check whether there are different versions of the netty jar in your classpath?
From: Walrus theCat [mailto:walrusthe...@gmail.com]
Sent: Sunday, January 18, 2015 3:37 PM
To: user@spark.apache.org
Subject: Re: SparkSQL 1.2.0 sources
Hi, BB
Ideally you can do the query like: select key, value.percent from
mytable_data lateral view explode(audiences) f as key, value limit 3;
But there is a bug in HiveContext:
https://issues.apache.org/jira/browse/SPARK-5237
I am working on it now; hopefully I'll have a patch soon.
Cheng
The Data Source API probably works for this purpose.
It supports column pruning and predicate push-down:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
Examples can also be found in the unit tests:
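For reference, a rough sketch (not the unit test referenced above) of a relation
that honors both, written against the roughly 1.3-style API; the tiny in-memory
dataset is made up:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
  import org.apache.spark.sql.types._

  class TinyRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
    private val data = Seq(("a", 1), ("b", 2))

    override def schema: StructType =
      StructType(StructField("k", StringType) :: StructField("v", IntegerType) :: Nil)

    // requiredColumns => column pruning; filters => predicates Spark pushed down.
    override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
      val kept = data.filter { case (k, _) =>
        filters.forall { case EqualTo("k", value) => k == value; case _ => true }
      }
      val rows = kept.map { case (k, v) =>
        Row.fromSeq(requiredColumns.map { case "k" => k; case "v" => v })
      }
      sqlContext.sparkContext.parallelize(rows)
    }
  }

Registering it through a RelationProvider (and CREATE TEMPORARY TABLE ... USING ...)
is what the interfaces.scala file above describes.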
Wow, glad to know that it works well. And sorry, the Jira is for another issue,
not the same case as here.
From: Bagmeet Behera [mailto:bagme...@gmail.com]
Sent: Saturday, January 17, 2015 12:47 AM
To: Cheng, Hao
Subject: Re: using hiveContext to select a nested Map-data-type from
The root cause of this is probably that identical “exprId”s of the
“AttributeReference” exist while doing a self-join with a “temp table” (temp table =
resolved logical plan).
I will do the bug fix and JIRA creation.
Cheng Hao
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent
Can you paste the error log?
From: Dai, Kevin [mailto:yun...@ebay.com]
Sent: Monday, January 5, 2015 6:29 PM
To: user@spark.apache.org
Subject: Implement customized Join for SparkSQL
Hi, All
Suppose I want to join two tables A and B as follows:
Select * from A join B on A.id = B.id
A is a
Are you using the SQLContext? I think the HiveContext is recommended.
Cheng Hao
From: Wush Wu [mailto:w...@bridgewell.com]
Sent: Thursday, February 12, 2015 2:24 PM
To: u...@spark.incubator.apache.org
Subject: Extract hour from Timestamp in Spark SQL
Dear all,
I am new to Spark SQL and have
It seems the elasticsearch-hadoop project was built with an old version of Spark,
and then you upgraded the Spark version in the execution environment. As far as I
know, the definition of StructField changed in Spark 1.2; can you confirm the
version problem first?
From: Todd Nist [mailto:tsind...@gmail.com]
Sent:
Not so sure about your intention, but something like SELECT sum(val1), sum(val2)
FROM table GROUP BY src, dest?
-Original Message-
From: Shailesh Birari [mailto:sbirar...@gmail.com]
Sent: Friday, March 20, 2015 9:31 AM
To: user@spark.apache.org
Subject: Spark SQL Self join with agreegate
Or you need to specify the jars either in the configuration or via:
bin/spark-sql --jars mysql-connector-xx.jar
From: fightf...@163.com [mailto:fightf...@163.com]
Sent: Monday, March 16, 2015 2:04 PM
To: sandeep vura; Ted Yu
Cc: user
Subject: Re: Re: Unable to instantiate
It doesn't take effect if you just put the jar files under the lib-managed/jars
folder; you need to put them on the classpath explicitly.
From: sandeep vura [mailto:sandeepv...@gmail.com]
Sent: Monday, March 16, 2015 2:21 PM
To: Cheng, Hao
Cc: fightf...@163.com; Ted Yu; user
Subject: Re: Re: Unable
You can add the additional jar when submitting your job, something like:
./bin/spark-submit --jars xx.jar …
More options can be listed by just typing ./bin/spark-submit
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 10, 2015 8:48 PM
To: user@spark.apache.org
Subject: Does
Currently, Spark SQL doesn't provide an interface for developing custom UDTFs,
but it works seamlessly with Hive UDTFs.
I am working on the UDTF refactoring for Spark SQL; hopefully it will provide a
Hive-independent UDTF soon after that.
From: shahab [mailto:shahab.mok...@gmail.com]
Sent:
check the configuration file of
$SPARK_HOME/conf/spark-xxx.conf ?
Cheng Hao
From: Grandl Robert [mailto:rgra...@yahoo.com.INVALID]
Sent: Thursday, March 12, 2015 5:07 AM
To: user@spark.apache.org
Subject: Spark SQL using Hive metastore
Hi guys,
I am a newbie in running Spark SQL / Spark. My goal
/pull/3247
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Wednesday, March 11, 2015 1:44 AM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how?
Thanks Hao,
But my question concerns UDAF (user defined aggregation function ) not UDTF
I am not so sure whether Hive supports changing the metastore after it is
initialized; I guess not. Spark SQL totally relies on the Hive metastore in
HiveContext; probably that's why it doesn't work as expected for Q1.
BTW, in most cases, people configure the metastore settings in
hive-site.xml, and will not
Intel has a prototype for doing this; SaiSai and Jason are the authors.
You can probably ask them for some materials.
From: Mohit Anchlia [mailto:mohitanch...@gmail.com]
Sent: Wednesday, March 11, 2015 8:12 AM
To: user@spark.apache.org
Subject: SQL with Spark Streaming
Does Spark Streaming also
This is a very interesting issue. The root reason for the lower performance is
probably that, for a Scala UDF, Spark SQL converts the data from the internal
representation to the Scala representation via Scala reflection, recursively.
Can you create a Jira issue for tracking this? I can start to work on
$.main(SparkSQLCLIDriver.scala:202)
at
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
Thanks,
Cheng Hao
It works after adding the -Djline.terminal=jline.UnsupportedTerminal
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Saturday, February 28, 2015 10:24 AM
To: user@spark.apache.org
Subject: JLine hangs under Windows8
Hi, All
I was trying to run spark sql cli
instance.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 3, 2015 7:56 AM
To: Cheng, Hao; user
Subject: RE: Is SQLContext thread-safe?
Thanks for the response.
Then I have another question: when will we want to create multiple SQLContext
instances
Copy those jars into the $SPARK_HOME/lib/
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
see https://github.com/apache/spark/blob/master/bin/compute-classpath.sh#L120
-Original Message-
From: fanooos [mailto:dev.fano...@gmail.com]
Sent: Tuesday,
Can you provide the detailed failure call stack?
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 3, 2015 3:52 PM
To: user@spark.apache.org
Subject: Supporting Hive features in Spark SQL Thrift JDBC server
Hi,
According to Spark SQL documentation, Spark SQL supports the
I am not so sure how Spark SQL is compiled in CDH, but if the -Phive and
-Phive-thriftserver flags weren't specified during the build, most likely it will
not work just by providing the Hive lib jars later on. For example, does the
HiveContext class exist in the assembly jar?
I am also quite
https://issues.apache.org/jira/browse/SPARK-2087
https://github.com/apache/spark/pull/4382
I am working on the prototype; it will be updated soon.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 3, 2015 8:32 AM
To: Cheng, Hao; user
Subject: RE
Using where('age >= 10 || 'age <= 4) instead.
-Original Message-
From: Guillermo Ortiz [mailto:konstt2...@gmail.com]
Sent: Tuesday, March 3, 2015 5:14 PM
To: user
Subject: SparkSQL, executing an OR
I'm trying to execute a query with Spark.
(Example from the Spark Documentation)
val teenagers
Hive UDFs are only applicable for HiveContext and its subclass instances; is the
CassandraAwareSQLContext a direct subclass of HiveContext or SQLContext?
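For comparison, a minimal sketch of where Hive UDFs do resolve (the custom UDF
class name is hypothetical):

  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)
  // Built-in Hive UDFs resolve through HiveContext out of the box ...
  hc.sql("SELECT parse_url('http://example.com/a?b=1', 'HOST')").collect()
  // ... and custom Hive UDF jars can be registered the Hive way.
  hc.sql("CREATE TEMPORARY FUNCTION my_udf AS 'com.example.udf.MyUDF'")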
From: shahab [mailto:shahab.mok...@gmail.com]
Sent: Tuesday, March 3, 2015 5:10 PM
To: Cheng, Hao
Cc: user@spark.apache.org
Subject: Re
Using the SchemaRDD / DataFrame API via HiveContext
Assuming you're using the latest code, something probably like:
val hc = new HiveContext(sc)
import hc.implicits._
existedRdd.toDF().insertInto("hivetable")
or
existedRdd.toDF().registerTempTable("mydata")
hc.sql("insert into hivetable select xxx
As the call stack shows, the mongodb connector is not compatible with the Spark
SQL Data Source interface. The Data Source API has changed since 1.2;
you probably need to confirm which Spark version the MongoDB connector was built
against.
By the way, a well-formatted call stack would be more
This is actually quite an open question. From my understanding, there are
probably a few ways to tune it, such as:
* SQL configurations, for example:
  spark.sql.autoBroadcastJoinThreshold (default: 10 * 1024 * 1024)
  spark.sql.defaultSizeInBytes (default: 10 * 1024 * 1024 + 1)
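A minimal sketch of overriding these at runtime (the values are arbitrary
examples, not recommendations):

  // Raise the broadcast-join threshold to 50 MB for this SQLContext ...
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
  // ... or set any SQL configuration through a SET statement.
  sqlContext.sql("SET spark.sql.shuffle.partitions=400")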
Yes it is thread safe, at least it's supposed to be.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Monday, March 2, 2015 4:43 PM
To: user
Subject: Is SQLContext thread-safe?
Hi, is it safe to use the same SQLContext to do Select operations in different
threads
I’ve tried with the latest code and it seems to work; which version are you using, Shahab?
From: yana [mailto:yana.kadiy...@gmail.com]
Sent: Wednesday, March 4, 2015 8:47 PM
To: shahab; user@spark.apache.org
Subject: RE: Does SparkSQL support . having count (fieldname) in SQL
statement?
I think the
” while starting the spark shell.
From: Anusha Shamanur [mailto:anushas...@gmail.com]
Sent: Wednesday, March 4, 2015 5:07 AM
To: Cheng, Hao
Subject: Re: Spark SQL Thrift Server start exception :
java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
Hi,
I am getting
Can you query it via Hive? Let's first confirm whether it's a bug in SparkSQL or
in your PHP code.
-Original Message-
From: fanooos [mailto:dev.fano...@gmail.com]
Sent: Thursday, March 5, 2015 4:57 PM
To: user@spark.apache.org
Subject: Connection PHP application to Spark Sql thrift server
We
How many reducers did you set for Hive? With a small data set, Hive will run in
local mode, which always sets the reducer count to 1.
From: Kannan Rajah [mailto:kra...@maprtech.com]
Sent: Thursday, February 26, 2015 3:02 AM
To: Cheng Lian
Cc: user@spark.apache.org
Subject: Re: Spark-SQL 1.2.0 sort
Hi, can you describe a little bit how the ThriftServer crashed, or the steps to
reproduce it? It's probably a bug in the ThriftServer.
Thanks,
From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk]
Sent: Friday, April 24, 2015 9:55 AM
To: Arush Kharbanda
Cc: user
Subject: Re: Re: problem
Can you print out the physical plan?
EXPLAIN SELECT xxx…
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Reply: Re: sparksql running slow while joining 2 tables.
Hi Olivier,
Spark 1.3.1, with Java 1.8.0.45,
and I've added 2 pics
I assume you’re using the DataFrame API within your application.
sql(“SELECT…”).explain(true)
From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Reply: RE: Reply: Re: sparksql running slow while joining_2_tables.
You can use
, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Reply: Re: sparksql running slow while joining_2_tables.
Hi guys,
attached the pics of the physical plan and the logs. Thanks.
Thanks & Best regards!
罗辉 San.Luo
- Original Message -
From: Cheng, Hao hao.ch
Or, have you ever tried a broadcast join?
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Reply: Re: sparksql running slow while joining 2 tables.
Can you print out the physical plan?
EXPLAIN SELECT xxx
Spark SQL just takes JDBC as another data source, the same as how it supports
loading data from a .csv or .json file.
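A minimal sketch of reading a JDBC table the same way as any other source
(1.4-era DataFrameReader; the connection details are placeholders):

  val jdbcDF = sqlContext.read.format("jdbc").options(Map(
    "url"     -> "jdbc:mysql://dbhost:3306/sales",
    "dbtable" -> "orders",
    "driver"  -> "com.mysql.jdbc.Driver")).load()

  jdbcDF.registerTempTable("orders")
  sqlContext.sql("SELECT count(*) FROM orders").show()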
From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 2:30 PM
To: User
Subject: What's the advantage features of Spark SQL(JDBC)
Hi All,
Comparing
You probably can try something like:
val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key
group by c1")
df.cache() // Cache the result, but it's a lazy execution.
df.registerTempTable("my_result")
sqlContext.sql("select * from my_result where c1=1").collect // the cache
Yes.
From: Yi Zhang [mailto:zhangy...@yahoo.com]
Sent: Friday, May 15, 2015 2:51 PM
To: Cheng, Hao; User
Subject: Re: What's the advantage features of Spark SQL(JDBC)
@Hao,
As you said, there is no advantage feature for JDBC; it just provides a unified
API to support different data sources
Forgot to import the implicit functions/classes?
import sqlContext.implicits._
From: Rajdeep Dua [mailto:rajdeep@gmail.com]
Sent: Monday, May 18, 2015 8:08 AM
To: user@spark.apache.org
Subject: InferredSchema Example in Spark-SQL
Hi All,
Was trying the inferred schema Spark example
Typo? Should be .toDF(), not .toRD()
From: Ram Sriharsha [mailto:sriharsha@gmail.com]
Sent: Monday, May 18, 2015 8:31 AM
To: Rajdeep Dua
Cc: user
Subject: Re: InferredSchema Example in Spark-SQL
you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring
schema from the
The DataFrame API should be perfectly helpful in this case.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
Some code snippet will like:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
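A fuller sketch of the inferred-schema flow, following the 1.3 programming-guide
pattern (Person and the sample data are made up):

  case class Person(name: String, age: Int)

  val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31))).toDF()
  people.registerTempTable("people")

  // The schema (name: string, age: int) was inferred from the case class.
  sqlContext.sql("SELECT name FROM people WHERE age > 30").collect()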
, but that's still ongoing.
Cheng Hao
From: Masf [mailto:masfwo...@gmail.com]
Sent: Thursday, April 2, 2015 11:47 PM
To: user@spark.apache.org
Subject: Spark SQL. Memory consumption
Hi.
I'm using Spark SQL 1.2. I have this query:
CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?
-Original Message-
From: ogoh [mailto:oke...@gmail.com]
Sent: Friday, June 5, 2015 10:10 AM
To: user@spark.apache.org
Subject: SparkSQL : using Hive UDF returning Map throws error:
scala.MatchError: interface
Confirmed: with the latest master we don't support complex data types for simple
Hive UDFs. Do you mind filing an issue in jira?
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Friday, June 5, 2015 12:35 PM
To: ogoh; user@spark.apache.org
Subject: RE: SparkSQL : using
Is it a large result set returned from the Thrift Server? And can you paste the
SQL and the physical plan?
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, June 9, 2015 12:01 PM
To: Sourav Mazumder
Cc: user
Subject: Re: Spark SQL with Thrift Server is very very slow and finally failing
Yes, it should be with HiveContext, not SQLContext.
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Tuesday, June 23, 2015 2:51 AM
To: smazumder
Cc: user
Subject: Re: Support for Windowing and Analytics functions in Spark SQL
1.4 supports it
On 23 Jun 2015 02:59, Sourav Mazumder
It's actually not that tricky.
SPARK_WORKER_CORES is the max size of the executor's task thread pool, which is
the same as saying “one executor with 32 cores can execute 32 tasks
simultaneously”. Spark doesn't care how many real physical CPUs/cores you have
(the OS does), so
Not sure if the Spark RDD will provide an API to fetch records one by one from the
final result set, instead of pulling them all (or whole partitions of data) into
driver memory.
Seems like a big change.
From: Cheng Lian [mailto:l...@databricks.com]
Sent: Friday, June 12, 2015 3:51 PM
To:
Not sure if Spark Core will provide an API to fetch records one by one from the
block manager, instead of pulling them all into driver memory.
From: Cheng Lian [mailto:l...@databricks.com]
Sent: Friday, June 12, 2015 3:51 PM
To: 姜超才; Hester wang; user@spark.apache.org
Subject: Re: Reply:
Yes, it is thread safe. That’s how Spark SQL JDBC Server works.
Cheng Hao
From: V Dineshkumar [mailto:developer.dines...@gmail.com]
Sent: Wednesday, June 17, 2015 9:44 PM
To: user@spark.apache.org
Subject: Is HiveContext Thread Safe?
Hi,
I have a HiveContext which I am using in multiple
Seems you're hitting the self-join case; currently Spark SQL won't cache any
result/logical tree for further analysis or computation of a self-join. Since the
logical tree is huge, it's reasonable for generating its tree string recursively
to take a long time. And I also doubt the computation can finish
Yes, but be sure you put hive-site.xml on your classpath.
Did you run into any problems?
Cheng Hao
From: Sanjay Subramanian [mailto:sanjaysubraman...@yahoo.com.INVALID]
Sent: Thursday, May 28, 2015 8:53 AM
To: user
Subject: Pointing SparkSQL to existing Hive Metadata with data file locations
Thanks for reporting this.
We intend to support multiple metastore versions in a single build (hive-0.13.1)
by introducing the IsolatedClientLoader, but you're probably hitting a bug; please
file a jira issue for this.
I will keep investigating this as well.
Hao
From: Mark Hamstra
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false.
BTW, which version are you using?
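A one-line sketch of turning it off for the current context (setConf exists on
SQLContext/HiveContext):

  sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")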
Hao
From: Jerrick Hoang [mailto:jerrickho...@gmail.com]
Sent: Thursday, August 20, 2015 12:16 PM
To: Philip Weaver
Cc: user
Subject: Re: Spark Sql behaves strangely with tables with
20, 2015 1:46 PM
To: Cheng, Hao
Cc: Philip Weaver; user
Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions
I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs
trying to speed up spark sql with tables with a huge number of partitions, I've
made
The first job is to infer the JSON schema, and the second one is the query you
actually mean.
You can provide the schema while loading the JSON file, like below:
sqlContext.read.schema(xxx).json(“…”)?
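A minimal sketch with a made-up two-column schema and path (1.4/1.5-era
DataFrameReader API):

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // With an explicit schema, the separate schema-inference job is skipped.
  val df = sqlContext.read.schema(schema).json("/path/to/events.json")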
Hao
From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Monday, August 24, 2015 6:20 PM
To: