RE: Is Spark-1.0.0 not backward compatible with Shark-0.9.1 ?

2014-06-10 Thread Cheng, Hao
And if you want to use the SQL CLI (based on Catalyst) as it works in Shark, you can also check out https://github.com/amplab/shark/pull/337 :) This preview version doesn't require Hive to be set up in the cluster. (Don't forget to put the hive-site.xml under SHARK_HOME/conf as well.) Cheng Hao

RE: Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Cheng, Hao
)) sparkContext.makeRDD(rows).registerAsTable("foo") sql("select k, count(*) from foo group by k").collect res1: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300]) Cheng Hao From: Pei-Lun Lee [mailto:pl...@appier.com] Sent: Wednesday, June 11, 2014 6:01 PM To: user@spark.apache.org Subject

RE: Hive From Spark

2014-07-20 Thread Cheng, Hao
Subject: RE: Hive From Spark Hi Cheng Hao, Thank you very much for your reply. Basically, the program runs on Spark 1.0.0 and Hive 0.12.0 . Some setups of the environment are done by running SPARK_HIVE=true sbt/sbt assembly/assembly, including the jar in all the workers, and copying the hive

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
This is a very interesting problem. SparkSQL supports non-equi joins, but they are very inefficient on large tables. One possible solution is to partition both tables, with partition keys of (cast(ds as bigint) / 240); then, for each partition in dataset1, you probably can
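
A minimal sketch of the bucketing idea described above, assuming two RDDs keyed by a timestamp in seconds and the illustrative 240-second window (ds1, ds2 and the proximity predicate are assumptions, not from the original thread):

    // Bucket both datasets by a 240-second window so the non-equi join
    // becomes an equi join on the bucket key.
    val bucketed1 = ds1.map { case (ts, row) => (ts / 240, (ts, row)) }
    val bucketed2 = ds2.map { case (ts, row) => (ts / 240, (ts, row)) }

    val joined = bucketed1.join(bucketed2)
      // Re-check the real proximity predicate inside each bucket; records near
      // a bucket boundary would also need a pass against the adjacent bucket.
      .filter { case (_, ((t1, _), (t2, _))) => math.abs(t1 - t2) <= 240 }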

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
Actually it's just a pseudo algorithm I described; you can do it with the Spark API. Hope the algorithm is helpful. -Original Message- From: durga [mailto:durgak...@gmail.com] Sent: Tuesday, July 22, 2014 11:56 AM To: u...@spark.incubator.apache.org Subject: RE: Joining by timestamp. Hi Chen,

RE: Joining by timestamp.

2014-07-21 Thread Cheng, Hao
Durga, you can start from the documents http://spark.apache.org/docs/latest/quick-start.html http://spark.apache.org/docs/latest/programming-guide.html -Original Message- From: durga [mailto:durgak...@gmail.com] Sent: Tuesday, July 22, 2014 12:45 PM To:

RE: SparkSQL can not use SchemaRDD from Hive

2014-07-29 Thread Cheng, Hao
In your code snippet, sample is actually a SchemaRDD, and a SchemaRDD binds to a certain SQLContext at runtime; I don't think we can manipulate/share a SchemaRDD across SQLContext instances. -Original Message- From: Kevin Jung [mailto:itsjb.j...@samsung.com] Sent: Tuesday, July

RE: Substring in Spark SQL

2014-08-04 Thread Cheng, Hao
From the log, I noticed that substr was added on July 15th; the 1.0.1 release should be earlier than that. The community is now working on releasing 1.1.0, which also adds some performance improvements. Probably you can try that for your benchmark. Cheng Hao -Original Message

RE: Spark SQL Stackoverflow error

2014-08-14 Thread Cheng, Hao
I couldn't reproduce the exception; probably it's solved in the latest code. From: Vishal Vibhandik [mailto:vishal.vibhan...@gmail.com] Sent: Thursday, August 14, 2014 11:17 AM To: user@spark.apache.org Subject: Spark SQL Stackoverflow error Hi, I tried running the sample sql code JavaSparkSQL

RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
Currently SparkSQL doesn't support the row format/serde in CTAS. The workaround is to create the table first. -Original Message- From: centerqi hu [mailto:cente...@gmail.com] Sent: Tuesday, September 02, 2014 3:35 PM To: user@spark.apache.org Subject: Unsupported language features in

RE: Unsupported language features in query

2014-09-02 Thread Cheng, Hao
[mailto:cente...@gmail.com] Sent: Tuesday, September 02, 2014 3:46 PM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Unsupported language features in query Thanks Cheng Hao Is there a way to obtain the list of Hive statements that Spark supports? Thanks 2014-09-02 15:39 GMT+08:00 Cheng, Hao hao.ch

RE: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread Cheng, Hao
Hive can launch another job with a strategy to merge the small files; probably we can also do that in a future release. From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, September 05, 2014 8:59 AM To: DanteSama Cc: u...@spark.incubator.apache.org Subject: Re: SchemaRDD -

RE: Spark SQL JDBC

2014-09-11 Thread Cheng, Hao
I copied the 3 datanucleus jars (datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar) to the folder lib/ manually, and it works for me. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 11:28 AM To: alexandria1101 Cc:

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread Cheng, Hao
What's your Spark / Hadoop version? And also the hive-site.xml? Most cases like that are caused by a Hadoop client jar that is incompatible with the Hadoop cluster. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Monday, September 15, 2014 2:35 PM To:

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread Cheng, Hao
The Hadoop client jar should be assembled into the uber-jar, but (I suspect) it's probably not compatible with your Hadoop cluster. Can you also paste the Spark uber-jar name? It will usually be under the path lib/spark-assembly-1.1.0-xxx-hadoopxxx.jar. -Original Message- From:

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread Cheng, Hao
Sorry, I am not able to reproduce that. Can you try adding the following entries into the hive-site.xml? I know they have default values, but let's make them explicit: hive.server2.thrift.port, hive.server2.thrift.bind.host, hive.server2.authentication (NONE, KERBEROS, LDAP, PAM or CUSTOM)

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
Thank you for pasting the steps. I will look at this and hopefully come out with a solution soon. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Tuesday, September 16, 2014 3:17 PM To: u...@spark.incubator.apache.org Subject: RE: SparkSQL 1.1 hang when DROP

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
is working on upgrading the Hive to 0.13 for SparkSQL (https://github.com/apache/spark/pull/2241), not sure if you can wait for this. ☺ From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Wednesday, September 17, 2014 1:50 AM To: Cheng, Hao Cc: linkpatrickliu; u...@spark.incubator.apache.org Subject

RE: problem with HiveContext inside Actor

2014-09-17 Thread Cheng, Hao
the null value when retrieving HiveConf. Cheng Hao From: Du Li [mailto:l...@yahoo-inc.com.INVALID] Sent: Thursday, September 18, 2014 7:51 AM To: user@spark.apache.org; d...@spark.apache.org Subject: problem with HiveContext inside Actor Hi, Wonder anybody had similar experience or any suggestion here

RE: Spark SQL parser bug?

2014-10-12 Thread Cheng, Hao
(1 :: 2 :: Nil).map(i => T(i.toString, new java.sql.Timestamp(i))) data.registerTempTable("x") val s = sqlContext.sql("select a from x where ts >= '1970-01-01 00:00:00';") s.collect output: res1: Array[org.apache.spark.sql.Row] = Array([1], [2]) Cheng Hao From: Mohammed Guller [mailto:moham

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Cheng, Hao
Seems a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of the data types supported by Catalyst. From: Ge, Yao (Y.) [mailto:y...@ford.com] Sent: Sunday, October 19, 2014 11:44 PM To: Wang, Daoyuan; user@spark.apache.org Subject: RE: scala.MatchError: class java.sql.Timestamp

RE: SchemaRDD Convert

2014-10-22 Thread Cheng, Hao
You needn't do anything, the implicit conversion should do this for you. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L103

RE: spark sql query optimization , and decision tree building

2014-10-22 Thread Cheng, Hao
Not sure about how the k-d tree is used in MLlib, but keep in mind that a SchemaRDD is just a normal RDD. Cheng Hao From: sanath kumar [mailto:sanath1...@gmail.com] Sent: Wednesday, October 22, 2014 12:58 PM To: user@spark.apache.org Subject: spark sql query optimization , and decision tree building Hi all

RE: Create table error from Hive in spark-assembly-1.0.2.jar

2014-10-26 Thread Cheng, Hao
Can you paste the hive-site.xml? Most of the times I meet this exception, it is because the JDBC driver for the Hive metastore is not set correctly, or the wrong driver classes are included in the assembly jar. By default, the assembly jar contains derby.jar, which is the embedded Derby JDBC driver. From:

Support Hive 0.13 .1 in Spark SQL

2014-10-27 Thread Cheng, Hao
. Sorry if I missed some discussion of Hive upgrading. Cheng Hao

Spark SQL Hive Version

2014-11-05 Thread Cheng, Hao
Hi, all, I noticed that when compiling SparkSQL with the profile hive-0.13.1, it fetches Hive version 0.13.1a under groupId org.spark-project.hive. What's the difference from the one under org.apache.hive? And where can I get the source code for re-compiling? Thanks, Cheng Hao

RE: [SQL] PERCENTILE is not working

2014-11-05 Thread Cheng, Hao
Which version are you using? I can reproduce that in the latest code, but with a different exception. I've filed a bug, https://issues.apache.org/jira/browse/SPARK-4263; can you also add some information there? Thanks, Cheng Hao -Original Message- From: Kevin Paul [mailto:kevinpaulap

RE: SparkSQL Timestamp query failure

2014-11-23 Thread Cheng, Hao
Can you try a query like “SELECT timestamp, CAST(timestamp AS string) FROM logs LIMIT 5”? I guess you probably ran into the timestamp precision or the timezone shifting problem. (And it's not mandatory, but you'd better change the field name from “timestamp” to something else, as “timestamp” is
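
A minimal sketch of this diagnostic in the shell (the logs table and column names are assumptions):

    // Compare the native rendering of the column with its string cast to
    // spot precision loss or timezone shifting.
    sqlContext.sql("SELECT `timestamp`, CAST(`timestamp` AS string) FROM logs LIMIT 5")
      .collect()
      .foreach(println)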

RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Cheng, Hao
Are all of your join keys the same? And I guess the join types are all “left” joins; https://github.com/apache/spark/pull/3362 probably is what you need. Also, SparkSQL doesn't support multiway joins (or multiway broadcast joins) currently; https://github.com/apache/spark/pull/3270 should be

RE: Spark SQL performance and data size constraints

2014-11-26 Thread Cheng, Hao
Spark SQL doesn't support DISTINCT well currently; in particular, in the case you described, it will make all of the data fall onto a single node and be kept in memory only. The dev community actually has solutions for this; it probably will be solved after the release of Spark 1.2.

RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Cheng, Hao
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Thursday, November 27, 2014 10:24 PM To: Cheng, Hao Cc: user Subject: Re: Auto BroadcastJoin optimization failed in latest Spark Hi Hao, I'm using inner join as Broadcast join didn't work for left joins (thanks for the links

RE: Spark SQL UDF returning a list?

2014-12-03 Thread Cheng, Hao
/pull/3595 ) b. It expects the function return type to be immutable.Seq[XX] for List, immutable.Map[X, X] for Map, scala.Product for Struct, and only Array[Byte] for binary. The Array[_] is not supported. Cheng Hao From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday, December 4

RE: Spark SQL with a sorted file

2014-12-03 Thread Cheng, Hao
You can try to write your own Relation with filter push-down, or use the ParquetRelation2 as a workaround. (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala) Cheng Hao -Original Message- From: Jerry Raj [mailto:jerry

RE: Can HiveContext be used without using Hive?

2014-12-09 Thread Cheng, Hao
It works exactly like Create Table As Select (CTAS) in Hive. Cheng Hao From: Anas Mosaad [mailto:anas.mos...@incorta.com] Sent: Wednesday, December 10, 2014 11:59 AM To: Michael Armbrust Cc: Manoj Samel; user@spark.apache.org Subject: Re: Can HiveContext be used without using Hive

RE: Why my SQL UDF cannot be registered?

2014-12-15 Thread Cheng, Hao
As the error log shows, you may need to register it as: sqlContext.registerFunction(“toHour”, toHour _) The “_” means you are passing the function as a parameter, not invoking it. Cheng Hao From: Xuelin Cao [mailto:xuelin...@yahoo.com.INVALID] Sent: Monday, December 15, 2014 5:28 PM To: User
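
A minimal sketch of the distinction (the toHour body here is an illustrative assumption):

    // Hypothetical UDF body: pull the hour out of an "HH:mm:ss" string.
    def toHour(s: String): Int = s.split(":")(0).toInt

    // `toHour _` passes the function value itself to registerFunction;
    // `toHour(...)` would invoke it instead of registering it.
    sqlContext.registerFunction("toHour", toHour _)
    sqlContext.sql("SELECT toHour(time_col) FROM logs")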

RE: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Cheng, Hao
Hi, Lam, I can confirm this is a bug with the latest master, and I filed a JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-4944 Hope to come up with a solution soon. Cheng Hao From: Jerry Lam [mailto:chiling...@gmail.com] Sent: Wednesday, December 24, 2014 4:26 AM To: user

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a more friendly API, rather than a configuration, for this purpose. What do you think, Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: user@spark.apache.org

RE: Escape commas in file names

2014-12-25 Thread Cheng, Hao
multiple parquet files for the API sqlContext.parquetFile, we need to think about how to support multiple paths in some other way. Cheng Hao From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, December 25, 2014 1:01 PM To: Daniel Siegmann Cc: user@spark.apache.org Subject: Re: Escape

RE: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Cheng, Hao
The log showed it failed in parsing, so the typo stuff shouldn't be the root cause. BUT I couldn't reproduce that with the master branch. I did the test as follows: sbt/sbt -Phadoop-2.3.0 -Phadoop-2.3 -Phive -Phive-0.13.1 hive/console scala> sql(“SELECT user_id FROM actions where

RE: SparkSQL 1.2.0 sources API error

2015-01-18 Thread Cheng, Hao
It seems the netty jar works with an incompatible method signature. Can you check if there are different versions of the netty jar in your classpath? From: Walrus theCat [mailto:walrusthe...@gmail.com] Sent: Sunday, January 18, 2015 3:37 PM To: user@spark.apache.org Subject: Re: SparkSQL 1.2.0 sources

RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-15 Thread Cheng, Hao
Hi, BB Ideally you can do the query like: select key, value.percent from mytable_data lateral view explode(audiences) f as key, value limit 3; But there is a bug in HiveContext: https://issues.apache.org/jira/browse/SPARK-5237 I am working on it now; hopefully I will make a patch soon. Cheng

RE: Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Cheng, Hao
The Data Source API probably works for this purpose. It supports column pruning and predicate push-down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples also can be found in the unit test:

RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-17 Thread Cheng, Hao
Wow, glad to know that it works well, and sorry, the Jira is another issue, which is not the same case here. From: Bagmeet Behera [mailto:bagme...@gmail.com] Sent: Saturday, January 17, 2015 12:47 AM To: Cheng, Hao Subject: Re: using hiveContext to select a nested Map-data-type from

RE: [SQL] Self join with ArrayType columns problems

2015-01-27 Thread Cheng, Hao
The root cause for this is probably that identical “exprId”s of the “AttributeReference” exist when doing a self-join with a “temp table” (temp table = resolved logical plan). I will do the bug fixing and JIRA creation. Cheng Hao From: Michael Armbrust [mailto:mich...@databricks.com] Sent

RE: Implement customized Join for SparkSQL

2015-01-05 Thread Cheng, Hao
Can you paste the error log? From: Dai, Kevin [mailto:yun...@ebay.com] Sent: Monday, January 5, 2015 6:29 PM To: user@spark.apache.org Subject: Implement customized Join for SparkSQL Hi, All Suppose I want to join two tables A and B as follows: Select * from A join B on A.id = B.id A is a

RE: Extract hour from Timestamp in Spark SQL

2015-02-15 Thread Cheng, Hao
Are you using the SQLContext? I think the HiveContext is recommended. Cheng Hao From: Wush Wu [mailto:w...@bridgewell.com] Sent: Thursday, February 12, 2015 2:24 PM To: u...@spark.incubator.apache.org Subject: Extract hour from Timestamp in Spark SQL Dear all, I am new to Spark SQL and have
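
A minimal sketch of the HiveContext route, where Hive's built-in hour() is available (table and column names are assumptions):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // hour() resolves as a Hive built-in under HiveContext; the plain
    // SQLContext of this era does not know it.
    hiveContext.sql("SELECT hour(event_time) FROM events").collect()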

RE: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Cheng, Hao
Seems the elasticsearch-hadoop project was built with an old version of Spark, and then you upgraded the Spark version in the execution env. As far as I know, StructField changed its definition in Spark 1.2; can you confirm the version problem first? From: Todd Nist [mailto:tsind...@gmail.com] Sent:

RE: Spark SQL Self join with agreegate

2015-03-19 Thread Cheng, Hao
Not so sure about your intention, but something like SELECT sum(val1), sum(val2) FROM table GROUP BY src, dest ? -Original Message- From: Shailesh Birari [mailto:sbirar...@gmail.com] Sent: Friday, March 20, 2015 9:31 AM To: user@spark.apache.org Subject: Spark SQL Self join with agreegate

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread Cheng, Hao
Or you need to specify the jars, either in the configuration or via bin/spark-sql --jars mysql-connector-xx.jar From: fightf...@163.com [mailto:fightf...@163.com] Sent: Monday, March 16, 2015 2:04 PM To: sandeep vura; Ted Yu Cc: user Subject: Re: Re: Unable to instantiate

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread Cheng, Hao
It doesn't take effect if you just put the jar files under the lib-managed/jars folder; you need to put them on the class path explicitly. From: sandeep vura [mailto:sandeepv...@gmail.com] Sent: Monday, March 16, 2015 2:21 PM To: Cheng, Hao Cc: fightf...@163.com; Ted Yu; user Subject: Re: Re: Unable

RE: Does any one know how to deploy a custom UDAF jar file in SparkSQL?

2015-03-10 Thread Cheng, Hao
You can add the additional jar when submitting your job, something like: ./bin/spark-submit --jars xx.jar … More options can be listed by just typing ./bin/spark-submit From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 8:48 PM To: user@spark.apache.org Subject: Does

RE: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-10 Thread Cheng, Hao
Currently, Spark SQL doesn't provide an interface for developing custom UDTFs, but it can work seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, and hopefully will provide a Hive-independent UDTF interface soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent:

RE: Spark SQL using Hive metastore

2015-03-11 Thread Cheng, Hao
check the configuration file of $SPARK_HOME/conf/spark-xxx.conf ? Cheng Hao From: Grandl Robert [mailto:rgra...@yahoo.com.INVALID] Sent: Thursday, March 12, 2015 5:07 AM To: user@spark.apache.org Subject: Spark SQL using Hive metastore Hi guys, I am a newbie in running Spark SQL / Spark. My goal

RE: Registering custom UDAFs with HiveConetxt in SparkSQL, how?

2015-03-10 Thread Cheng, Hao
/pull/3247 From: shahab [mailto:shahab.mok...@gmail.com] Sent: Wednesday, March 11, 2015 1:44 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Registering custom UDAFs with HiveConetxt in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function ) not UDTF

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Cheng, Hao
I am not so sure if Hive supports changing the metastore after initialization; I guess not. Spark SQL totally relies on the Hive metastore in HiveContext; probably that's why it doesn't work as expected for Q1. BTW, in most cases, people configure the metastore settings in hive-site.xml, and will not

RE: SQL with Spark Streaming

2015-03-10 Thread Cheng, Hao
Intel has a prototype for doing this, SaiSai and Jason are the authors. Probably you can ask them for some materials. From: Mohit Anchlia [mailto:mohitanch...@gmail.com] Sent: Wednesday, March 11, 2015 8:12 AM To: user@spark.apache.org Subject: SQL with Spark Streaming Does Spark Streaming also

RE: Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread Cheng, Hao
This is a very interesting issue. The root reason for the lower performance is probably that, in a Scala UDF, Spark SQL converts the data from the internal representation to the Scala representation via Scala reflection, recursively. Can you create a JIRA issue for tracking this? I can start to work on

JLine hangs under Windows8

2015-02-27 Thread Cheng, Hao
$.main(SparkSQLCLIDriver.scala:202) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) Thanks, Cheng Hao

RE: JLine hangs under Windows8

2015-02-27 Thread Cheng, Hao
It works after adding the -Djline.terminal=jline.UnsupportedTerminal -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Saturday, February 28, 2015 10:24 AM To: user@spark.apache.org Subject: JLine hangs under Windows8 Hi, All I was trying to run spark sql cli

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
instance. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, March 3, 2015 7:56 AM To: Cheng, Hao; user Subject: RE: Is SQLContext thread-safe? Thanks for the response. Then I have another question: when will we want to create multiple SQLContext instances

RE: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-02 Thread Cheng, Hao
Copy those jars into the $SPARK_HOME/lib/ datanucleus-api-jdo-3.2.6.jar datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar see https://github.com/apache/spark/blob/master/bin/compute-classpath.sh#L120 -Original Message- From: fanooos [mailto:dev.fano...@gmail.com] Sent: Tuesday,

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Can you provide the detailed failure call stack? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 3:52 PM To: user@spark.apache.org Subject: Supporting Hive features in Spark SQL Thrift JDBC server Hi, According to Spark SQL documentation, Spark SQL supports the

RE: Executing hive query from Spark code

2015-03-02 Thread Cheng, Hao
I am not so sure how Spark SQL is compiled in CDH, but if the -Phive and -Phive-thriftserver flags weren't specified during the build, most likely it will not work just by providing the Hive lib jars later on. For example, does the HiveContext class exist in the assembly jar? I am also quite

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
https://issues.apache.org/jira/browse/SPARK-2087 https://github.com/apache/spark/pull/4382 I am working on the prototype, but will be updated soon. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, March 3, 2015 8:32 AM To: Cheng, Hao; user Subject: RE

RE: SparkSQL, executing an OR

2015-03-03 Thread Cheng, Hao
Use where('age >= 10 || 'age <= 4) instead. -Original Message- From: Guillermo Ortiz [mailto:konstt2...@gmail.com] Sent: Tuesday, March 3, 2015 5:14 PM To: user Subject: SparkSQL, executing an OR I'm trying to execute a query with Spark. (Example from the Spark Documentation) val teenagers

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Hive UDFs are only applicable to HiveContext and its subclass instances; is the CassandraAwareSQLContext a direct subclass of HiveContext or SQLContext? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 5:10 PM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re

RE: insert Hive table with RDD

2015-03-03 Thread Cheng, Hao
Use the SchemaRDD / DataFrame API via HiveContext. Assuming you're using the latest code, something probably like: val hc = new HiveContext(sc) import hc.implicits._ existedRdd.toDF().insertInto("hivetable") or existedRdd.toDF().registerTempTable("mydata") hc.sql("insert into hivetable as select xxx

RE: java.lang.IncompatibleClassChangeError when using PrunedFilteredScan

2015-03-03 Thread Cheng, Hao
As the call stack shows, the mongodb connector is not compatible with the Spark SQL Data Source interface. The latest Data Source API is changed since 1.2, probably you need to confirm which spark version the MongoDB Connector build against. By the way, a well format call stack will be more

RE: Performance tuning in Spark SQL.

2015-03-02 Thread Cheng, Hao
This is actually a quite open question. From my understanding, there are probably ways to tune like: * SQL configurations, e.g.: spark.sql.autoBroadcastJoinThreshold (default value: 10 * 1024 * 1024), spark.sql.defaultSizeInBytes (default value: 10 * 1024 * 1024 + 1)
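
A minimal sketch of adjusting one of these knobs programmatically (the 64 MB value is an illustrative assumption):

    // Broadcast dimension tables up to 64 MB instead of the 10 MB default,
    // trading driver/executor memory for fewer shuffle joins.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
      (64 * 1024 * 1024).toString)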

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
Yes it is thread safe, at least it's supposed to be. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4:43 PM To: user Subject: Is SQLContext thread-safe? Hi, is it safe to use the same SQLContext to do Select operations in different threads

RE: Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread Cheng, Hao
I've tried with the latest code and it seems to work. Which version are you using, Shahab? From: yana [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, March 4, 2015 8:47 PM To: shahab; user@spark.apache.org Subject: RE: Does SparkSQL support . having count (fieldname) in SQL statement? I think the

RE: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-03 Thread Cheng, Hao
” while starting the spark shell. From: Anusha Shamanur [mailto:anushas...@gmail.com] Sent: Wednesday, March 4, 2015 5:07 AM To: Cheng, Hao Subject: Re: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory Hi, I am getting

RE: Connection PHP application to Spark Sql thrift server

2015-03-05 Thread Cheng, Hao
Can you run the query against Hive directly? Let's first confirm whether it's a bug of SparkSQL or of your PHP code. -Original Message- From: fanooos [mailto:dev.fano...@gmail.com] Sent: Thursday, March 5, 2015 4:57 PM To: user@spark.apache.org Subject: Connection PHP application to Spark Sql thrift server We

RE: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Cheng, Hao
How many reducers did you set for Hive? With a small data set, Hive will run in local mode, which always sets the reducer count to 1. From: Kannan Rajah [mailto:kra...@maprtech.com] Sent: Thursday, February 26, 2015 3:02 AM To: Cheng Lian Cc: user@spark.apache.org Subject: Re: Spark-SQL 1.2.0 sort

RE: Re: problem with spark thrift server

2015-04-23 Thread Cheng, Hao
Hi, can you describe a little bit how the ThriftServer crashed, or steps to reproduce that? It’s probably a bug of ThriftServer. Thanks, From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk] Sent: Friday, April 24, 2015 9:55 AM To: Arush Kharbanda Cc: user Subject: Re: Re: problem

RE: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Can you print out the physical plan? EXPLAIN SELECT xxx… From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Monday, May 4, 2015 9:08 PM To: Olivier Girardot; user Subject: Re: Re: sparksql running slow while joining 2 tables. hi Olivier spark1.3.1, with java1.8.0.45, and added 2 pics

Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Cheng, Hao
I assume you're using the DataFrame API within your application. sql(“SELECT…”).explain(true) From: Wang, Daoyuan Sent: Tuesday, May 5, 2015 10:16 AM To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user Subject: RE: Re: RE: Re: sparksql running slow while joining_2_tables. You can use

RE: Re: sparksql running slow while joining_2_tables.

2015-05-05 Thread Cheng, Hao
, Hao; Wang, Daoyuan; Olivier Girardot; user Subject: Re: Re: sparksql running slow while joining_2_tables. Hi guys, attached the pics of the physical plan and logs. Thanks. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Cheng, Hao hao.ch

RE: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Or, have you ever tried broadcast join? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, May 5, 2015 8:33 AM To: luohui20...@sina.com; Olivier Girardot; user Subject: RE: Re: sparksql running slow while joining 2 tables. Can you print out the physical plan? EXPLAIN SELECT xxx

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Cheng, Hao
Spark SQL just takes JDBC as a new data source, the same way we support loading data from a .csv or .json file. From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 2:30 PM To: User Subject: What's the advantage features of Spark SQL(JDBC) Hi All, Comparing

RE: question about sparksql caching

2015-05-15 Thread Cheng, Hao
You probably can try something like: val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1") df.cache() // Cache the result, but it's a lazy execution. df.registerTempTable("my_result") sqlContext.sql("select * from my_result where c1 = 1").collect // the cache

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Cheng, Hao
Yes. From: Yi Zhang [mailto:zhangy...@yahoo.com] Sent: Friday, May 15, 2015 2:51 PM To: Cheng, Hao; User Subject: Re: What's the advantage features of Spark SQL(JDBC) @Hao, As you said, there is no advantage feature for JDBC, it just provides unified api to support different data sources

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Forgot to import the implicit functions/classes? import sqlContext.implicits._ From: Rajdeep Dua [mailto:rajdeep@gmail.com] Sent: Monday, May 18, 2015 8:08 AM To: user@spark.apache.org Subject: InferredSchema Example in Spark-SQL Hi All, Was trying the Inferred Schema spart example

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Typo? Should be .toDF(), not .toRD() From: Ram Sriharsha [mailto:sriharsha@gmail.com] Sent: Monday, May 18, 2015 8:31 AM To: Rajdeep Dua Cc: user Subject: Re: InferredSchema Example in Spark-SQL you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring schema from the
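
A minimal sketch of the inferred-schema flow under discussion, following the Person example from the Spark programming guide (the file path is an assumption):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._ // brings toDF() into scope

    // The schema (name: String, age: Int) is inferred from the case class.
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()
    people.registerTempTable("people")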

RE: Spark Avarage

2015-04-06 Thread Cheng, Hao
The DataFrame API should be perfectly helpful in this case. https://spark.apache.org/docs/1.3.0/sql-programming-guide.html Some code snippet will look like: val sqlContext = new org.apache.spark.sql.SQLContext(sc) // this is used to implicitly convert an RDD to a DataFrame. import
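
A minimal sketch of computing an average with the DataFrame API along these lines (the Reading schema and file are assumptions):

    import org.apache.spark.sql.functions.avg

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._ // implicit RDD-to-DataFrame conversion

    case class Reading(sensor: String, value: Double)
    val df = sc.textFile("readings.csv")
      .map(_.split(","))
      .map(r => Reading(r(0), r(1).toDouble))
      .toDF()

    df.groupBy("sensor").agg(avg("value")).show() // one average per sensor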

RE: Spark SQL. Memory consumption

2015-04-02 Thread Cheng, Hao
, but that's still ongoing. Cheng Hao From: Masf [mailto:masfwo...@gmail.com] Sent: Thursday, April 2, 2015 11:47 PM To: user@spark.apache.org Subject: Spark SQL. Memory consumption Hi. I'm using Spark SQL 1.2. I have this query: CREATE TABLE test_MA STORED AS PARQUET AS SELECT

RE: SparkSQL : using Hive UDF returning Map throws rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-04 Thread Cheng, Hao
Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0? -Original Message- From: ogoh [mailto:oke...@gmail.com] Sent: Friday, June 5, 2015 10:10 AM To: user@spark.apache.org Subject: SparkSQL : using Hive UDF returning Map throws rror: scala.MatchError: interface

RE: SparkSQL : using Hive UDF returning Map throws rror: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Cheng, Hao
Confirmed: with the latest master, we don't support complex data types for simple Hive UDFs. Do you mind filing an issue in JIRA? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, June 5, 2015 12:35 PM To: ogoh; user@spark.apache.org Subject: RE: SparkSQL : using

RE: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-09 Thread Cheng, Hao
Is it a large result set returned from the Thrift Server? And can you paste the SQL and the physical plan? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, June 9, 2015 12:01 PM To: Sourav Mazumder Cc: user Subject: Re: Spark SQL with Thrift Server is very very slow and finally failing

RE: Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread Cheng, Hao
Yes, it should be with HiveContext, not SQLContext. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Tuesday, June 23, 2015 2:51 AM To: smazumder Cc: user Subject: Re: Support for Windowing and Analytics functions in Spark SQL 1.4 supports it On 23 Jun 2015 02:59, Sourav Mazumder
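
A minimal sketch of a window function under HiveContext in 1.4 (table and column names are assumptions):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // OVER clauses parse and resolve under HiveContext as of Spark 1.4.
    hiveContext.sql(
      """SELECT name, dept, salary,
        |       rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS r
        |FROM employees""".stripMargin).show()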

RE: Question about SPARK_WORKER_CORES and spark.task.cpus

2015-06-22 Thread Cheng, Hao
It's actually not that tricky. SPARK_WORKER_CORES is the max task thread pool size of the executor; it's the same as saying “one executor with 32 cores could execute 32 tasks simultaneously”. Spark doesn't care about how many real physical CPUs/cores you have (the OS does), so

RE: Re: Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if the Spark RDD will provide an API to fetch the records one by one from the final result set, instead of pulling them all (or whole-partition data) and fitting them in the driver memory. Seems a big change. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To:

RE: Re: Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if Spark Core will provide an API to fetch the records one by one from the block manager, instead of pulling them all into the driver memory. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To: 姜超才; Hester wang; user@spark.apache.org Subject: Re: Re:

RE: Is HiveContext Thread Safe?

2015-06-17 Thread Cheng, Hao
Yes, it is thread safe. That’s how Spark SQL JDBC Server works. Cheng Hao From: V Dineshkumar [mailto:developer.dines...@gmail.com] Sent: Wednesday, June 17, 2015 9:44 PM To: user@spark.apache.org Subject: Is HiveContext Thread Safe? Hi, I have a HiveContext which I am using in multiple

RE: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Cheng, Hao
Seems you're hitting the self-join; currently Spark SQL won't cache any result/logical tree for further analyzing or computing in a self-join. Since the logical tree is huge, it's reasonable that it takes a long time to generate its tree string recursively. And I also doubt the computation can finish

RE: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Cheng, Hao
Yes, but be sure you put the hive-site.xml under your class path. Did you meet any problem? Cheng Hao From: Sanjay Subramanian [mailto:sanjaysubraman...@yahoo.com.INVALID] Sent: Thursday, May 28, 2015 8:53 AM To: user Subject: Pointing SparkSQL to existing Hive Metadata with data file locations

RE: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Cheng, Hao
Thanks for reporting this. We intend to support multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, but probably you're hitting a bug; please file a JIRA issue for it. I will keep investigating this as well. Hao From: Mark Hamstra

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false. BTW, which version are you using? Hao From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 12:16 PM To: Philip Weaver Cc: user Subject: Re: Spark Sql behaves strangely with tables with
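
A minimal sketch of flipping that flag on an existing context (the session setup is assumed):

    // Skip the file-system scan that discovers partitions for
    // data-source tables.
    sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")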

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
20, 2015 1:46 PM To: Cheng, Hao Cc: Philip Weaver; user Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs trying to speed up spark sql with tables with a huge number of partitions, I've made

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
The first job is to infer the json schema, and the second one is the query you meant. You can provide the schema while loading the json file, like below: sqlContext.read.schema(xxx).json(“…”)? Hao From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Monday, August 24, 2015 6:20 PM To:
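
A minimal sketch of supplying the schema up front so the inference job is skipped (the fields and path are assumptions):

    import org.apache.spark.sql.types._

    // Hypothetical schema matching the json records.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)))

    // With an explicit schema, no extra job is launched to sample and infer it.
    val df = sqlContext.read.schema(schema).json("events.json")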
