RE: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Cheng, Hao
Congratulations!! Jerry, you really deserve it.

Hao

-Original Message-
From: Mridul Muralidharan [mailto:mri...@gmail.com] 
Sent: Tuesday, August 29, 2017 12:04 PM
To: Matei Zaharia 
Cc: dev ; Saisai Shao 
Subject: Re: Welcoming Saisai (Jerry) Shao as a committer

Congratulations Jerry, well deserved !

Regards,
Mridul

On Mon, Aug 28, 2017 at 6:28 PM, Matei Zaharia  wrote:
> Hi everyone,
>
> The PMC recently voted to add Saisai (Jerry) Shao as a committer. Saisai has 
> been contributing to many areas of the project for a long time, so it’s great 
> to see him join. Join me in thanking and congratulating him!
>
> Matei
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



RE: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Cheng, Hao
-1

This breaks existing applications that use Script Transformation in Spark SQL: the default record/column delimiter class changed because we no longer get the default conf value from HiveConf; see SPARK-16515.

This is a regression.
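For reference, a minimal script-transformation query of the kind affected (table, columns and script are illustrative); with no explicit ROW FORMAT clause, the delimiters were previously taken from the HiveConf defaults:

    SELECT TRANSFORM (key, value)
    USING '/bin/cat'
    AS (k, v)
    FROM src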


From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, July 15, 2016 7:26 AM
To: dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

Updated documentation at 
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs-updated/



On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin 
> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.0.0. 
The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.0
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.0-rc4 (e5f8c1117e0c48499f54d62b556bc693435afae0).

This release candidate resolves ~2500 issues: 
https://s.apache.org/spark-2.0.0-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1192/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/


=
How can I help test this release?
=
If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions from 1.x.

==
What justifies a -1 vote for this release?
==
Critical bugs impacting major functionalities.

Bugs already present in 1.x, missing features, or bugs related to new features 
will not necessarily block this release. Note that historically Spark 
documentation has been published on the website separately from the main 
release so we do not need to block the release due to documentation errors 
either.


Note: There was a mistake made during "rc3" preparation, and as a result there 
is no "rc3", but only "rc4".




RE: new datasource

2015-11-19 Thread Cheng, Hao
I think you will probably need to write some code to support ES; there are two options as I understand it:

Create a new data source from scratch, in which case you need to implement the interfaces at:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L751

Or reuse most of the code in ParquetRelation in your new data source, adding your own logic; see
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L285

Hope it helps.
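For the first option, a minimal relation sketch against the 1.5/1.6-era external data source API (class name, schema and logic are purely illustrative; the real work of translating filters into an Elasticsearch query is only hinted at in a comment):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources._
    import org.apache.spark.sql.types._

    // Hypothetical ES-backed relation: receives pruned columns and pushed-down filters.
    class EsRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
      override def schema: StructType =
        StructType(Seq(StructField("id", StringType), StructField("value", IntegerType)))

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // Translate `filters` into an Elasticsearch query here, then read the matching rows.
        sqlContext.sparkContext.emptyRDD[Row]
      }
    }

    // Entry point looked up when the data source is referenced by name.
    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
        new EsRelation(sqlContext)
    }

Note that buildScan only receives the filters Spark SQL decides to push down; UDF-based conditions are generally not among them, which is the limitation James describes below.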

Hao
From: james.gre...@baesystems.com [mailto:james.gre...@baesystems.com]
Sent: Thursday, November 19, 2015 11:14 PM
To: dev@spark.apache.org
Subject: new datasource



We have written a new Spark DataSource that uses both Parquet and 
ElasticSearch.  It is based on the existing Parquet DataSource.   When I look 
at the filters being pushed down to buildScan I don’t get anything representing 
any filters based on UDFs – or for any fields generated by an explode – I had 
thought if I made it a CatalystScan I would get everything I needed.



This is fine from the Parquet point of view – but we are using ElasticSearch to 
index/filter the data we are searching and I need to be able to capture the UDF 
conditions – or have access to the Plan AST in order that I can construct a 
query for ElasticSearch.



I am thinking I might just need to patch Spark to do this – but I’d prefer not 
too if there is a way of getting round this without hacking the core code.  Any 
ideas?



Thanks



James




RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
I am not sure what the best practice is for this specific problem, but it's really worth thinking about for 2.0, as it is a painful issue for lots of users.

By the way, is this also an opportunity to deprecate the RDD API (or make it internal-only)? Lots of its functionality overlaps with DataFrame or Dataset.

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org; Reynold Xin
Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with 
Spark 2.0 we can also look at better classpath isolation with user programs. I 
propose we build on spark.{driver|executor}.userClassPathFirst, setting it true 
by default, and not allow any spark transitive dependencies to leak into user 
code. For backwards compatibility we can have a whitelist if we want but I'd be 
good if we start requiring user apps to explicitly pull in all their 
dependencies. From what I can tell, Hadoop 3 is also moving in this direction.
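For reference, a sketch of the opt-in form that exists today, which this proposal would effectively turn on by default (typically set in spark-defaults.conf or via --conf):

    spark.driver.userClassPathFirst    true
    spark.executor.userClassPathFirst  true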

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas 
> wrote:

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 
2.0 without removing or replacing them immediately. That way 2.0 doesn’t have 
to wait for everything that we want to deprecate to be replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander 
> wrote:
Parameter Server is a new feature and thus does not match the goal of 2.0, which is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time, I would be happy to have that feature.

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.
With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

Best regards, Alexander

From: Nan Zhu [mailto:zhunanmcg...@gmail.com]
Sent: Thursday, November 12, 2015 7:28 AM
To: wi...@qq.com
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

To be specific about the Parameter Server, I think the current agreement is that PS shall exist as a third-party library instead of a component of the core code base, isn't it?

Best,

--
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com 
wrote:
Who has ideas about machine learning? Spark is missing some features for machine learning, for example a parameter server.


在 2015年11月12日,05:32,Matei Zaharia 
> 写道:

I like the idea of popping out Tachyon to an optional component too to reduce 
the number of dependencies. In the future, it might even be useful to do this 
for Hadoop, but it requires too many API changes to be worth doing now.

Regarding Scala 2.12, we should definitely support it eventually, but I don't 
think we need to block 2.0 on that because it can be added later too. Has 
anyone investigated what it would take to run on there? I imagine we don't need 
many code changes, just maybe some REPL stuff.

Needless to say, but I'm all for the idea of making "major" releases as 
undisruptive as possible in the model Reynold proposed. Keeping everyone 
working with the same set of releases is super important.

Matei

On Nov 11, 2015, at 4:58 AM, Sean Owen 
> wrote:

On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin 
> wrote:
to the Spark community. A major release should not be very different from a
minor release and should not be gated based on new features. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs (examples follow).

Agree with this stance. Generally, a major release might also be a
time to replace some big old API or implementation with a new one, but
I don't see obvious candidates.

I wouldn't mind turning attention to 2.x sooner than later, unless
there's a fairly good reason to continue adding features in 1.x to a
1.7 release. The scope as of 1.6 is already pretty darned big.


1. Scala 2.11 as the default build. We should still 

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
Agreed, more features/APIs/optimizations need to be added to DF/DS.

I mean, we need to think about what kinds of RDD APIs we have to provide to developers; maybe the fundamental APIs are enough, like ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as we can do the same thing easily with DF/DS, with even better performance.
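A minimal illustration of that equivalence, assuming a spark-shell `sc` and `sqlContext` (names and data are made up):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.sum

    val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val viaRdd = pairRdd.reduceByKey(_ + _)                     // PairRDDFunctions
    val viaDf  = pairRdd.toDF("key", "value")
                        .groupBy("key").agg(sum("value"))       // DataFrame equivalent, goes through Catalyst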

From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Friday, November 13, 2015 11:23 AM
To: Stephen Boesch
Cc: dev@spark.apache.org
Subject: Re: A proposal for Spark 2.0

Hmmm... to me, that seems like precisely the kind of thing that argues for 
retaining the RDD API but not as the first thing presented to new Spark 
developers: "Here's how to use groupBy with DataFrames Until the optimizer 
is more fully developed, that won't always get you the best performance that 
can be obtained.  In these particular circumstances, ..., you may want to use 
the low-level RDD API while setting preservesPartitioning to true.  Like 
this"

On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch 
<java...@gmail.com<mailto:java...@gmail.com>> wrote:
My understanding is that the RDDs presently have more support for complete control of partitioning, which is a key consideration at scale. While partitioning control is still piecemeal in DF/DS, it would seem premature to make RDDs a second-tier approach to Spark dev.

An example is the use of groupBy when we know that the source relation (/RDD) is already partitioned on the grouping expressions. AFAIK Spark SQL still does not allow that knowledge to be applied in the optimizer, so a full shuffle will be performed. However, in the native RDD we can use preservesPartitioning=true.
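A small sketch of the RDD-side pattern being described, assuming a spark-shell `sc` (data and partition count are illustrative):

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(12)
    val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Pay the shuffle once, up front, and keep the partitioner.
    val prePartitioned = pairRdd.partitionBy(part).cache()

    // Because the partitioner matches, this groupByKey reuses the existing
    // layout and does not trigger another shuffle.
    val grouped = prePartitioned.groupByKey(part)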

2015-11-12 17:42 GMT-08:00 Mark Hamstra 
<m...@clearstorydata.com<mailto:m...@clearstorydata.com>>:
The place of the RDD API in 2.0 is also something I've been wondering about.  I 
think it may be going too far to deprecate it, but changing emphasis is 
something that we might consider.  The RDD API came well before DataFrames and 
DataSets, so programming guides, introductory how-to articles and the like 
have, to this point, also tended to emphasize RDDs -- or at least to deal with 
them early.  What I'm thinking is that with 2.0 maybe we should overhaul all 
the documentation to de-emphasize and reposition RDDs.  In this scheme, 
DataFrames and DataSets would be introduced and fully addressed before RDDs.  
They would be presented as the normal/default/standard way to do things in 
Spark.  RDDs, in contrast, would be presented later as a kind of lower-level, 
closer-to-the-metal API that can be used in atypical, more specialized contexts 
where DataFrames or DataSets don't fully fit.

On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
I am not sure what the best practice is for this specific problem, but it's really worth thinking about for 2.0, as it is a painful issue for lots of users.

By the way, is this also an opportunity to deprecate the RDD API (or make it internal-only)? Lots of its functionality overlaps with DataFrame or Dataset.

Hao

From: Kostas Sakellis [mailto:kos...@cloudera.com<mailto:kos...@cloudera.com>]
Sent: Friday, November 13, 2015 5:27 AM
To: Nicholas Chammas
Cc: Ulanov, Alexander; Nan Zhu; wi...@qq.com<mailto:wi...@qq.com>; 
dev@spark.apache.org<mailto:dev@spark.apache.org>; Reynold Xin

Subject: Re: A proposal for Spark 2.0

I know we want to keep breaking changes to a minimum but I'm hoping that with 
Spark 2.0 we can also look at better classpath isolation with user programs. I 
propose we build on spark.{driver|executor}.userClassPathFirst, setting it true 
by default, and not allow any spark transitive dependencies to leak into user 
code. For backwards compatibility we can have a whitelist if we want but I'd be 
good if we start requiring user apps to explicitly pull in all their 
dependencies. From what I can tell, Hadoop 3 is also moving in this direction.

Kostas

On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas 
<nicholas.cham...@gmail.com<mailto:nicholas.cham...@gmail.com>> wrote:

With regards to Machine learning, it would be great to move useful features 
from MLlib to ML and deprecate the former. Current structure of two separate 
machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in GraphX 
and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some things in 
2.0 without removing or replacing them immediately. That way 2.0 doesn’t have 
to wait for everything that we want to deprecate to be replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander 
<alexander.ula...@hpe.com<mailto:alexander.ula...@hpe.com>> wrote:
Parameter Server is a new feature and thus does not match 

RE: Sort Merge Join from the filesystem

2015-11-09 Thread Cheng, Hao
Yes, we definitely need to think about how to handle this case; it is probably even more common than the case where both tables are already sorted/partitioned. Can you jump to the JIRA and leave a comment there?

From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
Sent: Tuesday, November 10, 2015 3:03 AM
To: Cheng, Hao
Cc: Reynold Xin; dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

Thanks for creating that ticket.

Another thing I was thinking of, is doing this type of join between dataset A 
which is already partitioned/sorted on disk and dataset B, which gets generated 
during the run of the application.

Dataset B would need something like repartitionAndSortWithinPartitions to be 
performed on it, using the same partitioner that was used with dataset A. Then 
dataset B could be joined with dataset A without needing to write it to disk 
first (unless it's too big to fit in memory, then it would need to be 
[partially] spilled).
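A rough sketch of preparing dataset B that way, assuming a pair RDD and a spark-shell `sc` (the partition count would have to match what dataset A was written with; everything here is illustrative):

    import org.apache.spark.HashPartitioner

    val bRdd = sc.parallelize(Seq((1, "x"), (2, "y"), (1, "z")))

    // Use the same partitioner / partition count that dataset A was laid out with.
    val part = new HashPartitioner(12)
    val bPrepared = bRdd.repartitionAndSortWithinPartitions(part)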

On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Yes, we probably need more changes to the data source API if we want to implement it in a generic way.
BTW, I created the JIRA by copying most of the wording from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512


From: Reynold Xin [mailto:r...@databricks.com<mailto:r...@databricks.com>]
Sent: Thursday, November 5, 2015 1:36 AM
To: Alex Nastetsky
Cc: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Re: Sort Merge Join from the filesystem

It's not supported yet, and not sure if there is a ticket for it. I don't think 
there is anything fundamentally hard here either.


On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky 
<alex.nastet...@vervemobile.com<mailto:alex.nastet...@vervemobile.com>> wrote:
(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file system 
that have already been partitioned the same with the same number of partitions 
and sorted within each partition, without needing to repartition/sort them 
again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does the 
Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be 
joined with dataset B before B even exists, by pre-processing A with a 
partition/sort.

Thanks.




RE: dataframe slow down with tungsten turn on

2015-11-05 Thread Cheng, Hao
How big are the raw data and the result data? Are there any other changes (HDFS, Spark configuration, your own code, etc.) besides the Spark binary? Can you monitor the I/O and CPU state while executing the final stage? It would also be great if you could paste the call stack if you observe high CPU utilization.

And can you try not caching anything and repeating the same steps, just to be sure it's not caused by memory pressure?

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Friday, November 6, 2015 12:18 AM
To: dev@spark.apache.org
Subject: Fwd: dataframe slow down with tungsten turn on


-- Forwarded message --
From: gen tang <gen.tan...@gmail.com<mailto:gen.tan...@gmail.com>>
Date: Fri, Nov 6, 2015 at 12:14 AM
Subject: Re: dataframe slow down with tungsten turn on
To: "Cheng, Hao" <hao.ch...@intel.com<mailto:hao.ch...@intel.com>>

Hi,

My application is as follows:
1. create dataframe from hive table
2. transform dataframe to rdd of json and do some aggregations on json (in 
fact, I use pyspark, so it is rdd of dict)
3. retransform rdd of json to dataframe and cache it (triggered by count)
4. join several dataframe which is created by the above steps.
5. save final dataframe as json.(by dataframe write api)

There are a lot of stages, and the other stages are about the same under the two versions of Spark. However, the final step (save as JSON) is 1 min vs. 2 hours. In my opinion, it is the write to HDFS that causes the slowness of the final stage. However, I don't know why...

In fact, I made a mistake about the versions of Spark that I used. The Spark that runs faster is built from the source code of Spark 1.4.1. The Spark that runs slower is built from the source code of Spark 1.5.2, from 2 days ago.

Any idea? Thanks a lot.

Cheers
Gen


On Thu, Nov 5, 2015 at 1:01 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
BTW, 1 min vs. 2 hours seems quite weird; can you provide more information on the ETL work?

From: Cheng, Hao [mailto:hao.ch...@intel.com<mailto:hao.ch...@intel.com>]
Sent: Thursday, November 5, 2015 12:56 PM
To: gen tang; dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: RE: dataframe slow down with tungsten turn on

1.5.0 has critical performance/bug issues; you'd better try the 1.5.1 or 1.5.2 RC version.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, November 5, 2015 12:43 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Fwd: dataframe slow down with tungsten turn on

Hi,

In fact, I tested the same code on Spark 1.5 with Tungsten turned off. The result is about the same as with Tungsten turned on.
It seems that it is not a Tungsten problem; it is simply that Spark 1.5 is slower than Spark 1.4.

Is there any idea about why it happens?
Thanks a lot in advance

Cheers
Gen


-- Forwarded message --
From: gen tang <gen.tan...@gmail.com<mailto:gen.tan...@gmail.com>>
Date: Wed, Nov 4, 2015 at 3:54 PM
Subject: dataframe slow down with tungsten turn on
To: "u...@spark.apache.org<mailto:u...@spark.apache.org>" 
<u...@spark.apache.org<mailto:u...@spark.apache.org>>
Hi sparkers,

I am using DataFrames to do some large ETL jobs.
More precisely, I create a dataframe from a Hive table and do some operations. And then I save it as JSON.

When I used spark-1.4.1, the whole process was quite fast, about 1 min. However, when I use the same code with spark-1.5.1 (with Tungsten turned on), it takes about 2 hours to finish the same job.

I checked the details of the tasks; almost all the time is consumed by computation.
Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen






RE: Sort Merge Join from the filesystem

2015-11-04 Thread Cheng, Hao
Yes, we probably need more changes to the data source API if we want to implement it in a generic way.
BTW, I created the JIRA by copying most of the wording from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512


From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, November 5, 2015 1:36 AM
To: Alex Nastetsky
Cc: dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

It's not supported yet, and not sure if there is a ticket for it. I don't think 
there is anything fundamentally hard here either.


On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky 
> wrote:
(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file system 
that have already been partitioned the same with the same number of partitions 
and sorted within each partition, without needing to repartition/sort them 
again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does the 
Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be 
joined with dataset B before B even exists, by pre-processing A with a 
partition/sort.

Thanks.



RE: dataframe slow down with tungsten turn on

2015-11-04 Thread Cheng, Hao
BTW, 1 min vs. 2 hours seems quite weird; can you provide more information on the ETL work?

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Thursday, November 5, 2015 12:56 PM
To: gen tang; dev@spark.apache.org
Subject: RE: dataframe slow down with tungsten turn on

1.5.0 has critical performance/bug issues; you'd better try the 1.5.1 or 1.5.2 RC version.

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Thursday, November 5, 2015 12:43 PM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Fwd: dataframe slow down with tungsten turn on

Hi,

In fact, I tested the same code on Spark 1.5 with Tungsten turned off. The result is about the same as with Tungsten turned on.
It seems that it is not a Tungsten problem; it is simply that Spark 1.5 is slower than Spark 1.4.

Is there any idea about why it happens?
Thanks a lot in advance

Cheers
Gen


-- Forwarded message --
From: gen tang <gen.tan...@gmail.com<mailto:gen.tan...@gmail.com>>
Date: Wed, Nov 4, 2015 at 3:54 PM
Subject: dataframe slow down with tungsten turn on
To: "u...@spark.apache.org<mailto:u...@spark.apache.org>" 
<u...@spark.apache.org<mailto:u...@spark.apache.org>>
Hi sparkers,

I am using DataFrames to do some large ETL jobs.
More precisely, I create a dataframe from a Hive table and do some operations. And then I save it as JSON.

When I used spark-1.4.1, the whole process was quite fast, about 1 min. However, when I use the same code with spark-1.5.1 (with Tungsten turned on), it takes about 2 hours to finish the same job.

I checked the details of the tasks; almost all the time is consumed by computation.
Any idea about why this happens?

Thanks a lot in advance for your help.

Cheers
Gen




RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
Probably 2 reasons:

1.  HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was created based on 1.3.

2.  HadoopFsRelation introduces the concept of partitions, which is probably not necessary for LibSVMRelation.

But I think it would be easy to change them to extend HadoopFsRelation.

Hao

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, November 5, 2015 10:31 AM
To: dev@spark.apache.org
Subject: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?


Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend HadoopFsRelation and leverage its features. Is there any other consideration behind that?


--
Best Regards

Jeff Zhang


RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
I think you're right: we do leave room for developers to make mistakes while implementing a new data source.

Here we assume that a new relation MUST NOT extend more than one of the traits CatalystScan, TableScan, PrunedScan, PrunedFilteredScan, etc.; otherwise it causes the problem you described. We can probably add an additional checking/reporting rule for such misuse.
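A sketch of the pattern being warned against (names and schema are illustrative): a relation mixing two scan traits compiles fine, but which buildScan the planner uses then depends on DataSourceStrategy's match order rather than the author's intent.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, PrunedScan, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    class AmbiguousRelation(val sqlContext: SQLContext) extends BaseRelation
        with TableScan with PrunedScan {

      override val schema =
        StructType(Seq(StructField("key", StringType), StructField("value", IntegerType)))

      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.emptyRDD[Row]          // TableScan variant

      override def buildScan(requiredColumns: Array[String]): RDD[Row] =
        sqlContext.sparkContext.emptyRDD[Row]          // PrunedScan variant
    }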


From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Thursday, November 5, 2015 1:55 PM
To: Cheng, Hao
Cc: dev@spark.apache.org
Subject: Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

Thanks Hao. I have already made it extend HadoopFsRelation and it works. Will create a JIRA for that.

Besides that, I noticed that in DataSourceStrategy, Spark builds the physical plan based on the trait of the BaseRelation via pattern matching (e.g. CatalystScan, TableScan, HadoopFsRelation). That means the order matters. I think it is risky because one BaseRelation can't really extend two or more of these traits, and there seems to be nothing that restricts doing so. Maybe these traits need to be cleaned up and reorganized, otherwise users may hit some weird issues when developing a new DataSource.



On Thu, Nov 5, 2015 at 1:16 PM, Cheng, Hao 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
Probably 2 reasons:

1.  HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was created based on 1.3.

2.  HadoopFsRelation introduces the concept of partitions, which is probably not necessary for LibSVMRelation.

But I think it would be easy to change them to extend HadoopFsRelation.

Hao

From: Jeff Zhang [mailto:zjf...@gmail.com<mailto:zjf...@gmail.com>]
Sent: Thursday, November 5, 2015 10:31 AM
To: dev@spark.apache.org<mailto:dev@spark.apache.org>
Subject: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?


Not sure of the reason; it seems LibSVMRelation and CsvRelation could extend HadoopFsRelation and leverage its features. Is there any other consideration behind that?


--
Best Regards

Jeff Zhang



--
Best Regards

Jeff Zhang


RE: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Cheng, Hao
We actually met a similar problem in a real case; see
https://issues.apache.org/jira/browse/SPARK-10474

After checking the source code, the external sorter's memory management strategy
seems to be the root cause of the issue.

Currently, we allocate a 4MB (page size) buffer at the beginning of the sort, and
while processing each input record we can run into the cycle of spill =>
de-allocate buffer => try to allocate a buffer of 2x the size. I know this strategy
is quite flexible in some cases. However, in a data skew case for example, say 2
tasks with a large number of records running on a single executor, the ever-growing
buffer strategy will eventually exhaust the pre-set off-heap memory threshold, and
then an exception is thrown like the one we've seen.

I mean, can we just take a simple memory management strategy for the external
sorter, like:
Step 1) Allocate a fixed-size buffer for the current task (maybe:
MAX_MEMORY_THRESHOLD / (2 * PARALLEL_TASKS_PER_EXECUTOR))
Step 2) for (record in the input) { if (!hasMemoryForRecord(record)) spill(buffer); insert(record) }
Step 3) Deallocate(buffer)

Probably we'd better move the discussion to the JIRA.
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, September 17, 2015 12:28 AM
To: Pete Robbins
Cc: Dev
Subject: Re: Unable to acquire memory errors in HiveCompatibilitySuite

SparkEnv for the driver was created in SparkContext. The default parallelism 
field is set to the number of slots (max number of active tasks). Maybe we can 
just use the default parallelism to compute that in local mode.

On Wednesday, September 16, 2015, Pete Robbins 
> wrote:
so forcing the ShuffleMemoryManager to assume 32 cores and therefore calculate 
a pagesize of 1MB passes the tests.
How can we determine the correct value to use in getPageSize rather than 
Runtime.getRuntime.availableProcessors()?

On 16 September 2015 at 10:17, Pete Robbins 
> 
wrote:
I see what you are saying. Full stack trace:

java.io.IOException: Unable to acquire 4194304 bytes of memory
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:349)
  at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:478)
  at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
  at 

RE: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread Cheng, Hao
Not sure if it's too late, but we found a critical bug:
https://issues.apache.org/jira/browse/SPARK-10466
UnsafeRow ser/de causes an assert error, particularly for sort-based shuffle with
data spill. This is not acceptable, as that is very common in large table joins.

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Saturday, September 5, 2015 3:30 PM
To: Krishna Sankar
Cc: Davies Liu; Yin Huai; Tom Graves; dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

Thanks, Krishna, for the report. We should fix your problem using the Python 
UDFs in 1.6 too.

I'm going to close this vote now. Thanks everybody for voting. This vote passes 
with 8 +1 votes (3 binding) and no 0 or -1 votes.

+1:
Reynold Xin*
Tom Graves*
Burak Yavuz
Michael Armbrust*
Davies Liu
Forest Fang
Krishna Sankar
Denny Lee

0:

-1:


I will work on packaging this release in the next few days.



On Fri, Sep 4, 2015 at 8:08 PM, Krishna Sankar 
> wrote:
Excellent & Thanks Davies. Yep, now runs fine and takes 1/2 the time !
This was exactly why I had put in the elapsed time calculations.
And thanks for the new pyspark.sql.functions.

+1 from my side for 1.5.0 RC3.
Cheers


On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu 
> wrote:
Could you update the notebook to use builtin SQL function month and year,
instead of Python UDF? (they are introduced in 1.5).

Once remove those two udfs, it runs successfully, also much faster.

On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar 
> wrote:
> Yin,
>It is the
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
> 
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai 
> > wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar 
>> >
>> wrote:
>>>
>>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>> 
>>>
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves 
>>> > wrote:

 The upper/lower case thing is known.
 https://issues.apache.org/jira/browse/SPARK-9550
 I assume it was decided to be ok and its going to be in the release
 notes  but Reynold or Josh can probably speak to it more.

 Tom



 On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
 > wrote:


 +?

 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
 2. Tested pyspark, mllib
 2.1. statistics (min,max,mean,Pearson,Spearman) OK
 2.2. Linear/Ridge/Laso Regression OK
 2.3. Decision Tree, Naive Bayes OK
 2.4. KMeans OK
Center And Scale OK
 2.5. RDD operations OK
   State of the Union Texts - MapReduce, Filter,sortByKey (word
 count)
 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
Model evaluation/optimization (rank, numIter, lambda) with
 itertools OK
 3. Scala - MLlib
 3.1. statistics (min,max,mean,Pearson,Spearman) OK
 3.2. LinearRegressionWithSGD OK
 3.3. Decision Tree OK
 3.4. KMeans OK
 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 3.6. saveAsParquetFile OK
 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
 registerTempTable, sql OK
 3.8. result = sqlContext.sql("SELECT
 OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
 JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
 4.0. Spark SQL from Python OK
 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
 OK
 5.0. Packages
 5.1. com.databricks.spark.csv - read/write OK
 (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
 com.databricks:spark-csv_2.11:1.2.0 worked)
 6.0. DataFrames
 6.1. cast,dtypes OK
 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
 6.3. All joins,sql,set operations,udf OK

 Two Problems:

 1. The synthetic column names are lowercase ( i.e. now
 ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’, now ‘avg(Total)’;
 previously 'AVG(Total)'). So programs that depend on the case of the
 synthetic column names would fail.
 2. orders_3.groupBy("Year","Month").sum('Total').show()
 fails with the error ‘java.io.IOException: Unable to acquire 4194304
 bytes of memory’
 orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
 with the same error
 Is this a known bug ?
 Cheers
 

RE: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Cheng, Hao
I've found that https://spark-prs.appspot.com/ is super slow when opening it in a new window recently. Not sure if it's just me or everybody experiences the same; is there any way to speed it up?

From: Josh Rosen [mailto:rosenvi...@gmail.com]
Sent: Friday, August 14, 2015 10:21 AM
To: dev
Subject: Re: Automatically deleting pull request comments left by AmplabJenkins

Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59

On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen 
rosenvi...@gmail.com wrote:
TL;DR: would anyone object if I wrote a script to auto-delete pull request 
comments from AmplabJenkins?

Currently there are two bots which post Jenkins test result comments to GitHub, 
AmplabJenkins and SparkQA.

SparkQA is the account which posts the detailed Jenkins start and finish messages that contain information on which commit is being tested and which tests have failed. This bot is controlled via the dev/run-tests-jenkins script.

AmplabJenkins is controlled by the Jenkins GitHub Pull Request Builder plugin. This bot posts relatively uninformative comments ("Merge build triggered", "Merge build started", "Merge build failed") that do not contain any links or details specific to the tests being run.

It is technically non-trivial to prevent these AmplabJenkins comments from being posted in the first place (see https://issues.apache.org/jira/browse/SPARK-4216).

However, as a short-term hack I'd like to deploy a script which automatically deletes these comments as soon as they're posted, with an exemption carved out for the "Can an admin approve this patch for testing?" messages. This will help to significantly de-clutter pull request discussions in the GitHub UI.

If nobody objects, I'd like to deploy this script sometime in the next few days.

(From a technical perspective, my script uses the GitHub REST API and 
AmplabJenkins' own OAuth token to delete the comments.  The final deployment 
environment will most likely be the backend of http://spark-prs.appspot.com).

- Josh



RE: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Cheng, Hao
OK, thanks, probably just me…

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, August 14, 2015 11:04 AM
To: Cheng, Hao
Cc: Josh Rosen; dev
Subject: Re: Automatically deleting pull request comments left by AmplabJenkins

I tried accessing just now.
It took several seconds before the page showed up.

FYI

On Thu, Aug 13, 2015 at 7:56 PM, Cheng, Hao 
hao.ch...@intel.com wrote:
I've found that https://spark-prs.appspot.com/ is super slow when opening it in a new window recently. Not sure if it's just me or everybody experiences the same; is there any way to speed it up?

From: Josh Rosen [mailto:rosenvi...@gmail.com]
Sent: Friday, August 14, 2015 10:21 AM
To: dev
Subject: Re: Automatically deleting pull request comments left by AmplabJenkins

Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59

On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen 
rosenvi...@gmail.com wrote:
TL;DR: would anyone object if I wrote a script to auto-delete pull request 
comments from AmplabJenkins?

Currently there are two bots which post Jenkins test result comments to GitHub, 
AmplabJenkins and SparkQA.

SparkQA is the account which posts the detailed Jenkins start and finish messages that contain information on which commit is being tested and which tests have failed. This bot is controlled via the dev/run-tests-jenkins script.

AmplabJenkins is controlled by the Jenkins GitHub Pull Request Builder plugin. This bot posts relatively uninformative comments ("Merge build triggered", "Merge build started", "Merge build failed") that do not contain any links or details specific to the tests being run.

It is technically non-trivial to prevent these AmplabJenkins comments from being posted in the first place (see https://issues.apache.org/jira/browse/SPARK-4216).

However, as a short-term hack I'd like to deploy a script which automatically deletes these comments as soon as they're posted, with an exemption carved out for the "Can an admin approve this patch for testing?" messages. This will help to significantly de-clutter pull request discussions in the GitHub UI.

If nobody objects, I'd like to deploy this script sometime in the next few days.

(From a technical perspective, my script uses the GitHub REST API and 
AmplabJenkins' own OAuth token to delete the comments.  The final deployment 
environment will most likely be the backend of http://spark-prs.appspot.com).

- Josh




RE: Potential bug broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread Cheng, Hao
Firstly, spark.sql.autoBroadcastJoinThreshold only works for equi-joins.

Currently, for a non-equi join, if the join type is INNER then it will be done by a CartesianProduct join, and BroadcastNestedLoopJoin handles the outer joins.

In BroadcastNestedLoopJoin, the table with the smaller estimated size will be broadcast, but if the smaller table is also a huge table, I don't think Spark SQL can handle that right now (OOM).

So, I am not sure how you created the df1 instance, but we'd better reflect its real size in its statistics and let the framework decide what to do. Hopefully Spark SQL can support non-equi joins for large tables in the next release.
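A quick way to see which physical join was chosen, using the df1/df2 described here (the join condition is illustrative):

    // Non-equi outer join; explain() shows whether BroadcastNestedLoopJoin
    // or CartesianProduct was planned, and which side would be broadcast.
    val joined = df1.join(df2, df1("a") > df2("b"), "left_outer")
    joined.explain()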

Hao

From: gen tang [mailto:gen.tan...@gmail.com]
Sent: Tuesday, August 11, 2015 10:12 PM
To: dev@spark.apache.org
Subject: Potential bug broadcastNestedLoopJoin or default value of 
spark.sql.autoBroadcastJoinThreshold

Hi,

Recently, I used Spark SQL to do a join on a non-equality condition, for example "condition1 or condition2".

Spark will use BroadcastNestedLoopJoin to do this. Assume that one of the dataframes (df1) is created neither from a Hive table nor from a local collection, and the other one is created from a Hive table (df2). For df1, Spark will use defaultSizeInBytes * length of df1 to estimate its size, and will use the correct size for df2.

As a result, in most cases Spark will think df1 is bigger than df2 even when df2 is really huge. So Spark will do df2.collect(), which will cause an error or slow down the program.

Maybe we should just use defaultSizeInBytes for LogicalRDD, not defaultSizeInBytes * length?

Hope this could be helpful
Thanks a lot in advance for your help and input.

Cheers
Gen



RE: [SparkSQL ] What is Exchange in physical plan for ?

2015-06-08 Thread Cheng, Hao
It means the data shuffling, and its arguments also show the partitioning 
strategy.

-Original Message-
From: invkrh [mailto:inv...@gmail.com] 
Sent: Monday, June 8, 2015 9:34 PM
To: dev@spark.apache.org
Subject: [SparkSQL ] What is Exchange in physical plan for ?

Hi,

DataFrame.explain() shows the physical plan of a query. I noticed there are a 
lot of `Exchange`s in it, like below:

Project [period#20L,categoryName#0,regionName#10,action#15,list_id#16L]
 ShuffledHashJoin [region#18], [regionCode#9], BuildRight
  Exchange (HashPartitioning [region#18], 12)
   Project [categoryName#0,list_id#16L,period#20L,action#15,region#18]
ShuffledHashJoin [refCategoryID#3], [category#17], BuildRight
 Exchange (HashPartitioning [refCategoryID#3], 12)
  Project [categoryName#0,refCategoryID#3]
   PhysicalRDD
[categoryName#0,familyName#1,parentRefCategoryID#2,refCategoryID#3],
MapPartitionsRDD[5] at mapPartitions at SQLContext.scala:439
 Exchange (HashPartitioning [category#17], 12)
  Project [timestamp_sec#13L AS
period#20L,category#17,region#18,action#15,list_id#16L]
   PhysicalRDD
[syslog#12,timestamp_sec#13L,timestamp_usec#14,action#15,list_id#16L,category#17,region#18,expiration_time#19],
MapPartitionsRDD[16] at map at SQLContext.scala:394
  Exchange (HashPartitioning [regionCode#9], 12)
   Project [regionName#10,regionCode#9]
PhysicalRDD
[cityName#4,countryCode#5,countryName#6,dptCode#7,dptName#8,regionCode#9,regionName#10,zipCode#11],
MapPartitionsRDD[11] at mapPartitions at SQLContext.scala:439

I find also its class:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala.

So what does it mean ? 

Thank you.

Hao.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-What-is-Exchange-in-physical-plan-for-tp12659.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional 
commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-25 Thread Cheng, Hao
Adding another blocker issue, just created! It seems to be a regression.

https://issues.apache.org/jira/browse/SPARK-7853


-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Monday, May 25, 2015 3:37 PM
To: Patrick Wendell
Cc: dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

We still have 1 blocker for 1.4:

SPARK-6784 Make sure values of partitioning columns are correctly converted 
based on their data types

CC Davies Liu / Adrian Wang to check on the status of this.

There are still 50 Critical issues tagged for 1.4, and 183 issues targeted for 
1.4 in general. Obviously almost all of those won't be in 1.4. How do people 
want to deal with those? The field can be cleared, but do people want to take a 
pass at bumping a few to 1.4.1 that really truly are supposed to go into 1.4.1?


On Sun, May 24, 2015 at 8:22 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.4.0!

 The tag to be voted on is v1.4.0-rc2 (commit 03fb26a3):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=03fb26a
 3e50e00739cc815ba4e2e82d71d003168

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 [published as version: 1.4.0]
 https://repository.apache.org/content/repositories/orgapachespark-1103
 /
 [published as version: 1.4.0-rc2]
 https://repository.apache.org/content/repositories/orgapachespark-1104
 /

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-docs
 /

 Please vote on releasing this package as Apache Spark 1.4.0!

 The vote is open until Wednesday, May 27, at 08:12 UTC and passes if a 
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not 
 release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 == What has changed since RC1 ==
 Below is a list of bug fixes that went into this RC:
 http://s.apache.org/U1M

 == How can I help test this release? == If you are a Spark user, you 
 can help us test this release by taking a Spark 1.3 workload and 
 running on this release candidate, then reporting any regressions.

 == What justifies a -1 vote for this release? == This vote is 
 happening towards the end of the 1.4 QA period, so -1 votes should 
 only occur for significant regressions from 1.3.1.
 Bugs already present in 1.3.X, minor regressions, or bugs related to 
 new features will not block this release.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For 
 additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional 
commands, e-mail: dev-h...@spark.apache.org



RE: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Cheng, Hao
Thanks for reporting this.

We intend to support multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, but you're probably hitting a bug; please file a JIRA issue for this.

I will keep investigating this as well.
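For reference, the configuration being referred to, as it is intended to work (paths are illustrative; 0.12.0 corresponds to Cheolsoo's metastore version), typically placed in spark-defaults.conf or passed via --conf:

    spark.sql.hive.metastore.version   0.12.0
    spark.sql.hive.metastore.jars      /path/to/hive-0.12/lib/*:/path/to/hadoop/client/lib/*

The point of IsolatedClientLoader is that the listed jars are loaded in an isolated classloader used only for metastore access.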

Hao


From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Sunday, May 24, 2015 9:06 PM
To: Cheolsoo Park
Cc: u...@spark.apache.org; dev@spark.apache.org
Subject: Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

This discussion belongs on the dev list.  Please post any replies there.

On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park 
piaozhe...@gmail.com wrote:
Hi,

I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to confirm 
whether these are bugs or not before opening a jira.

1) I can no longer compile SparkSQL with -Phive-0.12.0. I noticed that in 1.4, 
IsolatedClientLoader is introduced, and different versions of Hive metastore 
jars can be loaded at runtime. But instead, SparkSQL no longer compiles with 
Hive 0.12.0.

My question is, is this intended? If so, shouldn't the hive-0.12.0 profile in 
POM be removed?

2) After compiling SparkSQL with -Phive-0.13.1, I ran into my 2nd problem. 
Since I have Hive 0.12 metastore in production, I have to use it for now. But 
even if I set spark.sql.hive.metastore.version and 
spark.sql.hive.metastore.jars, SparkSQL cli throws an error as follows-

15/05/24 05:03:29 WARN RetryingMetaStoreClient: MetaStoreClient lost 
connection. Attempting to reconnect.
org.apache.thrift.TApplicationException: Invalid method name: 'get_functions'
at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_functions(ThriftHiveMetastore.java:2886)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_functions(ThriftHiveMetastore.java:2872)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunctions(HiveMetaStoreClient.java:1727)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy12.getFunctions(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getFunctions(Hive.java:2670)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:674)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionNames(FunctionRegistry.java:662)
at org.apache.hadoop.hive.cli.CliDriver.getCommandCompletor(CliDriver.java:540)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:175)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)

What's happening is that when SparkSQL Cli starts up, it tries to fetch 
permanent udfs from Hive metastore (due to 
HIVE-6330, https://issues.apache.org/jira/browse/HIVE-6330, which was 
introduced in Hive 0.13). But then, it ends up invoking an incompatible thrift 
function that doesn't exist in Hive 0.12. To work around this error, I have to 
comment out the following line of code for now-
https://goo.gl/wcfnH1

My question is, is SparkSQL that is compiled against Hive 0.13 supposed to work 
with Hive 0.12 metastore (by setting spark.sql.hive.metastore.version and 
spark.sql.hive.metastore.jars)? It only works if I comment out the above line 
of code.

Thanks,
Cheolsoo



RE: Does Spark SQL (JDBC) support nest select with current version

2015-05-15 Thread Cheng, Hao
Spark SQL just loads the query result as a new source (via JDBC), so do not confuse it with Spark SQL tables; they are totally independent database systems.
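For what it's worth, once the JDBC query result is loaded it behaves like any other DataFrame, so (using the dataFrame and sqlContext from the code below; the follow-up query is illustrative):

    dataFrame.registerTempTable("emp")
    sqlContext.sql("SELECT employeeName, locationName FROM emp").show()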

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 1:59 PM
To: Cheng, Hao; Dev
Subject: Re: Does Spark SQL (JDBC) support nest select with current version

@Hao,
Because the query joins more than one table, if I register the data frame as a temp table, Spark can't distinguish which table is the right one. I don't know how to set dbtable and register the temp table.

Any suggestion?


On Friday, May 15, 2015 1:38 PM, Cheng, Hao 
hao.ch...@intel.com wrote:

You need to register the “dataFrame” as a table first and then do queries on 
it? Do you mean that also failed?

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 1:10 PM
To: Yi Zhang; Dev
Subject: Re: Does Spark SQL (JDBC) support nest select with current version

If I pass the whole statement as dbtable to sqlContext.load() method as below:

val query =
  """(select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory  t2._max_price) EMP""".stripMargin
val dataFrame = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> query
))

It works. However, I can't invoke sql() method to solve this problem. And why?



On Friday, May 15, 2015 11:33 AM, Yi Zhang 
zhangy...@yahoo.com.INVALID wrote:

The sql statement is like this:
select t1._salory as salory,
t1._name as employeeName,
(select _name from mock_locations t3 where t3._id = t1._location_id ) as 
locationName
from mock_employees t1
inner join mock_locations t2
on t1._location_id = t2._id
where t1._salory  t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in predicates (ASF JIRA, https://issues.apache.org/jira/browse/SPARK-4226) is still in progress. And somebody commented that Spark 1.3 would support it. So I don't know the current status of this feature. Thanks.

Regards,
Yi






















RE: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Cheng, Hao
You need to register the “dataFrame” as a table first and then do queries on 
it? Do you mean that also failed?

From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID]
Sent: Friday, May 15, 2015 1:10 PM
To: Yi Zhang; Dev
Subject: Re: Does Spark SQL (JDBC) support nest select with current version

If I pass the whole statement as dbtable to sqlContext.load() method as below:

val query =
  """(select t1._salory as salory,
    |t1._name as employeeName,
    |(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
    |from mock_employees t1
    |inner join mock_locations t2
    |on t1._location_id = t2._id
    |where t1._salory  t2._max_price) EMP""".stripMargin
val dataFrame = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> "com.mysql.jdbc.Driver",
  "dbtable" -> query
))

It works. However, I can't invoke the sql() method to solve this problem. Why?



On Friday, May 15, 2015 11:33 AM, Yi Zhang 
zhangy...@yahoo.com.INVALID wrote:

The sql statement is like this:
select t1._salory as salory,
t1._name as employeeName,
(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName
from mock_employees t1
inner join mock_locations t2
on t1._location_id = t2._id
where t1._salory > t2._max_price

I noticed the issue [SPARK-4226] SparkSQL - Add support for subqueries in
predicates - ASF JIRA (https://issues.apache.org/jira/browse/SPARK-4226) is
still in progress. And somebody commented that Spark 1.3 would support
it. So I don't know the current status of this feature. Thanks.

Regards,
Yi


RE: Add Char support in SQL dataTypes

2015-03-19 Thread Cheng, Hao
Can you use Varchar or String instead? Currently, Spark SQL converts
varchar into string type internally (without a max length limitation).
However, the char type is not supported yet.
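
A minimal sketch of that workaround, with the single-character field simply modeled as a String (the class name and field values below are examples only):

// Hedged sketch: replace Char with String so ScalaReflection.schemaFor can derive a schema.
case class PrimitiveDataNoChar(
  charField: String,   // was Char; store "a" instead of 'a'
  intField: Int,
  longField: Long,
  doubleField: Double,
  floatField: Float,
  shortField: Short,
  byteField: Byte,
  booleanField: Boolean)

// e.g. PrimitiveDataNoChar("a", 1, 1L, 1.0, 1.0f, 1, 1, true) can now be put into an
// RDD and registered as a temp table the usual way.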

-Original Message-
From: A.M.Chan [mailto:kaka_1...@163.com] 
Sent: Friday, March 20, 2015 9:56 AM
To: spark-dev
Subject: Add Char support in SQL dataTypes

case class PrimitiveData(
  charField: Char, // Can't get the char schema info
  intField: Int,
  longField: Long,
  doubleField: Double,
  floatField: Float,
  shortField: Short,
  byteField: Byte,
  booleanField: Boolean)

I can't get the schema from case class PrimitiveData.
An error occurred while I used schemaFor[PrimitiveData]:
scala.MatchError: Char (of class scala.reflect.internal.Types$TypeRef$$anon$6)
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:112)





--

kaka1992

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Cheng, Hao
I am not so sure whether Hive supports changing the metastore after it is initialized; I
guess not. Spark SQL relies totally on the Hive Metastore in HiveContext, which is probably
why it doesn't work as expected for Q1.

BTW, in most cases people configure the metastore settings in
hive-site.xml and do not change them afterwards; is there any reason that
you want to change them at runtime?

For Q2, there is probably something wrong in the configuration; it seems HDFS ran in
pseudo/single-node mode. Can you double-check that? Or can you run the DDL
(like creating a table) from the spark shell with HiveContext?
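
For reference, a minimal sketch of that check from the shell (the table name below is an example only):

// Hedged sketch: run a simple DDL through HiveContext and then check whether the
// table directory appears under the configured hive.metastore.warehouse.dir on HDFS.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the SparkContext already available in the shell

hiveContext.sql("CREATE TABLE IF NOT EXISTS ddl_check_tbl (key INT, value STRING)")
hiveContext.sql("SHOW TABLES").collect().foreach(println)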

From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Tuesday, March 10, 2015 6:38 PM
To: user; dev@spark.apache.org
Subject: [SparkSQL] Reuse HiveContext to different Hive warehouse?


I'm using Spark 1.3.0 RC3 build with Hive support.



In the Spark shell, I want to reuse the HiveContext instance with different warehouse
locations. Below are the steps for my test (assume I have loaded a file into
table src).



==

15/03/10 18:22:59 INFO SparkILoop: Created sql context (with Hive support)..

SQL context available as sqlContext.

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table1")

scala> sqlContext.sql("SET hive.metastore.warehouse.dir=/test/w2")

scala> sqlContext.sql("SELECT * from src").saveAsTable("table2")

==

After these steps, the tables are stored in /test/w only. I expect table2
to be stored in the /test/w2 folder.



Another question: if I set hive.metastore.warehouse.dir to an HDFS folder,
I cannot use saveAsTable()? Is this by design? The exception stack trace is below:

==

15/03/10 18:35:28 INFO BlockManagerMaster: Updated info of block 
broadcast_0_piece0

15/03/10 18:35:28 INFO SparkContext: Created broadcast 0 from broadcast at 
TableReader.scala:74

java.lang.IllegalArgumentException: Wrong FS:
hdfs://server:8020/space/warehouse/table2, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)

at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:463)

at 
org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:118)

at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)

at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

at scala.collection.immutable.List.foreach(List.scala:318)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.AbstractTraversable.map(Traversable.scala:105)

at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)

at 
org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)

at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)

at 
org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)

at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)

at 
org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:217)

at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)

at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)

at 
org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)

at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)

at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1048)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:998)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:964)

at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:942)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)

at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)

at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)

at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)

at $iwC$$iwC$$iwC.<init>(<console>:33)

at $iwC$$iwC.<init>(<console>:35)

at $iwC.<init>(<console>:37)

at <init>(<console>:39)



Thank you very much!




RE: Join implementation in SparkSQL

2015-01-15 Thread Cheng, Hao
Not so sure about your question, but SparkStrategies.scala and
Optimizer.scala are a good place to start if you want the details of the join
implementation or optimization.
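
For a quick look at what those strategies pick for a concrete query, printing the physical plan is usually enough. A minimal sketch, assuming two already-registered temp tables named a and b that both have an id column:

// Hedged sketch: inspect the physical plan to see which join operator SparkStrategies
// selected and where the filters ended up (pushed below or into the join).
val joined = sqlContext.sql(
  "SELECT a.id, b.value FROM a JOIN b ON a.id = b.id WHERE b.value > 10")

println(joined.queryExecution.executedPlan)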

-Original Message-
From: Andrew Ash [mailto:and...@andrewash.com] 
Sent: Friday, January 16, 2015 4:52 AM
To: Reynold Xin
Cc: Alessandro Baretta; dev@spark.apache.org
Subject: Re: Join implementation in SparkSQL

What Reynold is describing is a performance optimization in implementation, but 
the semantics of the join (cartesian product plus relational algebra
filter) should be the same and produce the same results.

On Thu, Jan 15, 2015 at 1:36 PM, Reynold Xin r...@databricks.com wrote:

 It's a bunch of strategies defined here:

 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

 In most common use cases (e.g. inner equi join), filters are pushed 
 below the join or into the join. Doing a cartesian product followed by 
 a filter is too expensive.


 On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta 
 alexbare...@gmail.com
 
 wrote:

  Hello,
 
  Where can I find docs about how joins are implemented in SparkSQL? 
  In particular, I'd like to know whether they are implemented 
  according to their relational algebra definition as filters on top 
  of a cartesian product.
 
  Thanks,
 
  Alex
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a more friendly API, rather than a configuration,
for this purpose. What do you think, Patrick?

Cheng Hao

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Thursday, December 25, 2014 3:22 PM
To: Shao, Saisai
Cc: u...@spark.apache.org; dev@spark.apache.org
Subject: Re: Question on saveAsTextFile with overwrite option

Is it sufficient to set spark.hadoop.validateOutputSpecs to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick
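
A minimal sketch of the configuration-based approach (the output path and app name are placeholders; note that skipping the check only suppresses the "path already exists" failure and may leave old part files behind):

// Hedged sketch: disable Hadoop's output-spec validation so saveAsTextFile can write
// to a path that already exists.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("overwrite-output-sketch")
  .set("spark.hadoop.validateOutputSpecs", "false")
val sc = new SparkContext(conf)

sc.parallelize(Seq("a", "b", "c"))
  .saveAsTextFile("hdfs:///tmp/overwrite-output-sketch")  // placeholder path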

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi,



 We have a requirement to save RDD output to HDFS with a
 saveAsTextFile-like API, but we need to overwrite the data if it exists.
 I'm not sure whether current Spark supports such an operation, or whether I need to
 check this manually?



 There's a thread in the mailing list that discussed this
 (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html);
 I'm not sure whether this feature is enabled or not, or whether it needs some configuration?



 Appreciate your suggestions.



 Thanks a lot

 Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Cheng, Hao
Part of it can be found at:
https://github.com/apache/spark/pull/3429/files#diff-f88c3e731fcb17b1323b778807c35b38R34
 
Sorry, it's a PR still to be reviewed, but it should still be informative.

Cheng Hao
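
For what it's worth, a minimal sketch of the workaround Alex mentions below, passing precisionInfo as None for an unlimited-precision decimal (field names are examples only):

// Hedged sketch: building a schema that uses the Scala-side DecimalType with
// precisionInfo = None, as described below.
import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("amount", DecimalType(None), nullable = true)))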

-Original Message-
From: Alessandro Baretta [mailto:alexbare...@gmail.com] 
Sent: Friday, December 12, 2014 6:37 AM
To: Michael Armbrust; dev@spark.apache.org
Subject: Where are the docs for the SparkSQL DataTypes?

Michael  other Spark SQL junkies,

As I read through the Spark API docs, in particular those for the 
org.apache.spark.sql package, I can't seem to find details about the Scala 
classes representing the various SparkSQL DataTypes, for instance DecimalType. 
I find DataType classes in org.apache.spark.sql.api.java, but they don't seem 
to match the similarly named scala classes. For instance, DecimalType is 
documented as having a nullary constructor, but if I try to construct an 
instance of org.apache.spark.sql.DecimalType without any parameters, the 
compiler complains about the lack of a precisionInfo field, which I have 
discovered can be passed in as None. Where is all this stuff documented?

Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-06 Thread Cheng, Hao
I've created (reused) the PR https://github.com/apache/spark/pull/3336;
hopefully we can fix this regression there.

Thanks for the reporting.

Cheng Hao

-Original Message-
From: Michael Armbrust [mailto:mich...@databricks.com] 
Sent: Saturday, December 6, 2014 4:51 AM
To: kb
Cc: d...@spark.incubator.apache.org; Cheng Hao
Subject: Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

Thanks for reporting.  This looks like a regression related to:
https://github.com/apache/spark/pull/2570

I've filed it here: https://issues.apache.org/jira/browse/SPARK-4769

On Fri, Dec 5, 2014 at 12:03 PM, kb kend...@hotmail.com wrote:

 I am having trouble getting create table as select or saveAsTable 
 from a hiveContext to work with temp tables in spark 1.2.  No issues 
 in 1.1.0 or
 1.1.1

 Simple modification to test case in the hive SQLQuerySuite.scala:

 test("double nested data") {
   sparkContext.parallelize(Nested1(Nested2(Nested3(1))) ::
     Nil).registerTempTable("nested")
   checkAnswer(
     sql("SELECT f1.f2.f3 FROM nested"),
     1)
   checkAnswer(sql("CREATE TABLE test_ctas_1234 AS SELECT * from nested"),
     Seq.empty[Row])
   checkAnswer(
     sql("SELECT * FROM test_ctas_1234"),
     sql("SELECT * FROM nested").collect().toSeq)
 }


 output:

 11:57:15.974 ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer:
 org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:45 Table not 
 found 'nested'
 at

 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243)
 at

 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1192)
 at

 org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9209)
 at

 org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70)
 at

 org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89)
 at

 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:105)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:103)
 at

 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply$mcV$sp(SQLQuerySuite.scala:122)
 at

 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
 at

 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
 at

 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 at org.scalatest.Transformer.apply(Transformer.scala:22)
 at org.scalatest.Transformer.apply(Transformer.scala:20)
 at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
 at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
 at

 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 at
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 at
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
 at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
 at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
 at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
 at

 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 at

 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
 at

 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
 at

 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
 at
 org.scalatest.SuperEngine.org
 $scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
 at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
 at
 org.scalatest.FunSuiteLike

RE: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Cheng, Hao
+1, that will definitely speed up the PR reviewing / merging.

-Original Message-
From: Cheng Lian [mailto:lian.cs@gmail.com] 
Sent: Thursday, November 6, 2014 12:46 PM
To: dev
Subject: Re: [VOTE] Designating maintainers for some Spark components

+1 since this is already the de facto model we are using.

On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wangf...@huawei.com wrote:

 +1

 Sent from my iPhone

  On Nov 5, 2014, at 20:06, Denny Lee denny.g@gmail.com wrote:
 
  +1 great idea.
  On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote:
 
  +1 (binding)
 
  On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra 
  m...@clearstorydata.com
  wrote:
  +1 (binding)
 
  On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com
  wrote:
 
  +1 on this proposal.
 
  On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com
 wrote:
 
   Will these maintainers have a cleanup for those pending PRs once we start
   to apply this model?
 
 
  I second Nan's question. I would like to see this initiative 
  drive a reduction in the number of stale PRs we have out there. 
  We're
  approaching
  300 open PRs again.
 
  Nick
 
  ---
  -- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For 
  additional commands, e-mail: dev-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For 
 additional commands, e-mail: dev-h...@spark.apache.org




RE: Build with Hive 0.13.1 doesn't have datanucleus and parquet dependencies.

2014-10-27 Thread Cheng, Hao
The hive-thriftserver module is not included when specifying the profile
hive-0.13.1.

-Original Message-
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] 
Sent: Monday, October 27, 2014 4:48 PM
To: dev@spark.apache.org
Subject: Build with Hive 0.13.1 doesn't have datanucleus and parquet 
dependencies.

There's been a change in the build process lately for Hive 0.13 support, and we should
make it obvious. Based on the new pom.xml I tried to enable Hive
0.13.1 support by using the option

  -Phive-0.13.1

However, it seems datanucleus and parquet dependencies are not available in the 
final build.

Am I missing anything?

Jianshi

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-31 Thread Cheng, Hao
Yes, the root cause is that the output ObjectInspector in the SerDe
implementation doesn't reflect the real TypeInfo.

Hive actually provides an API,
TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(TypeInfo), for this
mapping.

You probably need to update the code at 
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L60.
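
A rough sketch of that mapping (written in Scala here, though the actual CSVSerde is Java; the "columns" / "columns.types" keys are the standard table properties Hive passes to SerDe.initialize()):

// Hedged sketch: derive the object inspectors from the declared column types instead
// of hard-coding javaStringObjectInspector for every field.
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils

def rowInspectorFor(tbl: Properties) = {
  val columnNames = tbl.getProperty("columns").split(",").toList
  val columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(tbl.getProperty("columns.types"))

  // Map each declared TypeInfo to its standard Java object inspector, so the inspectors
  // match the types recorded in the table's StorageDescriptor.
  val fieldInspectors = columnTypes.asScala.map { ti =>
    TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(ti)
  }

  ObjectInspectorFactory.getStandardStructObjectInspector(columnNames.asJava, fieldInspectors.asJava)
}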

-Original Message-
From: chutium [mailto:teng@gmail.com] 
Sent: Monday, September 01, 2014 2:58 AM
To: d...@spark.incubator.apache.org
Subject: Re: HiveContext, schemaRDD.printSchema get different dataTypes, 
feature or a bug? really strange and surprised...

Hi Cheng, thank you very much for helping me to finally find out the secret of 
this magic...

actually we defined this external table with
SID STRING
REQUEST_ID STRING
TIMES_DQ TIMESTAMP
TOTAL_PRICE FLOAT
...

using desc table ext_fullorders it is only shown as
[# col_name data_type   comment ]
...
[times_dq   string  from deserializer   ]
[total_pricestring  from deserializer   ]
...
because, as you said, CSVSerde sets all field object inspectors to 
javaStringObjectInspector and therefore there are comments from deserializer

but in the StorageDescriptor are the real user-defined types; using desc extended
table ext_fullorders we can see its sd:StorageDescriptor
is:
FieldSchema(name:times_dq, type:timestamp, comment:null), 
FieldSchema(name:total_price, type:float, comment:null)

and Spark HiveContext reads the schema info from this StorageDescriptor
https://github.com/apache/spark/blob/7e191fe29bb09a8560cd75d453c4f7f662dff406/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L316

so, in the SchemaRDD, the fields in Row were filled with strings (via
fillObject; all of the values were retrieved from CSVSerDe with
javaStringObjectInspector)

but Spark considers that some of them are float or timestamp (the schema info was
taken from sd:StorageDescriptor)

crazy...

and sorry for the update on the weekend...

a little more about how I found this problem and why it is a problem for us.

we use the new Spark Thrift server; querying normal managed Hive tables
works fine

but when we try to access external tables with a custom SerDe such as this
CSVSerDe, we get a ClassCastException, such as:
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float

the reason is
https://github.com/apache/spark/blob/d94a44d7caaf3fe7559d9ad7b10872fa16cf81ca/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala#L104-L105

here Spark's Thrift server tries to get a float value from the Spark Row, because in
the schema info (sd:StorageDescriptor) this column is a float, but actually in
the Spark Row this field was filled with a string value...



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8157.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional 
commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: [sql]enable spark sql cli support spark sql

2014-08-15 Thread Cheng, Hao
If so, we probably need to add SQL dialect switching support to
SparkSQLCLI, as Fei suggested. What do you think the priority for this should be?

-Original Message-
From: Cheng Lian [mailto:lian.cs@gmail.com] 
Sent: Friday, August 15, 2014 1:57 PM
To: Cheng, Hao
Cc: scwf; dev@spark.apache.org
Subject: Re: [sql]enable spark sql cli support spark sql

In the long run, as Michael suggested in his Spark Summit 14 talk, we'd like to 
implement SQL-92, maybe with the help of Optiq.

On Aug 15, 2014, at 1:13 PM, Cheng, Hao hao.ch...@intel.com wrote:

 Actually the SQL Parser (another SQL dialect in SparkSQL) is quite weak and
 only supports some basic queries; not sure what the plan is for its enhancement.
 
 -Original Message-
 From: scwf [mailto:wangf...@huawei.com]
 Sent: Friday, August 15, 2014 11:22 AM
 To: dev@spark.apache.org
 Subject: [sql]enable spark sql cli support spark sql
 
 hi all,
  now the Spark SQL CLI only supports Spark HQL; I think we can enable this CLI to
  support Spark SQL. Do you think it's necessary?
 
 --
 
 Best Regards
 Fei Wang
 
 --
 --
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For 
 additional commands, e-mail: dev-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For 
 additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: [sql]enable spark sql cli support spark sql

2014-08-14 Thread Cheng, Hao
Actually the SQL Parser (another SQL dialect in SparkSQL) is quite weak and
only supports some basic queries; I'm not sure what the plan is for its enhancement.

-Original Message-
From: scwf [mailto:wangf...@huawei.com] 
Sent: Friday, August 15, 2014 11:22 AM
To: dev@spark.apache.org
Subject: [sql]enable spark sql cli support spark sql

hi all,
   now the Spark SQL CLI only supports Spark HQL; I think we can enable this CLI to
support Spark SQL. Do you think it's necessary?

-- 

Best Regards
Fei Wang





-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional 
commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org