Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Sameer Agarwal
Congratulations Zhenhua -- well deserved!

On 1 April 2018 at 22:36, sujith chacko  wrote:

> Congratulations Zhenhua on this great achievement.
>
> On Mon, 2 Apr 2018 at 11:05 AM, Denny Lee  wrote:
>
>> Awesome - congrats Zhenhua!
>>
>> On Sun, Apr 1, 2018 at 10:33 PM 叶先进  wrote:
>>
>>> Big congs.
>>>
>>> > On Apr 2, 2018, at 1:28 PM, Wenchen Fan  wrote:
>>> >
>>> > Hi all,
>>> >
>>> > The Spark PMC recently added Zhenhua Wang as a committer on the
>>> project. Zhenhua is the major contributor to the CBO project and has been
>>> contributing across several areas of Spark for a while, focusing especially
>>> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>>> >
>>> > Wenchen
>>>


-- 
Sameer Agarwal
Computer Science | UC Berkeley
http://cs.berkeley.edu/~sameerag


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Kazuaki Ishizaki
Congratulations to Zhenhua!

Kazuaki Ishizaki



From:   sujith chacko 
To: Denny Lee 
Cc: Spark dev list , Wenchen Fan , "叶先进" 
Date:   2018/04/02 14:37
Subject:Re: Welcome Zhenhua Wang as a Spark committer



Congratulations Zhenhua on this great achievement.

On Mon, 2 Apr 2018 at 11:05 AM, Denny Lee  wrote:
Awesome - congrats Zhenhua! 

On Sun, Apr 1, 2018 at 10:33 PM 叶先进  wrote:
Big congs.

> On Apr 2, 2018, at 1:28 PM, Wenchen Fan  wrote:
>
> Hi all,
>
> The Spark PMC recently added Zhenhua Wang as a committer on the project. 
Zhenhua is the major contributor to the CBO project and has been
contributing across several areas of Spark for a while, focusing
especially on the analyzer and optimizer in Spark SQL. Please join me in
welcoming Zhenhua!
>
> Wenchen







Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread sujith chacko
Congratulations Zhenhua on this great achievement.

On Mon, 2 Apr 2018 at 11:05 AM, Denny Lee  wrote:

> Awesome - congrats Zhenhua!
>
> On Sun, Apr 1, 2018 at 10:33 PM 叶先进  wrote:
>
>> Big congs.
>>
>> > On Apr 2, 2018, at 1:28 PM, Wenchen Fan  wrote:
>> >
>> > Hi all,
>> >
>> > The Spark PMC recently added Zhenhua Wang as a committer on the
>> project. Zhenhua is the major contributor to the CBO project and has been
>> contributing across several areas of Spark for a while, focusing especially
>> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>> >
>> > Wenchen
>>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Denny Lee
Awesome - congrats Zhenhua!

On Sun, Apr 1, 2018 at 10:33 PM 叶先进  wrote:

> Big congs.
>
> > On Apr 2, 2018, at 1:28 PM, Wenchen Fan  wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently added Zhenhua Wang as a committer on the project.
> Zhenhua is the major contributor to the CBO project and has been
> contributing across several areas of Spark for a while, focusing especially
> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
> >
> > Wenchen
>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread zhenya Sun
Congratulations!

> On Apr 2, 2018, at 1:30 PM, Hyukjin Kwon wrote:
> 
> Congratulations, Zhenhua Wang! Very well deserved.
> 
> 2018-04-02 13:28 GMT+08:00 Wenchen Fan :
> Hi all,
> 
> The Spark PMC recently added Zhenhua Wang as a committer on the project. 
> Zhenhua is the major contributor to the CBO project and has been
> contributing across several areas of Spark for a while, focusing especially
> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
> 
> Wenchen
> 



Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread 叶先进
Big congs.

> On Apr 2, 2018, at 1:28 PM, Wenchen Fan  wrote:
> 
> Hi all,
> 
> The Spark PMC recently added Zhenhua Wang as a committer on the project. 
> Zhenhua is the major contributor to the CBO project and has been
> contributing across several areas of Spark for a while, focusing especially
> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
> 
> Wenchen





Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Hyukjin Kwon
Congratulations, Zhenhua Wang! Very well deserved.

2018-04-02 13:28 GMT+08:00 Wenchen Fan :

> Hi all,
>
> The Spark PMC recently added Zhenhua Wang as a committer on the project.
> Zhenhua is the major contributor to the CBO project and has been
> contributing across several areas of Spark for a while, focusing especially
> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>
> Wenchen
>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Xingbo Jiang
congs & welcome!

2018-04-02 13:28 GMT+08:00 Wenchen Fan :

> Hi all,
>
> The Spark PMC recently added Zhenhua Wang as a committer on the project.
> Zhenhua is the major contributor to the CBO project and has been
> contributing across several areas of Spark for a while, focusing especially
> on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>
> Wenchen
>


Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Wenchen Fan
Hi all,

The Spark PMC recently added Zhenhua Wang as a committer on the project.
Zhenhua is the major contributor to the CBO project and has been
contributing across several areas of Spark for a while, focusing especially
on the analyzer and optimizer in Spark SQL. Please join me in welcoming Zhenhua!

Wenchen


Embedded derby driver missing from 2.2.1 onwards

2018-04-01 Thread geoHeil
Hi,

I noticed that Spark standalone (run locally for development) no longer
supports the integrated Hive metastore, as some driver classes for Derby seem
to be missing from 2.2.1 onwards (including 2.3.0). On 2.2.0 and earlier
versions it works just fine to execute the following script:

spark.sql("CREATE database foobar")
The exception I see for newer versions of spark is:
NoClassDefFoundError: Could not initialize class
org.apache.derby.jdbc.EmbeddedDriver
Simply adding derby as a dependency in SBT did not solve this issue for me.
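For reference, a minimal sketch of the setup being described. The sbt
coordinates and Derby version in the comment are assumptions about the
attempted workaround, not something taken from the report:

// Minimal repro sketch, assuming a local standalone build with Hive support.
// (Assumed workaround attempt, e.g. in build.sbt:
//   libraryDependencies += "org.apache.derby" % "derby" % "10.12.1.1")
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("derby-metastore-repro")
  .enableHiveSupport()   // uses the embedded Derby-backed metastore by default
  .getOrCreate()

spark.sql("CREATE DATABASE foobar")   // fails with NoClassDefFoundError per the report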

Best,
Georg







Re: DataSourceV2 write input requirements

2018-04-01 Thread Patrick Woody
Yep, that sounds reasonable to me!

On Fri, Mar 30, 2018 at 5:50 PM, Ted Yu  wrote:

> +1
>
>  Original message 
> From: Ryan Blue 
> Date: 3/30/18 2:28 PM (GMT-08:00)
> To: Patrick Woody 
> Cc: Russell Spitzer , Wenchen Fan , Ted Yu ,
> Spark Dev List 
> Subject: Re: DataSourceV2 write input requirements
>
> You're right. A global sort would change the clustering if the sort had
> more fields than the clustering.
>
> Then what about this: if there is no RequiredClustering, then the sort is
> a global sort. If RequiredClustering is present, then the clustering is
> applied and the sort is a partition-level sort.
>
> That rule would mean that within a partition you always get the sort, but
> an explicit clustering overrides the partitioning a sort might try to
> introduce. Does that sound reasonable?
>
> rb
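To make the proposed rule concrete, here is a rough Scala sketch. The trait
and method names below are hypothetical and follow the thread's wording, not
a finalized DataSourceV2 interface:

// Hypothetical sketch only: names follow the discussion above, not a shipped API.
trait RequiredClustering {
  // Columns whose values must be co-located in a single write task/partition.
  def requiredClustering: Set[String]
}

trait RequiredOrdering {
  // Requested sort order. Per the proposed rule: a global sort when no
  // RequiredClustering is present, otherwise a sort within each partition.
  def requiredOrdering: Seq[String]
}

// Example sink matching the thread's example: cluster by (b, a), sort by (a, b, c).
class ExampleSink extends RequiredClustering with RequiredOrdering {
  override def requiredClustering: Set[String] = Set("b", "a")
  override def requiredOrdering: Seq[String] = Seq("a", "b", "c")
}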
>
> On Fri, Mar 30, 2018 at 12:39 PM, Patrick Woody 
> wrote:
>
>> Does that methodology work in this specific case? The ordering must be a
>> subset of the clustering to guarantee they exist in the same partition when
>> doing a global sort, I thought. Though I get the gist that if it does
>> satisfy, then there is no reason not to choose the global sort.
>>
>> On Fri, Mar 30, 2018 at 1:31 PM, Ryan Blue  wrote:
>>
>>> > Can you expand on how the ordering containing the clustering
>>> expressions would ensure the global sort?
>>>
>>> The idea was to basically assume that if the clustering can be satisfied
>>> by a global sort, then do the global sort. For example, if the clustering
>>> is Set("b", "a") and the sort is Seq("a", "b", "c") then do a global sort
>>> by columns a, b, and c.
>>>
>>> Technically, you could do this with a hash partitioner instead of a
>>> range partitioner and sort within each partition, but that doesn't make
>>> much sense because the partitioning would ensure that each partition has
>>> just one combination of the required clustering columns. Using a hash
>>> partitioner would make it so that the in-partition sort basically ignores
>>> the first few values, so it must be that the intent was a global sort.
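As a concrete illustration of the two plans contrasted above, a DataFrame-level
sketch (assuming a DataFrame named df with columns a, b, and c is in scope):

import org.apache.spark.sql.functions.col

// Global sort: range-partition by (a, b, c) and sort, giving a total order.
val globallySorted = df.orderBy(col("a"), col("b"), col("c"))

// Clustering plus local sort: hash-partition by (b, a), then sort each partition.
// Rows sharing (a, b) are co-located, but there is no ordering across partitions.
val clusteredAndSorted = df
  .repartition(col("b"), col("a"))
  .sortWithinPartitions(col("a"), col("b"), col("c"))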
>>>
>>> On Fri, Mar 30, 2018 at 6:51 AM, Patrick Woody >> > wrote:
>>>
 Right, you could use this to store a global ordering if there is only
> one write (e.g., CTAS). I don’t think anything needs to change in that
> case, you would still have a clustering and an ordering, but the ordering
> would need to include all fields of the clustering. A way to pass in the
> partition ordinal for the source to store would be required.


 Can you expand on how the ordering containing the clustering
 expressions would ensure the global sort? Having a RangePartitioning would
 certainly satisfy, but it isn't required - is the suggestion that if Spark
 sees this overlap, then it plans a global sort?

 On Thu, Mar 29, 2018 at 12:16 PM, Russell Spitzer <
 russell.spit...@gmail.com> wrote:

> @RyanBlue I'm hoping that through the CBO effort we will continue to
> get more detailed statistics. Like on read we could be using sketch data
> structures to get estimates on unique values and density for each column.
> You may be right that the real way for this to be handled would be giving
> a "cost" back to a higher-order optimizer which can decide which method to
> use rather than having the data source itself do it. This is probably in a
> far-future version of the API.
>
> On Thu, Mar 29, 2018 at 9:10 AM Ryan Blue  wrote:
>
>> Cassandra can insert records with the same partition-key faster if
>> they arrive in the same payload. But this is only beneficial if the
>> incoming dataset has multiple entries for the same partition key.
>>
>> Thanks for the example; the recommended partitioning use case makes
>> more sense now. I think we could have two interfaces, a
>> RequiresClustering and a RecommendsClustering if we want to support
>> this. But I’m skeptical it will be useful for two reasons:
>>
>>    - Do we want to optimize the low-cardinality case? Shuffles are
>>    usually much cheaper at smaller sizes, so I’m not sure it is necessary
>>    to optimize this away.
>>    - How do we know there aren’t just a few partition keys for all the
>>    records? It may look like a shuffle wouldn’t help, but we don’t know
>>    the partition keys until it is too late.
>>
>> Then there’s also the logic for avoiding the shuffle and how to
>> calculate the cost, which sounds like something that needs some details
>> from CBO.
>>
>> I would assume that given the estimated data size from Spark and
>> options passed in from the user, the data 

Re: Re: the issue about the + in column, can we support the string please?

2018-04-01 Thread 1427357...@qq.com
Hi,
I checked the code.
It seems it is hard to change the code.
In the current code, string + int is translated to double + double.
If I change string + int to string + string, it will be incompatible with
the old version.

Does anyone have a better idea about this issue, please?
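For reference, a small sketch of the current behavior described above
(assuming a local SparkSession named spark):

// Current coercion for `+` (a sketch of observed behavior, not a proposed change):
// string operands are cast to double, so non-numeric strings become null.
spark.sql("SELECT '1' + 2").show()    // 3.0  -- '1' is cast to double
spark.sql("SELECT 'abc' + 2").show()  // null -- the cast of 'abc' fails
// Explicit concatenation works today:
spark.sql("SELECT concat('abc', cast(2 AS string))").show()  // abc2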



1427357...@qq.com
 
From: Shmuel Blitz
Date: 2018-03-26 17:17
To: 1427357...@qq.com
CC: spark users; dev
Subject: Re: Re: the issue about the + in column, can we support the string please?
I agree.

Just pointed out the option, in case you missed it.

Cheers,
Shmuel

On Mon, Mar 26, 2018 at 10:57 AM, 1427357...@qq.com <1427357...@qq.com> wrote:
Hi,

Using concat is one way.
But + is more intuitive and easier to understand.



1427357...@qq.com
 
From: Shmuel Blitz
Date: 2018-03-26 15:31
To: 1427357...@qq.com
CC: spark users; dev
Subject: Re: the issue about the + in column, can we support the string please?
Hi,

You can get the same result with:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val lst = List(Row("Shmuel", 13), Row("Blitz", 23))
val rdd = sc.parallelize(lst)

val df = sqlContext.createDataFrame(rdd, schema)

df.withColumn("newName", concat($"name", lit("abc"))).show()

On Mon, Mar 26, 2018 at 6:36 AM, 1427357...@qq.com <1427357...@qq.com> wrote:
Hi  all,

I have a table like below:

+---+---------+-----------+
| id|     name|sharding_id|
+---+---------+-----------+
|  1|leader us|          1|
|  3|    mycat|          1|
+---+---------+-----------+

My schema is :
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- sharding_id: integer (nullable = false)

I want to add a new column named newName. The new column is based on "name" with
"abc" appended to it. My code looks like:

stud_scoreDF.withColumn("newName", stud_scoreDF.col("name") + "abc").show()
When I run the code, I get the result:
+---+---------+-----------+-------+
| id|     name|sharding_id|newName|
+---+---------+-----------+-------+
|  1|leader us|          1|   null|
|  3|    mycat|          1|   null|
+---+---------+-----------+-------+


I checked the code; the key code is in arithmetic.scala, line 165.
It looks like:

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = dataType match {
  case dt: DecimalType =>
    defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$$plus($eval2)")
  case ByteType | ShortType =>
    defineCodeGen(ctx, ev,
      (eval1, eval2) => s"(${ctx.javaType(dataType)})($eval1 $symbol $eval2)")
  case CalendarIntervalType =>
    defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.add($eval2)")
  case _ =>
    defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1 $symbol $eval2")
}

My question is:
Can we add a case for StringType in this class to support string append, please?
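For illustration only, one possible shape of such a case. This is an untested
sketch, not a patch; it assumes UTF8String.concat is in scope for the generated
code and that type coercion is also updated so both operands arrive as strings:

// Untested sketch of the requested case, not an actual change to Spark:
case StringType =>
  defineCodeGen(ctx, ev, (eval1, eval2) => s"UTF8String.concat($eval1, $eval2)")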





1427357...@qq.com



-- 
Shmuel Blitz 
Big Data Developer 
Email: shmuel.bl...@similarweb.com 
www.similarweb.com 



-- 
Shmuel Blitz 
Big Data Developer 
Email: shmuel.bl...@similarweb.com 
www.similarweb.com