Fwd: [SparkSQL] Project using NamedExpression

2017-03-21 Thread Aviral Agarwal
Hi guys,

I want to transform a Row using a NamedExpression.

Below is the code snippet that I am using:


import scala.collection.JavaConverters._

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.catalyst.{InternalRow, SqlParser}
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}

// Note: this has to live in the org.apache.spark.sql package, since
// Column.named and DataFrame.logicalPlan are private[sql].
def apply(dataFrame: DataFrame,
          selectExpressions: java.util.List[String]): RDD[UnsafeRow] = {

  // Parse each expression string into a NamedExpression.
  val exprArray = selectExpressions.asScala.map(s =>
    Column(SqlParser.parseExpression(s)).named
  )

  // Attributes of the input DataFrame's logical plan.
  val inputSchema = dataFrame.logicalPlan.output

  // Generate an UnsafeProjection per partition and apply it to each row.
  val transformedRDD = dataFrame.mapPartitions { iter =>
    val project = UnsafeProjection.create(exprArray, inputSchema)
    iter.map { row =>
      project(InternalRow.fromSeq(row.toSeq))
    }
  }

  transformedRDD
}


The problem is that the expression becomes unevaluable:

Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: 'a
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.genCode(Expression.scala:233)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.genCode(unresolved.scala:53)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:106)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:102)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.expressions.Expression.gen(Expression.scala:102)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext$$anonfun$generateExpressions$1.apply(CodeGenerator.scala:464)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenContext.generateExpressions(CodeGenerator.scala:464)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:281)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:324)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:317)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.create(GenerateUnsafeProjection.scala:32)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:635)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:125)
at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:135)
at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(ScalaTransform.scala:31)
at org.apache.spark.sql.ScalaTransform$$anonfun$3.apply(ScalaTransform.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


This might be because the Expression is unresolved.
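
If so, I guess the parsed expressions need to be resolved against the input
attributes before the projection is generated. Something like the rough
sketch below (untested, written against the Spark 1.6-era Catalyst
internals; resolveExprs is just an illustrative name) is what I have in
mind:

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression}

// Untested sketch: substitute every UnresolvedAttribute with the matching
// attribute of the input plan, so that UnsafeProjection.create only sees
// resolved expressions. Qualified and nested names are not handled here.
def resolveExprs(exprs: Seq[NamedExpression],
                 input: Seq[Attribute]): Seq[NamedExpression] =
  exprs.map { expr =>
    expr.transformUp {
      case u: UnresolvedAttribute =>
        input.find(_.name.equalsIgnoreCase(u.name)).getOrElse(
          sys.error(s"Cannot resolve '${u.name}' among " +
            input.map(_.name).mkString(", ")))
    }.asInstanceOf[NamedExpression]
  }

That would make the call UnsafeProjection.create(resolveExprs(exprArray,
inputSchema), inputSchema). Alternatively, letting the analyzer do the work
- e.g. dataFrame.selectExpr(selectExpressions.asScala: _*) and then reading
queryExecution.toRdd - would avoid hand-rolling this, at the cost of
getting an RDD[InternalRow] instead of RDD[UnsafeRow].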

Any help would be appreciated.

Thanks and Regards,
Aviral Agarwal


Re: subscribe to spark dev list

2017-03-21 Thread Yash Sharma
Sorry for the spam, used the wrong email address.

On Wed, 22 Mar 2017 at 12:01 Yash Sharma  wrote:

> subscribe to spark dev list
>


subscribe to spark dev list

2017-03-21 Thread Yash Sharma
subscribe to spark dev list


Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759, I don't think that needs to be targeted for 2.1.1, so we
don't need to worry about it.

On Tue, 21 Mar 2017 at 13:49 Holden Karau  wrote:

> I agree with Michael; I think we've got some outstanding issues, but none
> of them seem like regressions from 2.1, so we should be good to start the
> RC process.
>
> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust 
> wrote:
>
> Please speak up if I'm wrong, but none of these seem like critical
> regressions from 2.1.  As such I'll start the RC process later today.
>
> On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau 
> wrote:
>
> I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
> Maybe we can get TD's input on it?
>
> On Mon, Mar 20, 2017 at 8:48 PM Nan Zhu  wrote:
>
> I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
> blocker
>
> Best,
>
> Nan
>
> On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
> wrote:
>
> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If you're working on something for Spark 2.1.1 and it doesn't show up in
> this list, please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0 but aren't targeted for
> 2.1.1 - if there is something you are working on there which should be
> targeted for 2.1.1, please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)) & generally none of them seem to be
>
> (Note: the cfs are for Target Version/s field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690 - Join a streaming DataFrame with a batch DataFrame may not
> work - PR https://github.com/apache/spark/pull/17052 (review in progress
> by zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035 - rand() function in case when cause failed - no outstanding
> PR (consensus on JIRA seems to be leaning towards it being a real issue,
> but not necessarily everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522 - --executor-memory flag doesn't work in local-cluster mode
> - https://github.com/apache/spark/pull/16975 (review in progress by
> vanzin, but PR currently stalled waiting on response)*
>  Core:
>   SPARK-20025 - Driver fail over will not work, if SPARK_LOCAL* env is
> set. - https://github.com/apache/spark/pull/17357 (waiting on review)*
>  PySpark:
>   SPARK-19955 - Update run-tests to support conda [part of dropping 2.6
> support -- which we shouldn't do in a minor release -- but also fixes pip
> installability tests to run in Jenkins] - PR failing Jenkins (I need to
> poke this some more, but it seems like 2.7 support works with some other
> issues remaining. Maybe slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612 - Tests failing with timeout - no PR per se, but it seems
> unrelated to the 2.1.1 release. It's not targeted for 2.1.1 but listed as
> affecting 2.1.1 - I'd consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570 - Allow to disable hive in pyspark shell -
> https://github.com/apache/spark/pull/16906 - PR exists but it's difficult
> to add automated tests for this (although if SPARK-19955 gets in it would
> make testing this easier) - no reviewers yet. Possible re-target?*
>  Structured Streaming:
>   SPARK-19613 - Flaky test: StateStoreRDDSuite.versioning and immutability
> - It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd consider
> explicitly targeting this for 2.2?
>  ML:
>   SPARK-19759 - ALSModel.predict on Dataframes: potential optimization by
> not using blas - no PR; consider re-targeting unless someone has a PR
> waiting in the wings?

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Holden Karau
I agree with Michael; I think we've got some outstanding issues, but none of
them seem like regressions from 2.1, so we should be good to start the RC
process.

On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust 
wrote:

> Please speak up if I'm wrong, but none of these seem like critical
> regressions from 2.1.  As such I'll start the RC process later today.

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Michael Armbrust
Please speak up if I'm wrong, but none of these seem like critical
regressions from 2.1.  As such I'll start the RC process later today.

On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau  wrote:

> I'm not super sure it should be a blocker for 2.1.1 -- is it a regression?
> Maybe we can get TD's input on it?

Re: Why are DataFrames always read with nullable=True?

2017-03-21 Thread Jason White
Thanks for pointing to those JIRA tickets; I hadn't seen them. It's
encouraging that they are recent. I hope we can find a solution there.






Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-21 Thread Dongjin Lee
Hi Chetan,

Sadly, you cannot; Spark is configured to ignore null values when writing
JSON. (Check JacksonMessageWriter and find JsonInclude.Include.NON_NULL in
the code.) If you want that functionality, it would be best to file an
issue in JIRA.
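
In the meantime, one workaround - a rough, untested sketch, assuming a Spark
2.0.x session, top-level columns of simple types, and Jackson on the
classpath (it ships with Spark); toJsonWithNulls is just an illustrative
name - is to render the JSON yourself instead of going through
df.write.json:

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.{DataFrame, Dataset}

def toJsonWithNulls(df: DataFrame): Dataset[String] = {
  import df.sparkSession.implicits._
  val fieldNames = df.schema.fieldNames
  df.mapPartitions { rows =>
    // One mapper per partition: ObjectMapper is not serializable.
    val mapper = new ObjectMapper()
    rows.map { row =>
      val node = mapper.createObjectNode()
      fieldNames.zipWithIndex.foreach { case (name, i) =>
        if (row.isNullAt(i)) node.putNull(name) // keep the null in the output
        else node.putPOJO(name, row.get(i))     // naive: Jackson's default rendering
      }
      mapper.writeValueAsString(node)
    }
  }
}

Then toJsonWithNulls(df).write.text("/path/to/output") would write one JSON
object per line, nulls included.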

Best,
Dongjin

On Mon, Mar 20, 2017 at 4:44 PM, Chetan Khatri 
wrote:

> Exactly.
>
> On Sat, Mar 11, 2017 at 1:35 PM, Dongjin Lee  wrote:
>
>> Hello Chetan,
>>
>> Could you post some code? If I understood correctly, you are trying to
>> save JSON like:
>>
>> {
>>   "first_name": "Dongjin",
>>   "last_name": null
>> }
>>
>> not in omitted form, like:
>>
>> {
>>   "first_name": "Dongjin"
>> }
>>
>> right?
>>
>> - Dongjin
>>
>> On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Dev / Users,
>>>
>>> I am working on a PySpark code migration to Scala. With Python, iterating
>>> Spark with a dictionary and generating JSON with null is possible with
>>> json.dumps(), which will be converted to a SparkSQL Row; but in Scala, how
>>> can we generate JSON with null values as a DataFrame?
>>>
>>> Thanks.
>>>
>>
>
>


-- 
*Dongjin Lee*

Software developer in Line+.
So interested in massive-scale machine learning.

facebook: www.facebook.com/dongjin.lee.kr
linkedin: kr.linkedin.com/in/dongjinleekr
github: github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr