Register catalyst expression as SQL DSL

2018-07-09 Thread geoHeil
Hi,

I would like to register custom Catalyst expressions as a SQL DSL
https://stackoverflow.com/questions/51199761/spark-register-expression-for-sql-dsl
Can someone shed some light here? The documentation does not seem to contain
much information about Catalyst internals.

Thanks a lot.
Georg



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Embedded derby driver missing from 2.2.1 onwards

2018-04-01 Thread geoHeil
Hi,

I noticed that Spark standalone (run locally for development) no longer
supports the integrated Hive metastore, as some driver classes for Derby seem
to be missing from 2.2.1 onwards (2.3.0). It works just fine on 2.2.0
or previous versions to execute the following script:

spark.sql("CREATE database foobar")
The exception I see for newer versions of Spark is:
NoClassDefFoundError: Could not initialize class
org.apache.derby.jdbc.EmbeddedDriver
Simply adding Derby as a dependency in SBT did not solve this issue for me.

Best,
Georg
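For reference, "adding Derby as a dependency in SBT" means a build.sbt line like the one below. This is a sketch, and the version number is an assumption — the Derby version generally has to match what the Hive metastore client bundled with your Spark version expects, which is why an arbitrary artifact may still leave the class uninitialized:

```scala
// build.sbt fragment (hypothetical version; Derby must be compatible with
// the Hive metastore client bundled with your Spark version)
libraryDependencies += "org.apache.derby" % "derby" % "10.12.1.1"
```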






Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread geoHeil
Thanks for the clarification.
Matei Zaharia [via Apache Spark Developers List] <
ml+s1001551n21526...@n3.nabble.com> wrote on Mon., 8 May 2017 at 03:51:

> More specifically, many user applications that link to Spark also linked
> to Akka as a library (e.g. say you want to write a service that receives
> requests from Akka and runs them on Spark). In that case, you'd have two
> conflicting versions of the Akka library in the same JVM.
>
> Matei
>
> On May 7, 2017, at 2:24 PM, Mark Hamstra <[hidden email]> wrote:
>
> The point is that Spark's prior usage of Akka was limited enough that it
> could fairly easily be removed entirely instead of forcing particular
> architectural decisions on Spark's users.
>
>
> On Sun, May 7, 2017 at 1:14 PM, geoHeil <[hidden email]> wrote:
>
>> Thank you!
>> In the issue they outline that hard-wired dependencies were the problem.
>> But wouldn't one want to avoid accepting messages directly from an actor
>> and instead have Kafka as a failsafe intermediary?
>>
>> zero323 [via Apache Spark Developers List] <[hidden email]> wrote on
>> Sun., 7 May 2017 at 21:17:
>>
>>> https://issues.apache.org/jira/browse/SPARK-5293
>>>
>>>
>>> On 05/07/2017 08:59 PM, geoHeil wrote:
>>>
>>> > Hi,
>>> >
>>> > I am curious why Spark (as of 2.0, completely) removed all Akka
>>> > dependencies for RPC and switched entirely to (as far as I know) Netty.
>>> >
>>> > regards,
>>> > Georg

Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread geoHeil
Thank you!
In the issue they outline that hard-wired dependencies were the problem.
But wouldn't one want to avoid accepting messages directly from an actor and
instead have Kafka as a failsafe intermediary?

zero323 [via Apache Spark Developers List] <
ml+s1001551n21523...@n3.nabble.com> wrote on Sun., 7 May 2017 at
21:17:

> https://issues.apache.org/jira/browse/SPARK-5293
>
>
> On 05/07/2017 08:59 PM, geoHeil wrote:
>
> > Hi,
> >
> > I am curious why Spark (as of 2.0, completely) removed all Akka
> > dependencies for RPC and switched entirely to (as far as I know) Netty.
> >
> > regards,
> > Georg

Why did spark switch from AKKA to net / ...

2017-05-07 Thread geoHeil
Hi,

I am curious why Spark (as of 2.0, completely) removed all Akka dependencies
for RPC and switched entirely to (as far as I know) Netty.

regards,
Georg






Re: handling of empty partitions

2017-01-08 Thread geoHeil
Thanks a lot, Holden.

@Liang-Chi Hsieh did you try to run
https://gist.github.com/geoHeil/6a23d18ccec085d486165089f9f430f2 ? For me
it crashes at either line 51 or 58. Holden described the problem
pretty well. Is it clear for you now?

Cheers,
Georg

Holden Karau [via Apache Spark Developers List] <
ml-node+s1001551n20516...@n3.nabble.com> wrote on Mon., 9 Jan. 2017 at
06:40:

> Hi Georg,
>
> Thanks for the question along with the code (as well as posting to stack
> overflow). In general if a question is well suited for stackoverflow its
> probably better suited to the user@ list instead of the dev@ list so I've
> cc'd the user@ list for you.
>
> As far as handling empty partitions when working with mapPartitions (and
> similar), the general approach is to return an empty iterator of the
> correct type when you have an empty input iterator.
>
> It looks like your code is doing this; however, it seems like you likely
> have a bug in your application logic (namely, it assumes that if a partition
> has a record missing a value it will either have had a previous row in the
> same partition which is good OR that the previous partition is not empty
> and has a good row - which need not necessarily be the case). You've
> partially fixed this problem by going through and for each partition
> collecting the last previous good value, and then if you don't have a good
> value at the start of a partition look up the value in the collected array.
>
> However, if this happens when the previous partition is also
> empty, you will need to keep looking up earlier partitions' values
> until you find the one you are looking for. (Note this assumes that the
> first record in your dataset is valid; if it isn't, your code will still
> fail.)
>
> Your solution is really close to working but just has some minor
> assumptions which don't always necessarily hold.
>
> Cheers,
>
> Holden :)
> On Sun, Jan 8, 2017 at 8:30 PM, Liang-Chi Hsieh <[hidden email]> wrote:
>
>
> Hi Georg,
>
> Can you describe your question more clearly?
>
> Actually, the example code you posted on Stack Overflow doesn't crash as
> you said in the post.
>
>
> geoHeil wrote
> > I am working on building a custom ML pipeline model / estimator to impute
> > missing values, e.g. I want to fill with the last known good value.
> > Using a window function is slow / will put the data into a single
> > partition.
> > I built some sample code using the RDD API; however, it has some None /
> > null problems with empty partitions.
> >
> > How should this be implemented properly to handle such empty partitions?
> >
> http://stackoverflow.com/questions/41474175/spark-mappartitionswithindex-handling-empty-partitions
> >
> > Kind regards,
> > Georg
>
>
>
>
>
> -
>
>
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
>
>
>
> --
> Cell: 425-233-8271
> Twitter: https://twitter.com/holdenkarau




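Holden's recipe above — collect each partition's last good value, then, when a partition starts with missing values, walk backwards over earlier partitions (skipping the empty ones) to find a seed — can be sketched in plain Python. This is a toy model with partitions as lists, not actual Spark/RDD code; the function name is illustrative:

```python
def forward_fill(partitions):
    """Fill None values with the last known good value, across partitions.

    `partitions` is a list of lists, simulating an RDD's partitions.
    Assumes the first record of the whole dataset is not None.
    """
    # Pass 1: last good (non-None) value seen in each partition, if any.
    last_good = []
    for part in partitions:
        goods = [x for x in part if x is not None]
        last_good.append(goods[-1] if goods else None)

    # Pass 2: fill each partition, seeding the carry value from earlier
    # partitions. The backwards walk is the step that handles empty
    # partitions (and partitions containing only None values).
    result = []
    for i, part in enumerate(partitions):
        carry = None
        for j in range(i - 1, -1, -1):
            if last_good[j] is not None:
                carry = last_good[j]
                break
        filled = []
        for x in part:
            if x is None:
                x = carry
            else:
                carry = x
            filled.append(x)
        result.append(filled)
    return result
```

For example, `forward_fill([[1, None], [], [None, None, 5], [None]])` returns `[[1, 1], [], [1, 1, 5], [5]]` — the empty second partition is skipped when seeding the third.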

handling of empty partitions

2017-01-06 Thread geoHeil
I am working on building a custom ML pipeline model / estimator to impute
missing values, e.g. I want to fill with the last known good value.
Using a window function is slow / will put the data into a single partition.
I built some sample code using the RDD API; however, it has some None / null
problems with empty partitions.

How should this be implemented properly to handle such empty partitions?
http://stackoverflow.com/questions/41474175/spark-mappartitionswithindex-handling-empty-partitions

Kind regards,
Georg






Re: Clarification about typesafe aggregations

2017-01-04 Thread geoHeil
Thanks for the clarification.
rxin [via Apache Spark Developers List] <
ml-node+s1001551n20462...@n3.nabble.com> wrote on Wed., 4 Jan. 2017 at
23:37:

> Your understanding is correct - it is indeed slower due to extra
> serialization. In some cases we can get rid of the serialization if the
> value is already deserialized.
>
> On Wed, Jan 4, 2017 at 7:19 AM, geoHeil <[hidden email]> wrote:
>
> Hi, I would like to know more about typesafe aggregations in Spark.
>
>
> http://stackoverflow.com/questions/40596638/inquiries-about-spark-2-0-dataset/40602882?noredirect=1#comment70139481_40602882
> An example of these is
> https://blog.codecentric.de/en/2016/07/spark-2-0-datasets-case-classes/
> ds.groupByKey(body => body.color)
>
> does
> "myDataSet.map(foo.someVal) is type safe but as any Dataset operation uses
> RDD and compared to DataFrame operations there is a significant overhead.
> Let's take a look at a simple example:"
> hold true, e.g. will typesafe aggregation require the deserialisation of
> the full objects, as displayed for
> ds.map(_.foo).explain ?
>
> Kind regards,
> Georg
>
>
>




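rxin's point about extra serialization can be illustrated with a toy Python analogy (this is not Spark code — pickle stands in for Spark's encoders, and the function names are made up): an untyped, column-level projection can read a field without reconstructing the whole object, while a typed map(_.foo) must first deserialize the full object for every row, which is where the overhead comes from.

```python
import pickle

# Rows kept in "encoded" form, a stand-in for Spark's internal binary format.
rows = [{"foo": i, "payload": "x" * 1000} for i in range(5)]
encoded = [pickle.dumps(r) for r in rows]

def select_foo(rows):
    # "DataFrame-style": the engine can project just the needed column.
    return [r["foo"] for r in rows]

def map_foo(encoded_rows):
    # "Typed Dataset-style" map(_.foo): the full object is deserialized
    # before the user function runs -- extra work per row, even though
    # only one small field is actually needed.
    return [pickle.loads(b)["foo"] for b in encoded_rows]

print(select_foo(rows) == map_foo(encoded))  # same result, more work
```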

Clarification about typesafe aggregations

2017-01-04 Thread geoHeil
Hi, I would like to know more about typesafe aggregations in Spark.

http://stackoverflow.com/questions/40596638/inquiries-about-spark-2-0-dataset/40602882?noredirect=1#comment70139481_40602882
An example of these is
https://blog.codecentric.de/en/2016/07/spark-2-0-datasets-case-classes/
ds.groupByKey(body => body.color)

does
"myDataSet.map(foo.someVal) is type safe but as any Dataset operation uses
RDD and compared to DataFrame operations there is a significant overhead.
Let's take a look at a simple example:"
hold true, e.g. will typesafe aggregation require the deserialisation of the
full objects, as displayed for
ds.map(_.foo).explain ?

Kind regards,
Georg


