pyspark does not seem to start py4j callback server

2015-11-24 Thread girishlg
Hi

We have a use case where we call a Scala function with a Python object as a
callback. The Python object implements a Scala trait. The call to the Scala
function goes through, but when it makes a callback through the passed-in
Python object we get a connection refused error. Looking further, we noticed
that the java_gateway.py launch_gateway method initializes the gateway object
with this line:

    # Connect to the gateway
    gateway = JavaGateway(GatewayClient(port=gateway_port),
                          auto_convert=True)

and as per https://www.py4j.org/py4j_java_gateway.html the callback server
is started only if callback_server_parameters is passed.

Could someone please help us understand whether the callback server is in
fact not started, and if so, whether there is any particular reason why it is disabled?

Thanks for any pointers
Girish
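
A minimal sketch of starting the callback server manually on the gateway that
PySpark creates might look like the following. This assumes a py4j version
that exposes CallbackServerParameters and uses PySpark's internal sc._gateway
handle; both are assumptions, not documented PySpark behaviour.

    from py4j.java_gateway import CallbackServerParameters
    from pyspark import SparkContext

    sc = SparkContext(appName="callback-demo")   # hypothetical app name

    # launch_gateway() only creates the JavaGateway; start the Python-side
    # callback server explicitly on that same gateway.
    gateway = sc._gateway                        # internal attribute, may change between versions
    gateway.start_callback_server(CallbackServerParameters())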







Re: A proposal for Spark 2.0

2015-11-24 Thread Sandy Ryza
I think that Kostas' logic still holds.  The majority of Spark users, and
likely an even larger majority of people running large jobs, are still on
RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
to upgrade to the stable version of the Dataset / DataFrame API so they
don't need to do so twice.  Requiring that they absorb all the other ways
that Spark breaks compatibility in the move to 2.0 makes it much more
difficult for them to make this transition.

Using the same set of APIs also means that it will be easier to backport
critical fixes to the 1.x line.

It's not clear to me that avoiding breakage of an experimental API in the
1.x line outweighs these issues.

-Sandy

On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin  wrote:

> I actually think the next one (after 1.6) should be Spark 2.0. The reason
> is that I already know we have to break some part of the DataFrame/Dataset
> API as part of the Dataset design. (e.g. DataFrame.map should return
> Dataset rather than RDD). In that case, I'd rather break this sooner (in
> one release) than later (in two releases), so the damage is smaller.
>
> I don't think whether we call Dataset/DataFrame experimental or not
> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
> then mark them as stable in 2.1. Despite being "experimental", there have
> been no breaking changes to DataFrame from 1.3 to 1.6.
>
>
>
> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
> wrote:
>
>> Ah, got it; by "stabilize" you meant changing the API, not just bug
>> fixing.  We're on the same page now.
>>
>> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
>> wrote:
>>
>>> A 1.6.x release will only fix bugs - we typically don't change APIs in z
>>> releases. The Dataset API is experimental and so we might be changing the
>>> APIs before we declare it stable. This is why I think it is important to
>>> first stabilize the Dataset API with a Spark 1.7 release before moving to
>>> Spark 2.0. This will benefit users that would like to use the new Dataset
>>> APIs but can't move to Spark 2.0 because of the backwards incompatible
>>> changes, like removal of deprecated APIs, Scala 2.11 etc.
>>>
>>> Kostas
>>>
>>>
>>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra 
>>> wrote:
>>>
 Why does stabilization of those two features require a 1.7 release
 instead of 1.6.1?

 On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis 
 wrote:

> We have veered off the topic of Spark 2.0 a little bit here - yes we
> can talk about RDD vs. DS/DF more, but let's refocus on Spark 2.0. I'd like
> to propose we have one more 1.x release after Spark 1.6. This will allow 
> us
> to stabilize a few of the new features that were added in 1.6:
>
> 1) the experimental Datasets API
> 2) the new unified memory manager.
>
> I understand our goal for Spark 2.0 is to offer an easy transition but
> there will be users that won't be able to seamlessly upgrade given what we
> have discussed as in scope for 2.0. For these users, having a 1.x release
> with these new features/APIs stabilized will be very beneficial. This 
> might
> make Spark 1.7 a lighter release but that is not necessarily a bad thing.
>
> Any thoughts on this timeline?
>
> Kostas Sakellis
>
>
>
> On Thu, Nov 12, 2015 at 8:39 PM, Cheng, Hao 
> wrote:
>
>> Agreed, more features/APIs/optimizations need to be added to DF/DS.
>>
>>
>>
>> I mean, we need to think about what kind of RDD APIs we have to
>> provide to developers; maybe the fundamental APIs are enough, like
>> ShuffledRDD etc. But PairRDDFunctions is probably not in this category, as
>> we can do the same thing easily with DF/DS, with even better performance.
>>
>>
>>
>> *From:* Mark Hamstra [mailto:m...@clearstorydata.com]
>> *Sent:* Friday, November 13, 2015 11:23 AM
>> *To:* Stephen Boesch
>>
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Hmmm... to me, that seems like precisely the kind of thing that
>> argues for retaining the RDD API but not as the first thing presented to
>> new Spark developers: "Here's how to use groupBy with DataFrames... Until
>> the optimizer is more fully developed, that won't always get you the best
>> performance that can be obtained.  In these particular circumstances, ...,
>> you may want to use the low-level RDD API while setting
>> preservesPartitioning to true.  Like this..."
>>
>>
>>
>> On Thu, Nov 12, 2015 at 7:05 PM, Stephen Boesch 
>> wrote:
>>
>> My understanding is that the RDDs presently have more support for
>> complete control of partitioning, which is a key 
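
To make the trade-off Mark describes concrete, a rough PySpark sketch of the
two routes is shown below; the column names and input path are hypothetical,
and this is only an illustration, not an official recommendation.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitioning-demo")   # hypothetical app name
    sqlContext = SQLContext(sc)

    # High-level route: express the aggregation and let the optimizer plan it.
    df = sqlContext.read.json("events.json")         # hypothetical input
    counts = df.groupBy("user_id").count()

    # Low-level route: keep an existing hash partitioning intact while
    # transforming values, by declaring that the keys are unchanged.
    pairs = sc.parallelize([(1, "a"), (2, "b"), (1, "c")]).partitionBy(4)
    upper = pairs.mapPartitions(
        lambda it: ((k, v.upper()) for k, v in it),
        preservesPartitioning=True)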

Re: [ANNOUNCE] Spark 1.6.0 Release Preview

2015-11-24 Thread Ted Yu
If I am not mistaken, the binaries for Scala 2.11 were generated against
Hadoop 1.

What about binaries for Scala 2.11 against Hadoop 2.x?

Cheers

On Sun, Nov 22, 2015 at 2:21 PM, Michael Armbrust 
wrote:

> In order to facilitate community testing of Spark 1.6.0, I'm excited to
> announce the availability of an early preview of the release. This is not a
> release candidate, so there is no voting involved. However, it'd be awesome
> if community members can start testing with this preview package and report
> any problems they encounter.
>
> This preview package contains all the commits to branch-1.6 till commit
> 308381420f51b6da1007ea09a02d740613a226e0.
>
> The staging maven repository for this preview build can be found here:
> https://repository.apache.org/content/repositories/orgapachespark-1162
>
> Binaries for this preview build can be found here:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-bin/
>
> A build of the docs can also be found here:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-preview2-docs/
>
> The full change log for this release can be found on JIRA.
>
> *== How can you help? ==*
>
> If you are a Spark user, you can help us test this release by taking a
> Spark workload and running on this preview release, then reporting any
> regressions.
>
> *== Major Features ==*
>
> When testing, we'd appreciate it if users could focus on areas that have
> changed in this release.  Some notable new features include:
>
> SPARK-11787  *Parquet
> Performance* - Improve Parquet scan performance when using flat schemas.
> SPARK-10810  *Session *
> *Management* - Multiple users of the thrift (JDBC/ODBC) server now have
> isolated sessions including their own default database (i.e. USE mydb)
> even on shared clusters.
> SPARK-   *Dataset
> API* - A new, experimental type-safe API (similar to RDDs) that performs
> many operations on serialized binary data and code generation (i.e. Project
> Tungsten)
> SPARK-1  *Unified
> Memory Management* - Shared memory for execution and caching instead of
> exclusive division of the regions.
> SPARK-10978  *Datasource
> API Avoid Double Filter* - When implementing a datasource with filter
> pushdown, developers can now tell Spark SQL to avoid double evaluating a
> pushed-down filter.
> SPARK-2629   *New
> improved state management* - trackStateByKey - a DStream transformation
> for stateful stream processing, supersedes updateStateByKey in
> functionality and performance.
>
> Happy testing!
>
> Michael
>
>
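
For anyone who wants to try this, a minimal smoke test run with spark-submit
from the unpacked preview binaries could look like the sketch below; the
script, the app name, and the numbers are illustrative only, not part of any
official test suite.

    # smoke_test.py -- run with: <preview-dir>/bin/spark-submit smoke_test.py
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="spark-1.6.0-preview-smoke-test")
    sqlContext = SQLContext(sc)

    # Exercise a basic RDD path and a basic DataFrame path.
    assert sc.parallelize(range(1000)).map(lambda x: x * 2).sum() == 999000
    df = sqlContext.createDataFrame([(i, i % 10) for i in range(1000)], ["id", "bucket"])
    assert df.groupBy("bucket").count().count() == 10

    print("preview smoke test passed")
    sc.stop()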


Re: A proposal for Spark 2.0

2015-11-24 Thread Matei Zaharia
What are the other breaking changes in 2.0 though? Note that we're not removing 
Scala 2.10, we're just making the default build be against Scala 2.11 instead 
of 2.10. There seem to be very few changes that people would worry about. If 
people are going to update their apps, I think it's better to make the other 
small changes in 2.0 at the same time than to update once for Dataset and 
another time for 2.0.

BTW just refer to Reynold's original post for the other proposed API changes.

Matei

> On Nov 24, 2015, at 12:27 PM, Sandy Ryza  wrote:
> 
> I think that Kostas' logic still holds.  The majority of Spark users, and 
> likely an even larger majority of people running large jobs, are still on 
> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want to 
> upgrade to the stable version of the Dataset / DataFrame API so they don't 
> need to do so twice.  Requiring that they absorb all the other ways that 
> Spark breaks compatibility in the move to 2.0 makes it much more difficult 
> for them to make this transition.
> 
> Using the same set of APIs also means that it will be easier to backport 
> critical fixes to the 1.x line.
> 
> It's not clear to me that avoiding breakage of an experimental API in the 1.x 
> line outweighs these issues.
> 
> -Sandy 

sqlContext vs hivecontext

2015-11-24 Thread Pranay Tonpay
Hi,
When I am using hivecontext, I am able to query for individual columns from
a table, whereas when using sqlContext only a select * works.
Is it possible to use sqlContext and still query for specific columns from
a Hive table?
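
For reference, the two usual ways to project specific columns through the
Python API are sketched below; the table and column names are hypothetical,
and whether a plain sqlContext can see a Hive table at all depends on how
the table was registered.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="column-projection")   # hypothetical app name
    sqlContext = SQLContext(sc)

    # Via SQL: project only the columns you need.
    subset = sqlContext.sql("SELECT col_a, col_b FROM my_table")

    # Via the DataFrame API: the same projection without writing SQL.
    subset_df = sqlContext.table("my_table").select("col_a", "col_b")

    subset.show()
    subset_df.show()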


what should I know to implement twitter streaming for pyspark?

2015-11-24 Thread Amir Rahnama
I want to end the situation where Python users of Spark need to implement the
Twitter source for streaming by themselves. Yuhu!

Is there anything I need to know? I've looked at the Scala and Java code and got some ideas.

-- 
Thanks and Regards,

Amir Hossein Rahnama

*Tel: +46 (0) 761 681 102*
Website: www.ambodi.com
Twitter: @_ambodi 


Re: what should I know to implement twitter streaming for pyspark?

2015-11-24 Thread Julio Antonio Soto de Vicente
Hi Amir,

I believe that the first step should be looking for a library that implements 
the streaming API.


> On 24/11/2015, at 10:32, Amir Rahnama  wrote:
> 
> I want to end the situation where Python users of Spark need to implement the 
> Twitter source for streaming by themselves. Yuhu!
> 
> Is there anything I need to know? I've looked at the Scala and Java code and got some ideas.
> 
> -- 
> Thanks and Regards,
> 
> Amir Hossein Rahnama
> 
> Tel: +46 (0) 761 681 102
> Website: www.ambodi.com
> Twitter: @_ambodi
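
Until such a library exists, one possible shape for this is to bridge tweets
into Spark Streaming over a local socket. The sketch below assumes an external
process (for example a script using a Twitter client library such as tweepy)
writes one JSON-encoded tweet per line to localhost:9009; the port, the tweet
field layout, and the client library are all assumptions, not existing PySpark
functionality.

    import json

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="twitter-bridge")      # hypothetical app name
    ssc = StreamingContext(sc, batchDuration=10)     # 10-second batches

    # Spark only sees a plain text stream; the Twitter-specific work happens
    # in the external process feeding this socket.
    lines = ssc.socketTextStream("localhost", 9009)

    tweets = lines.map(json.loads)
    hashtags = tweets.flatMap(
        lambda t: [tag["text"] for tag in t.get("entities", {}).get("hashtags", [])])

    hashtags.countByValue().pprint()                 # rough per-batch popularity

    ssc.start()
    ssc.awaitTermination()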


Streaming : stopping output transformations explicitly

2015-11-24 Thread Yogesh Mahajan
Hi,

Is there a way to stop output transformations on a stream without stopping
the StreamingContext?

Yogesh Mahajan
SnappyData Inc.,
OLTP+OLAP inside Spark for real-time analytics
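
I am not aware of a dedicated API for this; one workaround sketch (an
assumption on my part, not an official mechanism) is to guard each output
action with a driver-side flag, since the function passed to foreachRDD runs
on the driver once per batch. The source, port, and sink path below are
hypothetical.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="toggle-output")       # hypothetical app name
    ssc = StreamingContext(sc, batchDuration=5)

    lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source

    # Driver-side switch: flip it to False to skip the output work for later
    # batches while the StreamingContext itself keeps running.
    output_enabled = {"value": True}

    def maybe_write(time, rdd):
        # Executes on the driver for every batch, so it can read the flag directly.
        if output_enabled["value"] and not rdd.isEmpty():
            rdd.saveAsTextFile("out/batch-%s" % time.strftime("%Y%m%d%H%M%S"))  # hypothetical sink

    lines.foreachRDD(maybe_write)

    ssc.start()
    # ... later, e.g. from another driver thread: output_enabled["value"] = False
    ssc.awaitTermination()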