Re: RE: Spark checkpoint problem

2015-11-26 Thread eric wong
I don't think it is a deliberate design.

So you may need to run an action on the data RDD before the action on the temp
RDD, if you want to explicitly checkpoint the data RDD.
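
For example, a minimal sketch of that workaround in the spark-shell (using the same
data/temp definitions as in the quoted message below):

  sc.setCheckpointDir("/tmp/checkpoint")

  val data = sc.parallelize(List("a", "b", "c"))
  data.checkpoint()
  data.count()   // action on data: doCheckpoint() runs on data and writes it to the checkpoint dir

  val temp = data.map(item => (item, 1))
  temp.checkpoint()
  temp.count()   // action on temp: only temp is checkpointed, since doCheckpoint() stops at an RDD
                 // that has checkpoint data and does not recurse into its dependencies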


2015-11-26 13:23 GMT+08:00 wyphao.2007 :

> Spark 1.5.2.
>
> On 2015-11-26 13:19:39, "张志强(旺轩)" wrote:
>
> What’s your spark version?
>
> *From:* wyphao.2007 [mailto:wyphao.2...@163.com]
> *Sent:* November 26, 2015 10:04
> *To:* user
> *Cc:* dev@spark.apache.org
> *Subject:* Spark checkpoint problem
>
> I am testing checkpoint to understand how it works. My code is as follows:
>
>
>
> scala> val data = sc.parallelize(List("a", "b", "c"))
>
> data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at :15
>
>
>
> scala> sc.setCheckpointDir("/tmp/checkpoint")
>
> 15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be
> non-local if Spark is running on a cluster: /tmp/checkpoint1
>
>
>
> scala> data.checkpoint
>
>
>
> scala> val temp = data.map(item => (item, 1))
>
> temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map
> at :17
>
>
>
> scala> temp.checkpoint
>
>
>
> scala> temp.count
>
>
>
> but I found that only the temp RDD is checkpointed in the /tmp/checkpoint
> directory; the data RDD is not checkpointed! I found the doCheckpoint
> function in the org.apache.spark.rdd.RDD class:
>
>
>
>   private[spark] def doCheckpoint(): Unit = {
>     RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false,
>         ignoreParent = true) {
>       if (!doCheckpointCalled) {
>         doCheckpointCalled = true
>         if (checkpointData.isDefined) {
>           checkpointData.get.checkpoint()
>         } else {
>           dependencies.foreach(_.rdd.doCheckpoint())
>         }
>       }
>     }
>   }
>
>
>
> From the code above, only the last RDD (in my case, temp) will be
> checkpointed. My question: is this deliberately designed, or is it a bug?
>
> Thank you.



-- 
王海华


Re: Unchecked contribution (JIRA and PR)

2015-11-26 Thread Sergio Ramírez

OK, I'll do that. Thanks for the response.

El 17/11/15 a las 01:36, Joseph Bradley escribió:

Hi Sergio,

Apart from apologies about limited review bandwidth (from me too!), I 
wanted to add: It would be interesting to hear what feedback you've 
gotten from users of your package. Perhaps you could collect feedback 
by (a) emailing the user list and (b) adding a note on the Spark 
Packages listing pointing to the JIRA, and encouraging users to add their 
comments directly to the JIRA.  That'd be a nice way to get a sense of 
use cases and priority.


Thanks for your patience,
Joseph

On Wed, Nov 4, 2015 at 7:23 AM, Sergio Ramírez wrote:


OK, for me, time is not a problem. I was just worried that there
was no movement on those issues. I think they are good
contributions. For example, I have found no complex discretization
algorithm in MLlib, which is rare. My algorithm, a Spark
implementation of the well-known discretizer developed by Fayyad
and Irani, could be considered a good starting point for the
discretization part. Furthermore, it is also supported by two
scientific articles.

Anyway, I uploaded these two algorithms as two different packages
to spark-packages.org, but I would
like to contribute directly to MLlib. I understand you have a lot
of requests, and it is not possible to include all the
contributions made by the Spark community.

I'll be patient and ready to collaborate.

Thanks again


On 03/11/15 16:30, Jerry Lam wrote:

Sergio, you are not alone for sure. Check the RowSimilarity
implementation [SPARK-4823]. It has been there for 6 months. It
is very likely that contributions which aren't merged into the version
of Spark they were developed against will never be merged, because Spark
changes quite significantly from version to version when an algorithm
depends heavily on internal APIs.

On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin wrote:

Sergio,

Usually it takes a lot of effort to get something merged into
Spark itself, especially for relatively new algorithms that
might not have established itself yet. I will leave it to
mllib maintainers to comment on the specifics of the
individual algorithms proposed here.

Just another general comment: we have been working on making
packages as easy to use as possible for Spark users. Right
now it only requires a simple flag passed to the
spark-submit script to include a package.
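
For instance, including a package from Spark Packages looks roughly like this
(the package coordinates below are placeholders, not one of the packages
discussed in this thread):

  ./bin/spark-submit \
    --packages com.example:some-spark-package_2.10:0.1.0 \
    --class com.example.MyApp \
    my-app.jar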


On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez wrote:

Hello all:

I developed two packages for MLlib in March. These have
also been uploaded to the spark-packages repository.
Associated with these packages, I created two JIRA
issues and the corresponding pull requests, which are
listed below:

https://github.com/apache/spark/pull/5184
https://github.com/apache/spark/pull/5170

https://issues.apache.org/jira/browse/SPARK-6531
https://issues.apache.org/jira/browse/SPARK-6509

These remain unassigned in JIRA and unverified on GitHub.

Could anyone explain why they are still in this state? Is
this normal?

Thanks!

Sergio R.

-- 


Sergio Ramírez Gallego
Research group on Soft Computing and Intelligent
Information Systems,
Dept. Computer Science and Artificial Intelligence,
University of Granada, Granada, Spain.
Email: srami...@decsai.ugr.es 
Research Group URL: http://sci2s.ugr.es/



Re: A proposal for Spark 2.0

2015-11-26 Thread Sean Owen
Maintaining both a 1.7 and 2.0 is too much work for the project, which
is over-stretched now. This means that after 1.6 it's just small
maintenance releases in 1.x and no substantial features or evolution.
This means that the "in progress" APIs in 1.x will stay that way,
unless one updates to 2.x. It's not unreasonable, but it means the update
to the 2.x line isn't going to be that optional for users.

Scala 2.10 is already EOL right? Supporting it in 2.x means supporting
it for a couple years, note. 2.10 is still used today, but that's the
point of the current stable 1.x release in general: if you want to
stick to current dependencies, stick to the current release. Although
I think that's the right way to think about support across major
versions in general, I can see that 2.x is more of a required update
for those following the project's fixes and releases. Hence may indeed
be important to just keep supporting 2.10.

I can't see supporting 2.12 at the same time (right?). Is that a
concern? It will be long since GA by the time 2.x is first released.

There's another fairly coherent worldview where development continues
in 1.7 and focuses on finishing the loose ends and lots of bug fixing.
2.0 is delayed somewhat into next year, and by that time supporting
2.11+2.12 and Java 8 looks more feasible and more in tune with
currently deployed versions.

I can't say I have a strong view but I personally hadn't imagined 2.x
would start now.


On Thu, Nov 26, 2015 at 7:00 AM, Reynold Xin  wrote:
> I don't think we should drop support for Scala 2.10, or make it harder in
> terms of operations for people to upgrade.
>
> If there are no further objections, I'm going to remove the 1.7 version
> and retarget things to 2.0 on JIRA.
>
>
> On Wed, Nov 25, 2015 at 12:54 AM, Sandy Ryza 
> wrote:
>>
>> I see.  My concern is / was that cluster operators will be reluctant to
>> upgrade to 2.0, meaning that developers using those clusters need to stay on
>> 1.x, and, if they want to move to DataFrames, essentially need to port their
>> app twice.
>>
>> I misunderstood and thought part of the proposal was to drop support for
>> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
>> will make it less palatable to cluster administrators than releases in the
>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>
>> -Sandy
>>
>>
>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia 
>> wrote:
>>>
>>> What are the other breaking changes in 2.0 though? Note that we're not
>>> removing Scala 2.10, we're just making the default build be against Scala
>>> 2.11 instead of 2.10. There seem to be very few changes that people would
>>> worry about. If people are going to update their apps, I think it's better
>>> to make the other small changes in 2.0 at the same time than to update once
>>> for Dataset and another time for 2.0.
>>>
>>> BTW just refer to Reynold's original post for the other proposed API
>>> changes.
>>>
>>> Matei
>>>
>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza  wrote:
>>>
>>> I think that Kostas' logic still holds.  The majority of Spark users, and
>>> likely an even vaster majority of people running vaster jobs, are still on
>>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
>>> to upgrade to the stable version of the Dataset / DataFrame API so they
>>> don't need to do so twice.  Requiring that they absorb all the other ways
>>> that Spark breaks compatibility in the move to 2.0 makes it much more
>>> difficult for them to make this transition.
>>>
>>> Using the same set of APIs also means that it will be easier to backport
>>> critical fixes to the 1.x line.
>>>
>>> It's not clear to me that avoiding breakage of an experimental API in the
>>> 1.x line outweighs these issues.
>>>
>>> -Sandy
>>>
>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin 
>>> wrote:

 I actually think the next one (after 1.6) should be Spark 2.0. The
 reason is that I already know we have to break some part of the
 DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
 should return Dataset rather than RDD). In that case, I'd rather break this
 sooner (in one release) than later (in two releases). so the damage is
 smaller.

 I don't think whether we call Dataset/DataFrame experimental or not
 matters too much for 2.0. We can still call Dataset experimental in 2.0 and
 then mark them as stable in 2.1. Despite being "experimental", there has
 been no breaking changes to DataFrame from 1.3 to 1.6.



 On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
 wrote:
>
> Ah, got it; by "stabilize" you meant changing the API, not just bug
> fixing.  We're on the same page now.
>
> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 

Re: A proposal for Spark 2.0

2015-11-26 Thread Steve Loughran

> On 25 Nov 2015, at 08:54, Sandy Ryza  wrote:
> 
> I see.  My concern is / was that cluster operators will be reluctant to 
> upgrade to 2.0, meaning that developers using those clusters need to stay on 
> 1.x, and, if they want to move to DataFrames, essentially need to port their 
> app twice.
> 
> I misunderstood and thought part of the proposal was to drop support for 2.10 
> though.  If your broad point is that there aren't changes in 2.0 that will 
> make it less palatable to cluster administrators than releases in the 1.x 
> line, then yes, 2.0 as the next release sounds fine to me.
> 
> -Sandy
> 

Mixing Spark versions in a YARN cluster with compatible Hadoop native libs isn't 
so hard: users just deploy them separately. 

But: 

- mixing Scala versions is going to be tricky unless the jobs people submit are 
configured with the different paths
- the history server will need to be of the latest Spark version being 
executed in the cluster




[no subject]

2015-11-26 Thread Dmitry Tolpeko



question about combining small parquet files

2015-11-26 Thread Nezih Yigitbasi
Hi Spark people,
I have a Hive table that has a lot of small parquet files and I am
creating a data frame out of it to do some processing, but since I have a
large number of splits/files my job creates a lot of tasks, which I don't
want. Basically what I want is the same functionality that Hive provides,
that is, to combine these small input splits into larger ones by specifying
a max split size setting. Is this currently possible with Spark?

I looked at coalesce(), but with coalesce I can only control the number
of output files, not their sizes. And since the total input dataset size
can vary significantly in my case, I cannot just use a fixed partition
count, as the size of each output file can get very large. I then looked into
getting the total input size from an RDD to come up with some heuristic to
set the partition count, but I couldn't find any way to do it (without
modifying the Spark source).
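
One possible heuristic, sketched below under the assumption that the table lives
at a single known HDFS path (the path and the 128 MB target are illustrative), is
to measure the input size with the Hadoop FileSystem API and derive a partition
count from it:

  import org.apache.hadoop.fs.{FileSystem, Path}

  val inputPath = "/user/hive/warehouse/my_table"    // hypothetical table location
  val targetBytesPerPartition = 128L * 1024 * 1024   // illustrative 128 MB target

  // total bytes under the table directory, via the Hadoop FileSystem API
  val totalBytes = FileSystem.get(sc.hadoopConfiguration)
    .getContentSummary(new Path(inputPath)).getLength

  val numPartitions = math.max(1, (totalBytes / targetBytesPerPartition).toInt)
  val df = sqlContext.read.parquet(inputPath).coalesce(numPartitions)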

Any help is appreciated.

Thanks,
Nezih

PS: this email is the same as my previous email as I learned that my
previous email ended up as spam for many people since I sent it through
nabble, sorry for the double post.


Re: Using spark MLlib without installing Spark

2015-11-26 Thread Debasish Das
Decoupling mllib and core is difficult...it is not intended to run spark
core 1.5 with a spark mllib 1.6 snapshot...core is more stabilized, while new
algorithms keep getting added to mllib, and sometimes you might be tempted to do
that, but it's not recommended.
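
As a rough illustration of the embedded, local-mode usage suggested below (a
minimal sketch; the k-means data and parameters are made up), a web app would only
need spark-core and spark-mllib on its classpath:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  // local[*] runs Spark inside the current JVM; no cluster installation is needed
  val sc = new SparkContext(
    new SparkConf().setAppName("embedded-mllib").setMaster("local[*]"))

  val points = sc.parallelize(Seq(
    Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
    Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))

  val model = KMeans.train(points, k = 2, maxIterations = 10)
  println(model.clusterCenters.mkString(", "))

  sc.stop()
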
On Nov 21, 2015 8:04 PM, "Reynold Xin"  wrote:

> You can use MLlib and Spark directly without "installing anything". Just
> run Spark in local mode.
>
>
> On Sat, Nov 21, 2015 at 4:05 PM, Rad Gruchalski 
> wrote:
>
>> Bowen,
>>
>> What Andy is doing in the notebook is a slightly different thing. He’s
>> using sbt to bring all spark jars (core, mllib, repl, what have you). You
>> could use maven for that. He then creates a repl and submits all the spark
>> code into it.
>> Pretty sure spark unit tests cover similar use cases. Maybe not mllib
>> per se but this kind of submission.
>>
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Sunday, 22 November 2015 at 01:01, bowen zhang wrote:
>>
>> Thanks Rad for info. I looked into the repo and see some .snb file using
>> spark mllib. Can you give me a more specific place to look for when
>> invoking the mllib functions? What if I just want to invoke some of the ML
>> functions in my HelloWorld.java?
>>
>> --
>> *From:* Rad Gruchalski 
>> *To:* bowen zhang 
>> *Cc:* "dev@spark.apache.org" 
>> *Sent:* Saturday, November 21, 2015 3:43 PM
>> *Subject:* Re: Using spark MLlib without installing Spark
>>
>> Bowen,
>>
>> One project to look at could be spark-notebook:
>> https://github.com/andypetrella/spark-notebook
>> It uses Spark in the way you intend to use it.
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality:*This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>>
>> On Sunday, 22 November 2015 at 00:38, bowen zhang wrote:
>>
>> Hi folks,
>> I am a big fan of Spark's Mllib package. I have a java web app where I
>> want to run some ml jobs inside the web app. My question is: is there a way
>> to just import spark-core and spark-mllib jars to invoke my ML jobs without
>> installing the entire Spark package? All the tutorials related to Spark seem
>> to indicate installing Spark is a pre-condition for this.
>>
>> Thanks,
>> Bowen
>>
>>
>>
>>
>>
>>
>


NettyRpcEnv advertisedPort

2015-11-26 Thread Rad Gruchalski
Dear all,  

I am currently looking at modifying NettyRpcEnv for this PR: 
https://github.com/apache/spark/pull/9608
The functionality which I’m trying to achieve is the following: if there is a 
configuration property spark.driver.advertisedPort, make executors reply to 
advertisedPort instead of spark.driver.port. This would enable NettyRpcEnv to work 
correctly on Mesos with Docker bridge networking (something that is working 
spot on for AkkaRpcEnv).

I’ve spent some time looking at the new NettyRpcEnv and I think I know what is 
happening, but I am seeking confirmation.

The master RpcEndpointAddress appears to be shipped to the executor as part of 
a serialized message, when the executors are requested, inside of the 
NettyRpcEndpointRef. In order to make my PR work, I need to change the 
RpcEndpointAddress shipped to the executors on the master.
I think the best candidate is:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L125

Am I correct that this value becomes the RpcEndpointAddress? If so, my naive 
implementation would be something like this:

if (server != null) RpcAddress(host, conf.getInt("spark.driver.advertisedPort", 
server.getPort())) else null
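
In context, the change described here would look roughly like the sketch below (an
assumption about the surrounding definition around NettyRpcEnv.scala#L125, not the
actual patch):

  // Hypothetical sketch of NettyRpcEnv.address with an advertised-port override.
  // spark.driver.advertisedPort is the proposed new property; it falls back to the
  // locally bound port when it is not set.
  override lazy val address: RpcAddress = {
    if (server != null) {
      RpcAddress(host, conf.getInt("spark.driver.advertisedPort", server.getPort()))
    } else {
      null
    }
  }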

However, I am not sure what impact such a change would have; I am about to test my 
changes as soon as the code from the master branch builds.

If that is not the correct place to make such a change, what would be the best 
place to investigate? Is there an overview of NettyRpcEnv available anywhere?

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality:
This communication is intended for the above-named person and may be 
confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must 
you copy or show it to anyone; please delete/destroy and inform the sender 
immediately.




SparkR read.df Option type doesn't match

2015-11-26 Thread liushiqi9
I am trying to write a third-party data source plugin for Spark.
It works perfectly fine in Scala, but in R I failed, because I need to pass the
options, which are a Map[String, String] in Scala, and nothing I tried in R works.
I tried using a named list in R, but the plugin cannot get the values, since I use
get on the map to look them up in my plugin.

In Scala:
option = Map("a" -> "b", "c" -> "d")

How should I construct the options in R to make this work?







--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-read-df-Option-type-doesn-t-match-tp15365.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Grid search with Random Forest

2015-11-26 Thread Ndjido Ardo Bar

Hi folks,

Does anyone know whether the Grid Search capability has been enabled since issue 
SPARK-9011 in version 1.4.0? I'm getting the "rawPredictionCol column doesn't 
exist" error when trying to perform a grid search with Spark 1.4.0.
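
For reference, here is a minimal sketch of the kind of grid search in question
using the spark.ml API (the column names, grid values, and the training DataFrame
are illustrative, not taken from the original report):

  import org.apache.spark.ml.classification.RandomForestClassifier
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  val rf = new RandomForestClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")

  // grid over a couple of illustrative hyper-parameters
  val grid = new ParamGridBuilder()
    .addGrid(rf.numTrees, Array(20, 50))
    .addGrid(rf.maxDepth, Array(5, 10))
    .build()

  val cv = new CrossValidator()
    .setEstimator(rf)
    .setEvaluator(new BinaryClassificationEvaluator())  // reads rawPredictionCol by default
    .setEstimatorParamMaps(grid)
    .setNumFolds(3)

  // val model = cv.fit(training)  // training: a hypothetical DataFrame of (label, features)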

Cheers,
Ardo 







Re: A proposal for Spark 2.0

2015-11-26 Thread Reynold Xin
I don't think there are any plans for Scala 2.12 support yet. We can always
add Scala 2.12 support later.


On Thu, Nov 26, 2015 at 12:59 PM, Koert Kuipers  wrote:

> I also thought the idea was to drop 2.10. Do we want to cross build for 3
> scala versions?
> On Nov 25, 2015 3:54 AM, "Sandy Ryza"  wrote:
>
>> I see.  My concern is / was that cluster operators will be reluctant to
>> upgrade to 2.0, meaning that developers using those clusters need to stay
>> on 1.x, and, if they want to move to DataFrames, essentially need to port
>> their app twice.
>>
>> I misunderstood and thought part of the proposal was to drop support for
>> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
>> will make it less palatable to cluster administrators than releases in the
>> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>>
>> -Sandy
>>
>>
>> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia 
>> wrote:
>>
>>> What are the other breaking changes in 2.0 though? Note that we're not
>>> removing Scala 2.10, we're just making the default build be against Scala
>>> 2.11 instead of 2.10. There seem to be very few changes that people would
>>> worry about. If people are going to update their apps, I think it's better
>>> to make the other small changes in 2.0 at the same time than to update once
>>> for Dataset and another time for 2.0.
>>>
>>> BTW just refer to Reynold's original post for the other proposed API
>>> changes.
>>>
>>> Matei
>>>
>>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza 
>>> wrote:
>>>
>>> I think that Kostas' logic still holds.  The majority of Spark users,
>>> and likely an even vaster majority of people running vaster jobs, are still
>>> on RDDs and on the cusp of upgrading to DataFrames.  Users will probably
>>> want to upgrade to the stable version of the Dataset / DataFrame API so
>>> they don't need to do so twice.  Requiring that they absorb all the other
>>> ways that Spark breaks compatibility in the move to 2.0 makes it much more
>>> difficult for them to make this transition.
>>>
>>> Using the same set of APIs also means that it will be easier to backport
>>> critical fixes to the 1.x line.
>>>
>>> It's not clear to me that avoiding breakage of an experimental API in
>>> the 1.x line outweighs these issues.
>>>
>>> -Sandy
>>>
>>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin 
>>> wrote:
>>>
 I actually think the next one (after 1.6) should be Spark 2.0. The
 reason is that I already know we have to break some part of the
 DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
 should return Dataset rather than RDD). In that case, I'd rather break this
 sooner (in one release) than later (in two releases). so the damage is
 smaller.

 I don't think whether we call Dataset/DataFrame experimental or not
 matters too much for 2.0. We can still call Dataset experimental in 2.0 and
 then mark them as stable in 2.1. Despite being "experimental", there has
 been no breaking changes to DataFrame from 1.3 to 1.6.



 On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
 wrote:

> Ah, got it; by "stabilize" you meant changing the API, not just bug
> fixing.  We're on the same page now.
>
> On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
> wrote:
>
>> A 1.6.x release will only fix bugs - we typically don't change APIs
>> in z releases. The Dataset API is experimental and so we might be 
>> changing
>> the APIs before we declare it stable. This is why I think it is important
>> to first stabilize the Dataset API with a Spark 1.7 release before moving
>> to Spark 2.0. This will benefit users that would like to use the new
>> Dataset APIs but can't move to Spark 2.0 because of the backwards
>> incompatible changes, like removal of deprecated APIs, Scala 2.11 etc.
>>
>> Kostas
>>
>>
>> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>>
>>> Why does stabilization of those two features require a 1.7 release
>>> instead of 1.6.1?
>>>
>>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <
>>> kos...@cloudera.com> wrote:
>>>
 We have veered off the topic of Spark 2.0 a little bit here - yes
 we can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd
 like to propose we have one more 1.x release after Spark 1.6. This will
 allow us to stabilize a few of the new features that were added in 1.6:

 1) the experimental Datasets API
 2) the new unified memory manager.

 I understand our goal for Spark 2.0 is to offer an easy transition
 but there will be users that won't be able to seamlessly upgrade given 
 

Re: NettyRpcEnv advertisedPort

2015-11-26 Thread Shixiong Zhu
I think you are right. The executor gets the driver port from
"RpcEnv.address".

Best Regards,
Shixiong Zhu

2015-11-26 11:45 GMT-08:00 Rad Gruchalski :

> Dear all,
>
> I am currently looking at modifying NettyRpcEnv for this PR:
> https://github.com/apache/spark/pull/9608
> The functionality which I’m trying to achieve is the following: if there
> is a configuration property spark.driver.advertisedPort, make executors
> reply to advertisedPort instead of spark.driver.port. This would enable
> NettyRpcEnv work correctly on Mesos with Docker bridge networking
> (something what is working spot on for AkkaRcpEnv).
>
> I’ve spent some time looking at the new NettyRpcEnv and I think I know
> what is happening but seeking for confirmation.
>
> The master RpcEndpointAddress appears to be shipped to the executor as
> part of a serialized message, when the executors are requested, inside of
> the NettyRpcEndpointRef. In order to make my PR work, I need to change the
> RcpEndpointAddress shipped to the executors on the master.
> I think the best candidate is:
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L125
>
> Am I correct that this value becomes the RcpEndpointAddress? If so, my
> naive implementation would be like this:
>
> if (server != null) RpcAddress(host,
> conf.getInt(“spark.driver.advertisedPort”, server.getPort())) else null
>
> however, I am not sure what impact such change would have, about to test
> my changes as soon as the code from master branch builds.
>
> If that is not the correct place to make such change, what would be the
> best place to investigate? Is there an overview of NettyRcpEnv available
> anywhere?
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com 
> de.linkedin.com/in/radgruchalski/
>
>
> *Confidentiality:*This communication is intended for the above-named
> person and may be confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor
> must you copy or show it to anyone; please delete/destroy and inform the
> sender immediately.
>


Re: A proposal for Spark 2.0

2015-11-26 Thread Koert Kuipers
I also thought the idea was to drop 2.10. Do we want to cross build for 3
scala versions?
On Nov 25, 2015 3:54 AM, "Sandy Ryza"  wrote:

> I see.  My concern is / was that cluster operators will be reluctant to
> upgrade to 2.0, meaning that developers using those clusters need to stay
> on 1.x, and, if they want to move to DataFrames, essentially need to port
> their app twice.
>
> I misunderstood and thought part of the proposal was to drop support for
> 2.10 though.  If your broad point is that there aren't changes in 2.0 that
> will make it less palatable to cluster administrators than releases in the
> 1.x line, then yes, 2.0 as the next release sounds fine to me.
>
> -Sandy
>
>
> On Tue, Nov 24, 2015 at 11:55 AM, Matei Zaharia 
> wrote:
>
>> What are the other breaking changes in 2.0 though? Note that we're not
>> removing Scala 2.10, we're just making the default build be against Scala
>> 2.11 instead of 2.10. There seem to be very few changes that people would
>> worry about. If people are going to update their apps, I think it's better
>> to make the other small changes in 2.0 at the same time than to update once
>> for Dataset and another time for 2.0.
>>
>> BTW just refer to Reynold's original post for the other proposed API
>> changes.
>>
>> Matei
>>
>> On Nov 24, 2015, at 12:27 PM, Sandy Ryza  wrote:
>>
>> I think that Kostas' logic still holds.  The majority of Spark users, and
>> likely an even vaster majority of people running vaster jobs, are still on
>> RDDs and on the cusp of upgrading to DataFrames.  Users will probably want
>> to upgrade to the stable version of the Dataset / DataFrame API so they
>> don't need to do so twice.  Requiring that they absorb all the other ways
>> that Spark breaks compatibility in the move to 2.0 makes it much more
>> difficult for them to make this transition.
>>
>> Using the same set of APIs also means that it will be easier to backport
>> critical fixes to the 1.x line.
>>
>> It's not clear to me that avoiding breakage of an experimental API in the
>> 1.x line outweighs these issues.
>>
>> -Sandy
>>
>> On Mon, Nov 23, 2015 at 10:51 PM, Reynold Xin 
>> wrote:
>>
>>> I actually think the next one (after 1.6) should be Spark 2.0. The
>>> reason is that I already know we have to break some part of the
>>> DataFrame/Dataset API as part of the Dataset design. (e.g. DataFrame.map
>>> should return Dataset rather than RDD). In that case, I'd rather break this
>>> sooner (in one release) than later (in two releases). so the damage is
>>> smaller.
>>>
>>> I don't think whether we call Dataset/DataFrame experimental or not
>>> matters too much for 2.0. We can still call Dataset experimental in 2.0 and
>>> then mark them as stable in 2.1. Despite being "experimental", there has
>>> been no breaking changes to DataFrame from 1.3 to 1.6.
>>>
>>>
>>>
>>> On Wed, Nov 18, 2015 at 3:43 PM, Mark Hamstra 
>>> wrote:
>>>
 Ah, got it; by "stabilize" you meant changing the API, not just bug
 fixing.  We're on the same page now.

 On Wed, Nov 18, 2015 at 3:39 PM, Kostas Sakellis 
 wrote:

> A 1.6.x release will only fix bugs - we typically don't change APIs in
> z releases. The Dataset API is experimental and so we might be changing 
> the
> APIs before we declare it stable. This is why I think it is important to
> first stabilize the Dataset API with a Spark 1.7 release before moving to
> Spark 2.0. This will benefit users that would like to use the new Dataset
> APIs but can't move to Spark 2.0 because of the backwards incompatible
> changes, like removal of deprecated APIs, Scala 2.11 etc.
>
> Kostas
>
>
> On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <
> m...@clearstorydata.com> wrote:
>
>> Why does stabilization of those two features require a 1.7 release
>> instead of 1.6.1?
>>
>> On Fri, Nov 13, 2015 at 11:40 AM, Kostas Sakellis <
>> kos...@cloudera.com> wrote:
>>
>>> We have veered off the topic of Spark 2.0 a little bit here - yes we
>>> can talk about RDD vs. DS/DF more but lets refocus on Spark 2.0. I'd 
>>> like
>>> to propose we have one more 1.x release after Spark 1.6. This will 
>>> allow us
>>> to stabilize a few of the new features that were added in 1.6:
>>>
>>> 1) the experimental Datasets API
>>> 2) the new unified memory manager.
>>>
>>> I understand our goal for Spark 2.0 is to offer an easy transition
>>> but there will be users that won't be able to seamlessly upgrade given 
>>> what
>>> we have discussed as in scope for 2.0. For these users, having a 1.x
>>> release with these new features/APIs stabilized will be very beneficial.
>>> This might make Spark 1.7 a lighter release but that is not necessarily 
>>> a
>>> bad thing.
>>>

Re: SparkR read.df Option type doesn't match

2015-11-26 Thread liushiqi9
I found the answer myself.
Options should be added like this:
read.df(sqlContext, path=NULL, source="***", option1="", option2="")
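
On the Scala side of a data source plugin, those extra named arguments arrive in
the options map passed to the provider. A rough sketch (the class name and option
names are illustrative, not from the plugin discussed here):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider}

  class DefaultSource extends RelationProvider {
    // parameters holds the options passed from read.df / DataFrameReader.options
    override def createRelation(
        sqlContext: SQLContext,
        parameters: Map[String, String]): BaseRelation = {
      val option1 = parameters.getOrElse("option1", "")  // illustrative option lookup
      ???  // build and return the BaseRelation using option1 etc.
    }
  }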





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-read-df-Option-type-doesn-t-match-tp15365p15370.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: NettyRpcEnv advertisedPort

2015-11-26 Thread Rad Gruchalski
I did test my change: 
https://github.com/radekg/spark/commit/b21aae1468169ee0a388d33ba6cebdb17b895956#diff-0c89b4a60c30a7cd2224bb64d93da942R125
  
It seems to do what I want it to do; however, I am not quite sure about the overall 
impact.

I'd appreciate it if someone who knows the NettyRpcEnv details could point me to 
some documentation.

Kind regards,

Radek Gruchalski

ra...@gruchalski.com
de.linkedin.com/in/radgruchalski/

Confidentiality:
This communication is intended for the above-named person and may be 
confidential and/or legally privileged.
If it has come to you in error you must take no action based on it, nor must 
you copy or show it to anyone; please delete/destroy and inform the sender 
immediately.



On Thursday, 26 November 2015 at 20:45, Rad Gruchalski wrote:

> Dear all,  
>  
> I am currently looking at modifying NettyRpcEnv for this PR: 
> https://github.com/apache/spark/pull/9608
> The functionality which I’m trying to achieve is the following: if there is a 
> configuration property spark.driver.advertisedPort, make executors reply to 
> advertisedPort instead of spark.driver.port. This would enable NettyRpcEnv 
> work correctly on Mesos with Docker bridge networking (something what is 
> working spot on for AkkaRcpEnv).
>  
> I’ve spent some time looking at the new NettyRpcEnv and I think I know what 
> is happening but seeking for confirmation.
>  
> The master RpcEndpointAddress appears to be shipped to the executor as part 
> of a serialized message, when the executors are requested, inside of the 
> NettyRpcEndpointRef. In order to make my PR work, I need to change the 
> RcpEndpointAddress shipped to the executors on the master.
> I think the best candidate is:
>  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala#L125
>  
> Am I correct that this value becomes the RcpEndpointAddress? If so, my naive 
> implementation would be like this:
>  
> if (server != null) RpcAddress(host, 
> conf.getInt(“spark.driver.advertisedPort”, server.getPort())) else null
>  
> however, I am not sure what impact such change would have, about to test my 
> changes as soon as the code from master branch builds.
>  
> If that is not the correct place to make such change, what would be the best 
> place to investigate? Is there an overview of NettyRcpEnv available anywhere?
>  
> Kind regards,

> Radek Gruchalski
> 
> ra...@gruchalski.com
> de.linkedin.com/in/radgruchalski/
>  
> Confidentiality:
> This communication is intended for the above-named person and may be 
> confidential and/or legally privileged.
> If it has come to you in error you must take no action based on it, nor must 
> you copy or show it to anyone; please delete/destroy and inform the sender 
> immediately.
>  
>  
>  
>  




Re: tests blocked at "don't call ssc.stop in listener"

2015-11-26 Thread Saisai Shao
Might be related to this JIRA (
https://issues.apache.org/jira/browse/SPARK-11761), not very sure about it.

On Fri, Nov 27, 2015 at 10:22 AM, Nan Zhu  wrote:

> Hi, all
>
> Anyone noticed that some of the tests just blocked at the test case “don't
> call ssc.stop in listener” in StreamingListenerSuite?
>
> Examples:
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46766/console
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46776/console
>
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46774/console
>
>
> I originally found it in my own PR, and I thought it is a bug introduced
> by me….but later I found that the tests for the PRs on different things
> also blocked at the same point …
>
> Just filed a JIRA https://issues.apache.org/jira/browse/SPARK-12021
>
>
> Best,
>
> --
> Nan Zhu
> http://codingcat.me
>
>


streamingContext stop gracefully failed in yarn-cluster mode

2015-11-26 Thread qinggangwa...@gmail.com






Hi all,

I try to stop the streamingContext gracefully in yarn-cluster mode, but it seems 
that the job is stopped and started again when I use ssc.stop(true, true), while 
the job is stopped when I use ssc.stop(true). Does this mean that the 
streamingContext cannot be stopped gracefully in yarn-cluster mode? My Spark 
version is 1.4.1. In addition, the streamingContext can be stopped gracefully in 
local mode.
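
For reference, the two overloads being compared (parameter names roughly as in the 
StreamingContext API):

  ssc.stop(stopSparkContext = true, stopGracefully = true)  // graceful: wait for received data to be processed
  ssc.stop(stopSparkContext = true)                         // stop immediately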


qinggangwa...@gmail.com



tests blocked at "don't call ssc.stop in listener"

2015-11-26 Thread Nan Zhu
Hi, all

Anyone noticed that some of the tests just blocked at the test case “don't call 
ssc.stop in listener” in StreamingListenerSuite?

Examples:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46766/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46776/console


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46774/console


I originally found it in my own PR, and I thought it is a bug introduced by 
me….but later I found that the tests for the PRs on different things also 
blocked at the same point …

Just filed a JIRA https://issues.apache.org/jira/browse/SPARK-12021


Best,  

--  
Nan Zhu
http://codingcat.me



Re: tests blocked at "don't call ssc.stop in listener"

2015-11-26 Thread Shixiong Zhu
Just found a potential dead-lock in this test. Will send a PR to fix it
soon.

Best Regards,
Shixiong Zhu

2015-11-26 18:55 GMT-08:00 Saisai Shao :

> Might be related to this JIRA (
> https://issues.apache.org/jira/browse/SPARK-11761), not very sure about
> it.
>
> On Fri, Nov 27, 2015 at 10:22 AM, Nan Zhu  wrote:
>
>> Hi, all
>>
>> Anyone noticed that some of the tests just blocked at the test case “don't
>> call ssc.stop in listener” in StreamingListenerSuite?
>>
>> Examples:
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46766/console
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46776/console
>>
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46774/console
>>
>>
>> I originally found it in my own PR, and I thought it is a bug introduced
>> by me….but later I found that the tests for the PRs on different things
>> also blocked at the same point …
>>
>> Just filed a JIRA https://issues.apache.org/jira/browse/SPARK-12021
>>
>>
>> Best,
>>
>> --
>> Nan Zhu
>> http://codingcat.me
>>
>>
>