Re: Automated close of PR's ?

2015-12-31 Thread Mridul Muralidharan
On the contrary, PRs are actually meant to be long-lived, serving as a
reference for the discussion around review and changes.
Particularly so for Spark, since JIRA and Review Board are not used
for code review.
Note that they are used not only in Spark, but also by other organizations
to track contributions (as in our case).

If you look at Reynold's response, he has clarified that they were closed
by him and not via user request; he probably missed the recent activity
on some of the PRs when closing in bulk.
I would have preferred pinging the PR contributors to close them, and
only closing them ourselves if they remained inactive after "some" time
(and definitely not while folks are off on vacation).

Regards,
Mridul



On Thu, Dec 31, 2015 at 12:02 AM, Sean Owen  wrote:
> There's a script that can be run manually which closes PRs that have
> been 'requested' to be closed. I'm not sure of the exact words it
> looks for but "Do you mind closing this PR?" seems to work. However it
> does seem to mean that PRs will occasionally get closed as a false
> positive, so maybe that happened here.
>
> You can use your judgment about whether to reopen, but I tend to think
> PRs are not meant to be long-lived. They don't go away even when
> closed, so can always stand as a record of a proposed change or be
> reopened. But there shouldn't be such a thing as a PR open for months.
> (In practice, you can see that a huge number of dead, stale PRs are left
> open by people out there anyway.)
>
> On Thu, Dec 31, 2015 at 3:25 AM, Mridul Muralidharan  wrote:
>> I am not sure about others, but I had a PR closed from under me where
>> the ongoing discussion was as recent as two weeks ago.
>> Given this, I assumed it was an automated close and not a manual one!
>>
>> When the change was opened is not a good metric of the viability of the
>> change (particularly when it touches code that is rarely modified, and
>> so will only merge after multiple releases).
>>
>> Regards,
>> Mridul
>>
>> On Wed, Dec 30, 2015 at 7:12 PM, Reynold Xin  wrote:
>>> No, there is not. I actually manually closed them to cut down the number of
>>> open pull requests. Feel free to reopen individual ones.
>>>
>>>
>>> On Wednesday, December 30, 2015, Mridul Muralidharan 
>>> wrote:

 Is there a script running to close "old" PRs? I was not aware of any
 discussion about this on the dev list.

 - Mridul




Re: Automated close of PR's ?

2015-12-31 Thread Sean Owen
Yes, I mean "open" for a long time, but I do mean that PRs aren't intended
to stay open for long periods. Of course, they actually stick around
forever on GitHub.

I think Reynold did manually close yours, but I was noting for the
record that there's also an automated process that does this in
response to a request. That has also surprised people in the past.
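
For context, that trigger is essentially a phrase match against a PR's
comments, as described earlier in the thread. Below is a minimal, purely
illustrative sketch of that idea in Scala; it is not the actual Spark
tooling, and the trigger phrases, object, and method names are assumptions:

object CloseRequestScan {
  // Hypothetical trigger phrases; the real script's exact wording may differ.
  private val triggers = Seq(
    "do you mind closing this pr",
    "can you close this pr"
  )

  // True if any comment on the PR contains one of the trigger phrases.
  def requestedToClose(comments: Seq[String]): Boolean =
    comments.exists { c =>
      val lc = c.toLowerCase
      triggers.exists(t => lc.contains(t))
    }

  def main(args: Array[String]): Unit = {
    val comments = Seq(
      "LGTM, but this overlaps with another patch.",
      "Do you mind closing this PR? The other change was merged instead."
    )
    println(requestedToClose(comments)) // prints "true"
  }
}

A scan like this would occasionally misfire on a quoted or hypothetical
mention of the phrase, which is the false-positive case mentioned above.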

Generally, we have far more of a problem with people abandoning or failing
to follow through on PRs, or simply proposing things that aren't going
to be merged. I agree in general with reflecting reality by closing lots
more JIRAs and PRs, mostly because these are not permanent operations at
all, and the intent is that in the occasional case where the owner
disagrees, it can simply be reopened. This also serves as a reminder that
we need to drive all of these things to a conclusion.

On Thu, Dec 31, 2015 at 8:59 AM, Mridul Muralidharan  wrote:
> On the contrary, PR's are actually meant to be long lived - and a
> reference of discussion about review and changes.
> Particularly so for spark since JIRA's and review board are not used
> for code review.
> Note - they are not used only in spark, but by other organization's to
> track contributions (like in our case).
>
> If you look at Reynold's response, he has clarified they were closed
> by him and not via user request - he probably missed out on activity
> on some of the PR's when closing in bulk.
> I would have preferred pinging the PR contributors to close, and
> subsequently doing so if inactive after "some" time (and definitely
> not when folks are off on vacations).
>
> Regards,
> Mridul
>
>
>
> On Thu, Dec 31, 2015 at 12:02 AM, Sean Owen  wrote:
>> There's a script that can be run manually which closes PRs that have
>> been 'requested' to be closed. I'm not sure of the exact words it
>> looks for but "Do you mind closing this PR?" seems to work. However it
>> does seem to mean that PRs will occasionally get closed as a false
>> positive, so maybe that happened here.
>>
>> You can use your judgment about whether to reopen, but I tend to think
>> PRs are not meant to be long-lived. They don't go away even when
>> closed, so can always stand as a record of a proposed change or be
>> reopened. But there shouldn't be such a thing as a PR open for months.
>> (In practice, you can see a huge number of dead, stale PRs are left
>> open by people out there anyway)
>>
>> On Thu, Dec 31, 2015 at 3:25 AM, Mridul Muralidharan  
>> wrote:
>>> I am not sure of others, but I had a PR close from under me where
>>> ongoing discussion was as late as 2 weeks back.
>>> Given this, I assumed it was automated close and not manual !
>>>
>>> When the change was opened is not a good metric about viability of the
>>> change (particularly when it touches code which is rarely modified;
>>> and so will merge after multiple releases).
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Wed, Dec 30, 2015 at 7:12 PM, Reynold Xin  wrote:
 No there is not. I actually manually closed them to cut down the number of
 open pull requests. Feel free to reopen individual ones.


 On Wednesday, December 30, 2015, Mridul Muralidharan 
 wrote:
>
> Is there a script running to close "old" PR's ? I was not aware of any
> discussion about this in dev list.
>
> - Mridul
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Automated close of PR's ?

2015-12-31 Thread Reynold Xin
Hi Mridul,

Thanks for the message. Everything you said makes sense to me.

It can definitely be frustrating when one of your pull requests gets
accidentally closed, and that's why we often ping the original authors to
close them. However, this doesn't always work because the original authors
might be inactive on some old pull requests, and sorting these out among
400+ open pull requests can be pretty tough.

Regardless of whether pull requests should be long-lived or short-lived,
closing a pull request does not wipe any of its content. All the changes
and their associated reviews are kept there, and it is trivial to re-open
(one click of a button).

I also agree with your point that the date a PR was opened is not the best
metric to look at here. Inactivity over a certain period is a much better
metric to use in the future.


On Thu, Dec 31, 2015 at 12:59 AM, Mridul Muralidharan 
wrote:

> On the contrary, PR's are actually meant to be long lived - and a
> reference of discussion about review and changes.
> Particularly so for spark since JIRA's and review board are not used
> for code review.
> Note - they are not used only in spark, but by other organization's to
> track contributions (like in our case).
>
> If you look at Reynold's response, he has clarified they were closed
> by him and not via user request - he probably missed out on activity
> on some of the PR's when closing in bulk.
> I would have preferred pinging the PR contributors to close, and
> subsequently doing so if inactive after "some" time (and definitely
> not when folks are off on vacations).
>
> Regards,
> Mridul
>
>
>
> On Thu, Dec 31, 2015 at 12:02 AM, Sean Owen  wrote:
> > There's a script that can be run manually which closes PRs that have
> > been 'requested' to be closed. I'm not sure of the exact words it
> > looks for but "Do you mind closing this PR?" seems to work. However it
> > does seem to mean that PRs will occasionally get closed as a false
> > positive, so maybe that happened here.
> >
> > You can use your judgment about whether to reopen, but I tend to think
> > PRs are not meant to be long-lived. They don't go away even when
> > closed, so can always stand as a record of a proposed change or be
> > reopened. But there shouldn't be such a thing as a PR open for months.
> > (In practice, you can see a huge number of dead, stale PRs are left
> > open by people out there anyway)
> >
> > On Thu, Dec 31, 2015 at 3:25 AM, Mridul Muralidharan 
> wrote:
> >> I am not sure of others, but I had a PR close from under me where
> >> ongoing discussion was as late as 2 weeks back.
> >> Given this, I assumed it was automated close and not manual !
> >>
> >> When the change was opened is not a good metric about viability of the
> >> change (particularly when it touches code which is rarely modified;
> >> and so will merge after multiple releases).
> >>
> >> Regards,
> >> Mridul
> >>
> >> On Wed, Dec 30, 2015 at 7:12 PM, Reynold Xin 
> wrote:
> >>> No there is not. I actually manually closed them to cut down the
> number of
> >>> open pull requests. Feel free to reopen individual ones.
> >>>
> >>>
> >>> On Wednesday, December 30, 2015, Mridul Muralidharan  >
> >>> wrote:
> 
>  Is there a script running to close "old" PR's ? I was not aware of any
>  discussion about this in dev list.
> 
>  - Mridul
> 
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>  For additional commands, e-mail: dev-h...@spark.apache.org
> 
> >>>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
>


Re: problem with reading source code - pull out nondeterministic expressions

2015-12-31 Thread 汪洋
I get it, thanks!

> On December 31, 2015 at 3:00 AM, Michael Armbrust wrote:
> 
> The goal here is to ensure that the non-deterministic value is evaluated only 
> once, so the result won't change for a given row (i.e. when sorting).
> 
> On Tue, Dec 29, 2015 at 10:57 PM, 汪洋 wrote:
> Hi fellas,
> I am new to Spark and I have a newbie question. I am currently reading the
> source code of the Spark SQL Catalyst analyzer. I do not quite understand
> the partial function in PullOutNondeterministic. What does "pull out" mean
> here? Why do we have to do the "pulling out"?
> I would really appreciate it if somebody could explain it to me.
> Thanks.
> 
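
To make the explanation above concrete, here is a small sketch written
against the 1.6-era SQLContext API. It only illustrates what the rule
achieves, not the analyzer code itself: ordering by rand() is only well
defined if rand() is materialized once per row, so conceptually the
analyzer pulls it into an inner projection and sorts on that column. The
object name and local setup (local[2], column names) are just for the demo.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.rand

object PullOutNondeterministicDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("pull-out-demo"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("word", "num")

    // Conceptually, ORDER BY rand() is rewritten by the analyzer to something like:
    //   Project [word, num]
    //     Sort [_nondeterministic ASC]
    //       Project [word, num, rand() AS _nondeterministic]
    // so the nondeterministic value is evaluated exactly once per row and the
    // sort always compares the same, stable value for a given row.
    val shuffled = df.orderBy(rand())
    shuffled.explain(true) // the analyzed plan shows the injected projection
    shuffled.show()

    sc.stop()
  }
}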



Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState

2015-12-31 Thread Jan Uyttenhove
Thanks for your answers, and I'm sorry for the slight delay.

I was trying to narrow it down first, since I noticed very unpredictable
behaviour in reproducing it. In the end the unpredictability seemed related
to the format of the messages on Kafka, so I also came to suspect it had
something to do with serialization.

I was using the KryoSerializer, and I can confirm that it is in fact
related to it (no exception when using the JavaSerializer). And it's
unrelated to Kafka.
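
One plausible reason the serializer choice matters here (an assumption on
my part, not a confirmed diagnosis of this issue): Kryo's default field
serializer does not invoke a class's private writeObject/readObject hooks,
which is exactly where, as noted in Ted's reply quoted below, parentStateMap
gets reassigned. A tiny self-contained sketch of that difference, using a
hypothetical LinkedState class as a stand-in and assuming the Kryo library
that ships with Spark is on the classpath:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}

// Hypothetical stand-in for a class (like a StateMap) that re-links a field
// in its Java-serialization readObject hook.
class LinkedState extends Serializable {
  @transient var parent: String = null

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    parent = "re-linked by readObject"
  }
}

object SerializerHookDemo {
  def main(args: Array[String]): Unit = {
    // Java serialization round trip: the readObject hook runs, parent is set.
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(new LinkedState)
    oos.close()
    val viaJava = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[LinkedState]
    println(s"Java serializer -> parent = ${viaJava.parent}") // re-linked by readObject

    // Kryo round trip with its default field serializer: readObject is never
    // called, so the field comes back null.
    val kryo = new Kryo()
    val out = new Output(4096, -1)
    kryo.writeObject(out, new LinkedState)
    val viaKryo = kryo.readObject(new Input(out.toBytes), classOf[LinkedState])
    println(s"Kryo serializer -> parent = ${viaKryo.parent}") // null
  }
}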

Apparently the exception occurs when restoring state from previous
checkpoints while using Kryo serialization. When the job has checkpoints
from a previous run (containing state) and is started with new messages
already available on e.g. a Kafka topic (or via nc), the NPE occurs. And
this is, of course, a typical use case for Kafka with Spark streaming and
checkpointing.

As requested, I created issue SPARK-12591
(https://issues.apache.org/jira/browse/SPARK-12591). It contains the
procedure to reproduce the error with the test case, which is again a
modified version of the StatefulNetworkWordCount Spark streaming example
(https://gist.github.com/juyttenh/9b4a4103699a7d5f698f).
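
For reference, here is a stripped-down sketch of that scenario, modeled on
the StatefulNetworkWordCount example rather than copied from the gist; the
hostname, port, checkpoint path, and object name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateKryoRepro {
  val checkpointDir = "/tmp/mapwithstate-checkpoint" // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("MapWithStateKryoRepro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)

    // Input via a socket here to keep the sketch self-contained,
    // e.g. fed by `nc -lk 9999`; the original job read from Kafka.
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" ")).map(w => (w, 1))

    // Running count per word, kept in mapWithState state.
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }
    words.mapWithState(StateSpec.function(mappingFunc)).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // The first run builds checkpoints containing state; restarting after new
    // input has arrived while the job was down is where the NPE in
    // OpenHashMapBasedStateMap.get showed up.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}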

Best regards,
Jan

On Thu, Dec 31, 2015 at 2:32 AM, Ted Yu  wrote:

> I went through StateMap.scala a few times but didn't find any logic error
> yet.
>
> According to the call stack, the following was executed in get(key):
>
> } else {
>   parentStateMap.get(key)
> }
> This implies that parentStateMap was null.
> But it seems parentStateMap is properly assigned in readObject().
>
> Jan:
> Which serializer did you use?
>
> Thanks
>
> On Tue, Dec 29, 2015 at 3:42 AM, Jan Uyttenhove  wrote:
>
>> Hi guys,
>>
>> I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
>> mapWithState API, after previously reporting issue SPARK-11932 (
>> https://issues.apache.org/jira/browse/SPARK-11932).
>>
>> My Spark streaming job involves reading data from a Kafka topic (using
>> KafkaUtils.createDirectStream), stateful processing (using checkpointing
>> & mapWithState) & publishing the results back to Kafka.
>>
>> I'm now facing the NullPointerException below when restoring from a
>> checkpoint in the following scenario:
>> 1/ run job (with local[2]), process data from Kafka while creating &
>> keeping state
>> 2/ stop the job
>> 3/ generate some extra message on the input Kafka topic
>> 4/ start the job again (and restore offsets & state from the checkpoints)
>>
>> The problem is caused by (or at least related to) step 3, i.e. publishing
>> data to the input topic while the job is stopped.
>> The above scenario has been tested successfully when:
>> - step 3 is excluded, so restoring state from a checkpoint is successful
>> when no messages are added when the job is stopped
>> - after step 2, the checkpoints are deleted
>>
>> Any clues? Am I doing something wrong here, or is there still a problem
>> with the mapWithState impl?
>>
>> Thanx,
>>
>> Jan
>>
>>
>>
>> 15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage
>> 3.0 (TID 24)
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at
>> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at