Re: Enabling fully disaggregated shuffle on Spark

2019-12-04 Thread Saisai Shao
Hi Ben and Felix, I'm also interested in this. Would you please add me to
the invite? Thanks a lot.

Best regards,
Saisai

On Mon, Dec 2, 2019 at 11:34 PM Greg Lee  wrote:

> Hi Felix & Ben,
>
> This is Li Hao from Baidu, same team with Linhong.
>
> As mentioned in Linhong’s email, an independent disaggregated shuffle service
> is also our solution and an ongoing direction we are exploring to improve the
> stability of Hadoop MR and Spark in our production environment. We would
> love to hear about this topic from the community and to share our experience.
>
> Please add me to this event, thanks.
>
> Best Regards
> Li Hao
>
On Fri, Nov 29, 2019 at 5:09 PM Liu, Linhong  wrote:
>
>> Hi Felix & Ben,
>>
>> This is Linhong from Baidu based in Beijing, and we are internally using
>> a disaggregated shuffle service (we call it DCE) as well. We launched this
>> in production 3 years ago for Hadoop shuffle. Last year we migrated Spark
>> shuffle to the same DCE shuffle service, and stability improved a lot (we
>> can handle more than 100 TB of shuffle now).
>>
>> It would be nice if there were a Spark shuffle API supporting fully
>> disaggregated shuffle. My team and I would be very glad to share our
>> experience and help on this topic.
>>
>> So, if it's possible, please add me to this event.
>>
>>
>>
>> Thanks,
>>
>> Liu, Linhong
>>
>>
>>
>> *From: *Aniket Mokashi 
>> *Date: *Thursday, November 21, 2019 at 2:12 PM
>> *To: *Felix Cheung 
>> *Cc: *Ben Sidhom , John Zhuge <
>> jzh...@apache.org>, bo yang , Amogh Margoor <
>> amo...@qubole.com>, Ryan Blue , Spark Dev List <
>> dev@spark.apache.org>, Christopher Crosbie ,
>> Griselda Cuevas , Holden Karau ,
>> Mayank Ahuja , Kalyan Sivakumar ,
>> "alfo...@fb.com" , Felix Cheung , Matt
>> Cheah , "Yifei Huang (PD)" 
>> *Subject: *Re: Enabling fully disaggregated shuffle on Spark
>>
>>
>>
>> Felix - please add me to this event.
>>
>>
>>
>> Ben - should we move this proposal to a doc and open it up for
>> edits/comments?
>>
>>
>>
>> On Wed, Nov 20, 2019 at 5:37 PM Felix Cheung 
>> wrote:
>>
>> Great!
>>
>>
>>
>> Due to a number of constraints I won't be sending the link directly here, but
>> please reply to me and I will add you.
>>
>>
>>
>>
>> --
>>
>> *From:* Ben Sidhom 
>> *Sent:* Wednesday, November 20, 2019 9:10:01 AM
>> *To:* John Zhuge 
>> *Cc:* bo yang ; Amogh Margoor ;
>> Ryan Blue ; Ben Sidhom ;
>> Spark Dev List ; Christopher Crosbie <
>> crosb...@google.com>; Griselda Cuevas ; Holden Karau <
>> hol...@pigscanfly.ca>; Mayank Ahuja ; Kalyan
>> Sivakumar ; alfo...@fb.com ; Felix
>> Cheung ; Matt Cheah ; Yifei Huang
>> (PD) 
>> *Subject:* Re: Enabling fully disaggregated shuffle on Spark
>>
>>
>>
>> That sounds great!
>>
>>
>>
>> On Wed, Nov 20, 2019 at 9:02 AM John Zhuge  wrote:
>>
>> That will be great. Please send us the invite.
>>
>>
>>
>> On Wed, Nov 20, 2019 at 8:56 AM bo yang  wrote:
>>
>> Cool, thanks Ryan, John, and Amogh for the replies! Great to see you
>> interested! Felix will host a Spark Scalability & Reliability Sync
>> meeting on Dec 4 at 1pm PST. We could discuss more details there. Do you
>> want to join?
>>
>>
>>
>> On Tue, Nov 19, 2019 at 4:23 PM Amogh Margoor  wrote:
>>
>> We at Qubole are also looking at disaggregating shuffle on Spark. Would
>> love to collaborate and share learnings.
>>
>>
>>
>> Regards,
>>
>> Amogh
>>
>>
>>
>> On Tue, Nov 19, 2019 at 4:09 PM John Zhuge  wrote:
>>
>> Great work, Bo! Would love to hear the details.
>>
>>
>>
>>
>>
>> On Tue, Nov 19, 2019 at 4:05 PM Ryan Blue 
>> wrote:
>>
>> I'm interested in remote shuffle services as well. I'd love to hear about
>> what you're using in production!
>>
>>
>>
>> rb
>>
>>
>>
>> On Tue, Nov 19, 2019 at 2:43 PM bo yang  wrote:
>>
>> Hi Ben,
>>
>>
>>
>> Thanks for the write-up! This is Bo from Uber. I am on Felix's team in
>> Seattle, working on disaggregated shuffle (we call it remote shuffle
>> service, RSS, internally). We have had RSS in production for a while, and
>> learned a lot during the work (we tried quite a few techniques to improve
>> remote shuffle performance). We could share our learnings with the
>> community, and we would also like to hear feedback/suggestions on how to
>> further improve remote shuffle performance. We can discuss more details if
>> you or other people are interested.
>>
>>
>>
>> Best,
>>
>> Bo
>>
>>
>>
>> On Fri, Nov 15, 2019 at 4:10 PM Ben Sidhom 
>> wrote:
>>
>> I would like to start a conversation about extending the Spark shuffle
>> manager surface to support fully disaggregated shuffle implementations.
>> This is closely related to the work in SPARK-25299, which is focused on
>> refactoring the shuffle manager API (and in particular, SortShuffleManager)
>> to use a pluggable storage backend. The motivation for that SPIP is further
>> enabling Spark on Kubernetes.
>>
>>
>>
>> The motivation for this proposal is enabling fully externalized
>> (disaggregated) shuffle service implementations. (Facebook’s Cosco
>> shuffle
>> 

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Saisai Shao
Congratulations!

On Mon, Sep 9, 2019 at 6:11 PM Jungtaek Lim  wrote:

> Congratulations! Well deserved!
>
> On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:
>
>> Congratulations!
>>
>> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp  wrote:
>>
>>> congrats everyone!  :)
>>>
>>> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > The Spark PMC recently voted to add several new committers and one PMC
>>> member. Join me in welcoming them to their new roles!
>>> >
>>> > New PMC member: Dongjoon Hyun
>>> >
>>> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming
>>> Wang, Weichen Xu, Ruifeng Zheng
>>> >
>>> > The new committers cover lots of important areas including ML, SQL,
>>> and data sources, so it’s great to have them here. All the best,
>>> >
>>> > Matei and the Spark PMC
>>> >
>>> >
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


Re: Release Spark 2.3.4

2019-08-18 Thread Saisai Shao
+1

On Mon, Aug 19, 2019 at 10:28 AM Wenchen Fan  wrote:

> +1
>
> On Sat, Aug 17, 2019 at 3:37 PM Hyukjin Kwon  wrote:
>
>> +1 too
>>
>> On Sat, Aug 17, 2019 at 3:06 PM, Dilip Biswal  wrote:
>>
>>> +1
>>>
>>> Regards,
>>> Dilip Biswal
>>> Tel: 408-463-4980
>>> dbis...@us.ibm.com
>>>
>>>
>>>
>>> - Original message -
>>> From: John Zhuge 
>>> To: Xiao Li 
>>> Cc: Takeshi Yamamuro , Spark dev list <
>>> dev@spark.apache.org>, Kazuaki Ishizaki 
>>> Subject: [EXTERNAL] Re: Release Spark 2.3.4
>>> Date: Fri, Aug 16, 2019 4:33 PM
>>>
>>> +1
>>>
>>> On Fri, Aug 16, 2019 at 4:25 PM Xiao Li  wrote:
>>>
>>> +1
>>>
>>> On Fri, Aug 16, 2019 at 4:11 PM Takeshi Yamamuro 
>>> wrote:
>>>
>>> +1, too
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun 
>>> wrote:
>>>
>>> +1 for 2.3.4 release as the last release for `branch-2.3` EOL.
>>>
>>> Also, +1 for next week release.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
>>>
>>> I think it's fine to do these in parallel, yes. Go ahead if you are
>>> willing.
>>>
>>> On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki 
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > Spark 2.3.3 was released six months ago (15th February, 2019) at
>>> http://spark.apache.org/news/spark-2-3-3-released.html. And about 18
>>> months have passed since Spark 2.3.0 was released (28th February,
>>> 2018).
>>> > As of today (16th August), there are 103 commits (69 JIRAs) in
>>> `branch-2.3` since 2.3.3.
>>> >
>>> > It would be great if we can have Spark 2.3.4.
>>> > If it is OK, shall we start `2.3.4 RC1` concurrently with 2.4.4, or after
>>> 2.4.4 is released?
>>> >
>>> > An issue list in JIRA:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12344844
>>> > A commit list on GitHub since the last release:
>>> https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3
>>> > The 8 correctness issues resolved in branch-2.3:
>>> >
>>> https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>>> >
>>> > Best Regards,
>>> > Kazuaki Ishizaki
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>> --
>>> John Zhuge
>>>
>>>
>>>
>>> - To
>>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE][SPARK-25299] SPIP: Shuffle Storage API

2019-06-16 Thread Saisai Shao
+1 (binding)

Thanks
Saisai

On Sat, Jun 15, 2019 at 3:46 AM Imran Rashid  wrote:

> +1 (binding)
>
> I think this is a really important feature for Spark.
>
> First, there is already a lot of interest in alternative shuffle storage
> in the community, from dynamic allocation in Kubernetes to even just
> improving stability in standard on-premise use of Spark. However, they're
> often stuck doing this in forks of Spark, and in ways that are not
> maintainable (because they copy-paste many spark internals) or are
> incorrect (for not correctly handling speculative execution & stage
> retries).
>
> Second, I think the specific proposal is good for finding the right
> balance between flexibility and too much complexity, to allow incremental
> improvements.  A lot of work has been put into this already to try to
> figure out which pieces are essential to make alternative shuffle storage
> implementations feasible.
>
> Of course, that means it doesn't include everything imaginable; some
> things still aren't supported, and some will still choose to use the older
> ShuffleManager api to give total control over all of shuffle.  But we know
> there are a reasonable set of things which can be implemented behind the
> api as the first step, and it can continue to evolve.
>
> On Fri, Jun 14, 2019 at 12:13 PM Ilan Filonenko  wrote:
>
>> +1 (non-binding). This API is versatile and flexible enough to handle
>> Bloomberg's internal use-cases. The ability for us to vary implementation
>> strategies is quite appealing. It is also worth noting the minimal changes
>> to Spark core required to make it work. This is a much-needed addition
>> within the Spark shuffle story.
>>
>> On Fri, Jun 14, 2019 at 9:59 AM bo yang  wrote:
>>
>>> +1 This is great work, allowing plugging in different sort shuffle
>>> write/read implementations! Also great to see it retain the current Spark
>>> configuration
>>> (spark.shuffle.manager=org.apache.spark.shuffle.YourShuffleManagerImpl).
>>>
>>>
>>> On Thu, Jun 13, 2019 at 2:58 PM Matt Cheah  wrote:
>>>
 Hi everyone,



 I would like to call a vote for the SPIP for SPARK-25299, which proposes to
 introduce a pluggable storage API for temporary shuffle data.



 You may find the SPIP document here.



 The discussion thread for the SPIP was conducted here.



 Please vote on whether or not this proposal is agreeable to you.



 Thanks!



 -Matt Cheah

>>>


Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-06-13 Thread Saisai Shao
I think maybe we could start a vote on this SPIP.

This has been discussed for a while, and the current doc is pretty complete
as of now. We have also seen a lot of demand in the community from users
wanting to build their own shuffle storage.

Thanks
Saisai

On Tue, Jun 11, 2019 at 3:27 AM Imran Rashid  wrote:

> I would be happy to shepherd this.
>
> On Wed, Jun 5, 2019 at 7:33 PM Matt Cheah  wrote:
>
>> Hi everyone,
>>
>>
>>
>> I wanted to pick this back up again. The discussion has quieted down both
>> on this thread and on the document.
>>
>>
>>
>> We made a few revisions to the document to hopefully make it easier to
>> read and to clarify our criteria for success in the project. Some of the
>> APIs have also been adjusted based on further discussion and things we’ve
>> learned.
>>
>>
>>
>> I was hoping to discuss what our next steps could be here. Specifically,
>>
>>1. Would any PMC be willing to become the shepherd for this SPIP?
>>2. Is there any more feedback regarding this proposal?
>>3. What would we need to do to take this to a voting phase and to
>>begin proposing our work against upstream Spark?
>>
>>
>>
>> Thanks,
>>
>>
>>
>> -Matt Cheah
>>
>>
>>
>> *From: *"Yifei Huang (PD)" 
>> *Date: *Monday, May 13, 2019 at 1:04 PM
>> *To: *Mridul Muralidharan 
>> *Cc: *Bo Yang , Ilan Filonenko , Imran
>> Rashid , Justin Uang , Liang
>> Tang , Marcelo Vanzin , Matei
>> Zaharia , Matt Cheah , Min
>> Shen , Reynold Xin , Ryan Blue <
>> rb...@netflix.com>, Vinoo Ganesh , Will Manning <
>> wmann...@palantir.com>, "b...@fb.com" , "
>> dev@spark.apache.org" , "fel...@uber.com" <
>> fel...@uber.com>, "f...@linkedin.com" , "
>> tgraves...@gmail.com" , "yez...@linkedin.com" <
>> yez...@linkedin.com>, "yue...@memverge.com" 
>> *Subject: *Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>>
>>
>>
>> Hi Mridul - thanks for taking the time to give us feedback! Thoughts on
>> the points that you mentioned:
>>
>>
>>
>> The API is meant to work with the existing SortShuffleManager algorithm.
>> There aren't strict requirements on how other ShuffleManager
>> implementations must behave, so it seems impractical to design an API that
>> could also satisfy those unknown requirements. However, we do believe that
>> the API is rather generic, using OutputStreams for writes and InputStreams
>> for reads, and indexing the data by a shuffleId-mapId-reduceId combo, so if
>> other shuffle algorithms treat the data in the same chunks and want an
>> interface for storage, then they can also use this API from within their
>> implementation.
>>
>>
>>
>> About speculative execution, we originally made the assumption that each
>> shuffle task is deterministic, which meant that even if a later mapper
>> overrode a previously committed mapper's output, the contents would still be the same.
>> Having searched some tickets and reading
>> https://github.com/apache/spark/pull/22112/files more carefully, I think
>> there are problems with our original thought if the writer writes all
>> attempts of a task to the same location. One example is if the writer
>> implementation writes each partition to the remote host in a sequence of
>> chunks. In such a situation, a reducer might read data half written by the
>> original task and half written by the running speculative task, which will
>> not be the correct contents if the mapper output is unordered. Therefore,
>> writes by a single mapper might have to be transactional, which is not
>> clear from the API, and seems rather complex to reason about, so we
>> shouldn't expect this from the implementer.
>>
>>
>>
>> However, this doesn't affect the fundamentals of the API since we only
>> need to add an additional attemptId to the storage data index (which can be
>> stored within the MapStatus) to solve the problem of concurrent writes.
>> This would also make it more clear that the writer should use attempt ID as
>> an index to ensure that writes from speculative tasks don't interfere with
>> one another (we can add that to the API docs as well).
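As a rough illustration of the indexing scheme discussed above (shuffleId / mapId / reduceId plus an attempt ID, with OutputStream-based writes and InputStream-based reads), the Scala trait below sketches what such a storage interface could look like. The trait and method names are assumptions made for this sketch; this is not the actual SPARK-25299 API.

```scala
import java.io.{InputStream, OutputStream}

// Hypothetical interface, for illustration only (not the SPARK-25299 API).
// Data is indexed by shuffleId / mapId / reduceId, plus an attemptId so that
// concurrent speculative attempts of the same map task cannot clobber each other.
trait ShuffleBlockStoreSketch {
  // Open a stream for one map task attempt's output for a given reduce partition.
  def writeBlock(shuffleId: Int, mapId: Long, reduceId: Int, attemptId: Int): OutputStream

  // Mark an attempt as committed once all of its partitions are fully written;
  // only committed attempts should ever become visible to reducers.
  def commitAttempt(shuffleId: Int, mapId: Long, attemptId: Int): Unit

  // Read a block written by a committed attempt; the attemptId would travel to
  // reducers via the MapStatus, as suggested in the thread above.
  def readBlock(shuffleId: Int, mapId: Long, reduceId: Int, attemptId: Int): InputStream

  // Clean up all data for a shuffle once it is no longer needed.
  def removeShuffle(shuffleId: Int): Unit
}
```

With this shape, speculative attempts write to distinct (mapId, attemptId) locations, and reducers only ever read from the attempt recorded as committed.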
>>
>>
>>
>> *From: *Mridul Muralidharan 
>> *Date: *Wednesday, May 8, 2019 at 8:18 PM
>> *To: *"Yifei Huang (PD)" 
>> *Cc: *Bo Yang , Ilan Filonenko , Imran
>> Rashid , Justin Uang , Liang
>> Tang , Marcelo Vanzin , Matei
>> Zaharia , Matt Cheah , Min
>> Shen , Reynold Xin , Ryan Blue <
>> rb...@netflix.com>, Vinoo Ganesh , Will Manning <
>> wmann...@palantir.com>, "b...@fb.com" , "
>> dev@spark.apache.org" , "fel...@uber.com" <
>> fel...@uber.com>, "f...@linkedin.com" , "
>> tgraves...@gmail.com" , "yez...@linkedin.com" <
>> yez...@linkedin.com>, "yue...@memverge.com" 
>> *Subject: *Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>>
>>
>>
>>
>>
>> Unfortunately I do not have bandwidth to do a detailed review, but a few
>> things come to mind after a quick read:
>>
>>
>>
>> - While it might be tactically beneficial to align with existing
>> implementation, a clean design which does not tie into existing shuffle
>> implementation would be preferable (if it can be done without 

Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API

2019-06-10 Thread Saisai Shao
I'm currently working with MemVerge on the Splash project (one
implementation of remote shuffle storage) and have followed this ticket for a
while. I would like to be the shepherd if no one else has volunteered.

Best regards,
Saisai

On Thu, Jun 6, 2019 at 8:33 AM Matt Cheah  wrote:

> Hi everyone,
>
>
>
> I wanted to pick this back up again. The discussion has quieted down both
> on this thread and on the document.
>
>
>
> We made a few revisions to the document to hopefully make it easier to
> read and to clarify our criteria for success in the project. Some of the
> APIs have also been adjusted based on further discussion and things we’ve
> learned.
>
>
>
> I was hoping to discuss what our next steps could be here. Specifically,
>
>1. Would any PMC be willing to become the shepherd for this SPIP?
>2. Is there any more feedback regarding this proposal?
>3. What would we need to do to take this to a voting phase and to
>begin proposing our work against upstream Spark?
>
>
>
> Thanks,
>
>
>
> -Matt Cheah
>
>
>
> *From: *"Yifei Huang (PD)" 
> *Date: *Monday, May 13, 2019 at 1:04 PM
> *To: *Mridul Muralidharan 
> *Cc: *Bo Yang , Ilan Filonenko , Imran
> Rashid , Justin Uang , Liang
> Tang , Marcelo Vanzin , Matei
> Zaharia , Matt Cheah , Min
> Shen , Reynold Xin , Ryan Blue <
> rb...@netflix.com>, Vinoo Ganesh , Will Manning <
> wmann...@palantir.com>, "b...@fb.com" , "dev@spark.apache.org"
> , "fel...@uber.com" , "
> f...@linkedin.com" , "tgraves...@gmail.com" <
> tgraves...@gmail.com>, "yez...@linkedin.com" , "
> yue...@memverge.com" 
> *Subject: *Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>
>
>
> Hi Mridul - thanks for taking the time to give us feedback! Thoughts on
> the points that you mentioned:
>
>
>
> The API is meant to work with the existing SortShuffleManager algorithm.
> There aren't strict requirements on how other ShuffleManager
> implementations must behave, so it seems impractical to design an API that
> could also satisfy those unknown requirements. However, we do believe that
> the API is rather generic, using OutputStreams for writes and InputStreams
> for reads, and indexing the data by a shuffleId-mapId-reduceId combo, so if
> other shuffle algorithms treat the data in the same chunks and want an
> interface for storage, then they can also use this API from within their
> implementation.
>
>
>
> About speculative execution, we originally made the assumption that each
> shuffle task is deterministic, which meant that even if a later mapper
> overrode a previously committed mapper's output, the contents would still be the same.
> Having searched some tickets and reading
> https://github.com/apache/spark/pull/22112/files more carefully, I think
> there are problems with our original thought if the writer writes all
> attempts of a task to the same location. One example is if the writer
> implementation writes each partition to the remote host in a sequence of
> chunks. In such a situation, a reducer might read data half written by the
> original task and half written by the running speculative task, which will
> not be the correct contents if the mapper output is unordered. Therefore,
> writes by a single mapper might have to be transactional, which is not
> clear from the API, and seems rather complex to reason about, so we
> shouldn't expect this from the implementer.
>
>
>
> However, this doesn't affect the fundamentals of the API since we only
> need to add an additional attemptId to the storage data index (which can be
> stored within the MapStatus) to solve the problem of concurrent writes.
> This would also make it more clear that the writer should use attempt ID as
> an index to ensure that writes from speculative tasks don't interfere with
> one another (we can add that to the API docs as well).
>
>
>
> *From: *Mridul Muralidharan 
> *Date: *Wednesday, May 8, 2019 at 8:18 PM
> *To: *"Yifei Huang (PD)" 
> *Cc: *Bo Yang , Ilan Filonenko , Imran
> Rashid , Justin Uang , Liang
> Tang , Marcelo Vanzin , Matei
> Zaharia , Matt Cheah , Min
> Shen , Reynold Xin , Ryan Blue <
> rb...@netflix.com>, Vinoo Ganesh , Will Manning <
> wmann...@palantir.com>, "b...@fb.com" , "dev@spark.apache.org"
> , "fel...@uber.com" , "
> f...@linkedin.com" , "tgraves...@gmail.com" <
> tgraves...@gmail.com>, "yez...@linkedin.com" , "
> yue...@memverge.com" 
> *Subject: *Re: [DISCUSS][SPARK-25299] SPIP: Shuffle storage API
>
>
>
>
>
> Unfortunately I do not have bandwidth to do a detailed review, but a few
> things come to mind after a quick read:
>
>
>
> - While it might be tactically beneficial to align with existing
> implementation, a clean design which does not tie into existing shuffle
> implementation would be preferable (if it can be done without over
> engineering). Shuffle implementation can change and there are custom
> implementations and experiments which differ quite a bit from what comes
> with Apache Spark.
>
>
>
>
>
> - Please keep speculative execution in mind while designing the
> interfaces: in 

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-06 Thread Saisai Shao
Do we have any other blocker/critical issues for Spark 2.4.1, or are we waiting
for something to be fixed? I roughly searched JIRA, and it seems there are no
blocker/critical issues marked for 2.4.1.

Thanks
Saisai

On Thu, Mar 7, 2019 at 4:57 AM shane knapp  wrote:

> i'll be popping in to the sig-big-data meeting on the 20th to talk about
> stuff like this.
>
> On Wed, Mar 6, 2019 at 12:40 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com> wrote:
>
>> Yes, it's a tough decision, and as we discussed today (
>> https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA
>> )
>> "Kubernetes support window is 9 months, Spark is two years". So we may
>> end up with old client versions on branches still supported like 2.4.x in
>> the future.
>> That gives us no choice but to upgrade, if we want to be on the safe
>> side. We have tested 3.0.0 with 1.11 internally and it works, but I don't
>> know what it means to run with old clients.
>>
>>
>> On Wed, Mar 6, 2019 at 7:54 PM Sean Owen  wrote:
>>
>>> If the old client is basically unusable with the versions of K8S
>>> people mostly use now, and the new client still works with older
>>> versions, I could see including this in 2.4.1.
>>>
>>> Looking at
>>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix
>>> it seems like the 4.1.1 client is needed for 1.10 and above. However
>>> it no longer supports 1.7 and below.
>>> We have 3.0.x, and versions through 4.0.x of the client support the
>>> same K8S versions, so no real middle ground here.
>>>
>>> 1.7.0 came out June 2017, it seems. 1.10 was March 2018. Minor release
>>> branches are maintained for 9 months per
>>> https://kubernetes.io/docs/setup/version-skew-policy/
>>>
>>> Spark 2.4.0 came in Nov 2018. I suppose we could say it should have
>>> used the newer client from the start as at that point (?) 1.7 and
>>> earlier were already at least 7 months past EOL.
>>> If we update the client in 2.4.1, versions of K8S as recently
>>> 'supported' as a year ago won't work anymore. I'm guessing there are
>>> still 1.7 users out there? That wasn't that long ago but if the
>>> project and users generally move fast, maybe not.
>>>
>>> Normally I'd say, that's what the next minor release of Spark is for;
>>> update if you want later infra. But there is no Spark 2.5.
>>> I presume downstream distros could modify the dependency easily (?) if
>>> needed and maybe already do. It wouldn't necessarily help end users.
>>>
>>> Does the 3.0.x client not work at all with 1.10+, or is it just unsupported?
>>> If it 'basically works but no guarantees' I'd favor not updating. If
>>> it doesn't work at all, hm. That's tough. I think I'd favor updating
>>> the client but think it's a tough call both ways.
>>>
>>>
>>>
>>> On Wed, Mar 6, 2019 at 11:14 AM Stavros Kontopoulos
>>>  wrote:
>>> >
>>> > Yes, Shane Knapp has done the work for that already, and the tests also
>>> pass. I am working on a PR now; I could submit it for the 2.4 branch.
>>> > I understand that this is a major dependency update, but the problem I
>>> see is that the client version is so old that I don't think it makes
>>> > much sense for current users who are on k8s 1.10, 1.11 etc(
>>> https://github.com/fabric8io/kubernetes-client#compatibility-matrix,
>>> 3.0.0 does not even exist in there).
>>> > I don't know what it means to use that old version with current k8s
>>> clusters in terms of bugs etc.
>>>
>>
>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-03-05 Thread Saisai Shao
Hi DB,

I saw that we already have 6 RCs, but the latest vote I can find so far is for
RC2. Were the others all canceled?

Thanks
Saisai

On Fri, Feb 22, 2019 at 4:51 AM DB Tsai  wrote:

> I am cutting a new rc4 with fix from Felix. Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0359BC9965359766
>
> On Thu, Feb 21, 2019 at 8:57 AM Felix Cheung 
> wrote:
> >
> > I merged the fix to 2.4.
> >
> >
> > 
> > From: Felix Cheung 
> > Sent: Wednesday, February 20, 2019 9:34 PM
> > To: DB Tsai; Spark dev list
> > Cc: Cesar Delgado
> > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
> >
> > Could you hold for a bit - I have one more fix to get in
> >
> >
> > 
> > From: d_t...@apple.com on behalf of DB Tsai 
> > Sent: Wednesday, February 20, 2019 12:25 PM
> > To: Spark dev list
> > Cc: Cesar Delgado
> > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
> >
> > Okay. Let's fail rc2, and I'll prepare rc3 with SPARK-26859.
> >
> > DB Tsai | Siri Open Source Technologies [not a contribution] |  Apple,
> Inc
> >
> > > On Feb 20, 2019, at 12:11 PM, Marcelo Vanzin
>  wrote:
> > >
> > > Just wanted to point out that
> > > https://issues.apache.org/jira/browse/SPARK-26859 is not in this RC,
> > > and is marked as a correctness bug. (The fix is in the 2.4 branch,
> > > just not in rc2.)
> > >
> > > On Wed, Feb 20, 2019 at 12:07 PM DB Tsai 
> wrote:
> > >>
> > >> Please vote on releasing the following candidate as Apache Spark
> version 2.4.1.
> > >>
> > >> The vote is open until Feb 24 PST and passes if a majority +1 PMC
> votes are cast, with
> > >> a minimum of 3 +1 votes.
> > >>
> > >> [ ] +1 Release this package as Apache Spark 2.4.1
> > >> [ ] -1 Do not release this package because ...
> > >>
> > >> To learn more about Apache Spark, please see http://spark.apache.org/
> > >>
> > >> The tag to be voted on is v2.4.1-rc2 (commit
> 229ad524cfd3f74dd7aa5fc9ba841ae223caa960):
> > >> https://github.com/apache/spark/tree/v2.4.1-rc2
> > >>
> > >> The release files, including signatures, digests, etc. can be found
> at:
> > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-bin/
> > >>
> > >> Signatures used for Spark RCs can be found in this file:
> > >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >>
> > >> The staging repository for this release can be found at:
> > >>
> https://repository.apache.org/content/repositories/orgapachespark-1299/
> > >>
> > >> The documentation corresponding to this release can be found at:
> > >> https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc2-docs/
> > >>
> > >> The list of bug fixes going into 2.4.1 can be found at the following
> URL:
> > >> https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
> > >>
> > >> FAQ
> > >>
> > >> =
> > >> How can I help test this release?
> > >> =
> > >>
> > >> If you are a Spark user, you can help us test this release by taking
> > >> an existing Spark workload and running on this release candidate, then
> > >> reporting any regressions.
> > >>
> > >> If you're working in PySpark you can set up a virtual env and install
> > >> the current RC and see if anything important breaks, in the Java/Scala
> > >> you can add the staging repository to your projects resolvers and test
> > >> with the RC (make sure to clean up the artifact cache before/after so
> > >> you don't end up building with an out-of-date RC going forward).
> > >>
> > >> ===
> > >> What should happen to JIRA tickets still targeting 2.4.1?
> > >> ===
> > >>
> > >> The current list of open tickets targeted at 2.4.1 can be found at:
> > >> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.1
> > >>
> > >> Committers should look at those and triage. Extremely important bug
> > >> fixes, documentation, and API tweaks that impact compatibility should
> > >> be worked on immediately. Everything else please retarget to an
> > >> appropriate release.
> > >>
> > >> ==
> > >> But my bug isn't fixed?
> > >> ==
> > >>
> > >> In order to make timely releases, we will typically not hold the
> > >> release unless the bug in question is a regression from the previous
> > >> release. That being said, if there is something which is a regression
> > >> that has not been correctly targeted please ping me or a committer to
> > >> help target the issue.
> > >>
> > >>
> > >> DB Tsai | Siri Open Source Technologies [not a contribution] | 
> Apple, Inc
> > >>
> > >>
> > >> -
> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >>
> > >
> > >
> > > --
> > > Marcelo
> > >
> > > -
> > > To unsubscribe e-mail: 

Re: Apache Spark 2.2.3 ?

2019-01-01 Thread Saisai Shao
Agreed on having a new branch-2.3 release, as we have already accumulated
several fixes.

Thanks
Saisai

On Wed, Jan 2, 2019 at 1:32 PM Xiao Li  wrote:

> Based on the commit history,
> https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.3
> contains more critical fixes. Maybe the priority is higher?
>
> On Tue, Jan 1, 2019 at 9:22 PM Felix Cheung 
> wrote:
>
>> Speaking of, it’s been 3 months since 2.3.2... (Sept 2018)
>>
>> And 2 months since 2.4.0 (Nov 2018) - does the community feel 2.4 branch
>> is stabilizing?
>>
>>
>> --
>> *From:* Sean Owen 
>> *Sent:* Tuesday, January 1, 2019 8:30 PM
>> *To:* Dongjoon Hyun
>> *Cc:* dev
>> *Subject:* Re: Apache Spark 2.2.3 ?
>>
>> I agree with that logic, and if you're volunteering to do the legwork,
>> I don't see a reason not to cut a final 2.2 release.
>>
>> On Tue, Jan 1, 2019 at 9:19 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > The Apache Spark community has a policy of maintaining a feature branch for
>> 18 months. I think it's time for the 2.2.3 release, since 2.2.0 was released
>> in July 2017.
>> >
>> > http://spark.apache.org/versioning-policy.html
>> >
>> > After 2.2.2 (July 2018), `branch-2.2` has 40 patches (including
>> security patches).
>> >
>> >
>> https://gitbox.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-2.2
>> >
>> > If it's okay and there is no further plan on `branch-2.2`, I want to
>> volunteer to prepare the first RC (early next week?).
>> >
>> > Please let me know your opinions about this.
>> >
>> > Bests,
>> > Dongjoon.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
>


Re: What's a blocker?

2018-10-24 Thread Saisai Shao
Just my two cents from past experience. As the release manager for Spark
2.3.2, I saw significant delays during the release caused by blocker issues. The
vote failed several times because of one or two "blocker issues". I think that
during the RC period, each "blocker issue" should be carefully evaluated by the
related PMC members and the release manager. Issues which are not so critical,
or which only matter to one or two firms, should be marked as blockers only
after careful consideration, to avoid delaying the release.

Thanks
Saisai


Re: welcome a new batch of committers

2018-10-07 Thread Saisai Shao
Congratulations to all!

On Sun, Oct 7, 2018 at 1:12 AM Jacek Laskowski  wrote:

> Wow! That's a nice bunch of contributors. Congrats to all the new committers.
> I've had a tough time following all the contributions, but with this crew
> it's gonna be nearly impossible.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Oct 3, 2018 at 10:59 AM Reynold Xin  wrote:
>
>> Hi all,
>>
>> The Apache Spark PMC has recently voted to add several new committers to
>> the project, for their contributions:
>>
>> - Shane Knapp (contributor to infra)
>> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
>> - Kazuaki Ishizaki (contributor to Spark SQL)
>> - Xingbo Jiang (contributor to Spark Core and SQL)
>> - Yinan Li (contributor to Spark on Kubernetes)
>> - Takeshi Yamamuro (contributor to Spark SQL)
>>
>> Please join me in welcoming them!
>>
>>


Re: SPIP: Support Kafka delegation token in Structured Streaming

2018-09-29 Thread Saisai Shao
I like this proposal. Since Kafka already provides a delegation token
mechanism, we can also leverage Spark's delegation token framework to add
built-in support for Kafka.

BTW, I think there's not much difference between supporting Structured
Streaming and DStream; maybe we can set both as goals.

Thanks
Saisai
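As a rough illustration of what plugging a new service into Spark's delegation token framework could look like, the trait below sketches the general shape of a pluggable token provider. The trait name and method signatures are assumptions made for this sketch, not Spark's actual internal API.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf

// Hypothetical provider shape, for illustration only.
trait DelegationTokenProviderSketch {
  // Short name of the service the tokens are for, e.g. "kafka".
  def serviceName: String

  // Whether this application needs tokens for the service
  // (e.g. security is enabled and the service is configured).
  def tokensRequired(sparkConf: SparkConf, hadoopConf: Configuration): Boolean

  // Obtain tokens from the service, add them to the given credentials, and
  // optionally return the next renewal time (ms) so renewal can be scheduled.
  def obtainTokens(sparkConf: SparkConf,
                   hadoopConf: Configuration,
                   creds: Credentials): Option[Long]
}
```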

On Thu, Sep 27, 2018 at 7:58 PM Gabor Somogyi  wrote:

> Hi all,
>
> I am writing this e-mail to discuss the delegation token support for Kafka,
> which is reported in SPARK-25501. I've prepared a SPIP for it. A PR is on the
> way...
>
> Looking forward to hearing your feedback.
>
> BR,
> G
>
>


Re: [VOTE] SPARK 2.4.0 (RC2)

2018-09-27 Thread Saisai Shao
Only "without-hadoop" profile has 2.12 binary, is it expected?

Thanks
Saisai

On Fri, Sep 28, 2018 at 11:08 AM Wenchen Fan  wrote:

> I'm adding my own +1, since all the problems mentioned in the RC1 voting
> email are resolved. And there is no blocker issue for 2.4.0 AFAIK.
>
> On Fri, Sep 28, 2018 at 10:59 AM Wenchen Fan  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.0.
>>
>> The vote is open until October 1 PST and passes if a majority +1 PMC
>> votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.4.0-rc2 (commit
>> 42f25f309e91c8cde1814e3720099ac1e64783da):
>> https://github.com/apache/spark/tree/v2.4.0-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1287
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc2-docs/
>>
>> The list of bug fixes going into 2.4.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.0?
>> ===
>>
>> The current list of open tickets targeted at 2.4.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>


[ANNOUNCE] Announcing Apache Spark 2.3.2

2018-09-26 Thread Saisai Shao
We are happy to announce the availability of Spark 2.3.2!

Apache Spark 2.3.2 is a maintenance release, based on the branch-2.3
maintenance branch of Spark. We strongly recommend that all 2.3.x users
upgrade to this stable release.

To download Spark 2.3.2, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-3-2.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.


Best regards
Saisai


[VOTE][RESULT] Spark 2.3.2 (RC6)

2018-09-23 Thread Saisai Shao
The vote passes. Thanks to all who helped with the release!

I'll follow up later with a release announcement once everything is
published.

+1 (* = binding):

Sean Owen*
Wenchen Fan*
Saisai Shao
Dongjoon Hyun
Takeshi Yamamuro
John Zhuge
Xiao Li*
Denny Lee
Ryan Blue
Michael Heuer

+0: None

-1: None

Thanks
Saisai


***UNCHECKED*** Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Saisai Shao
Hi Marco,

From my understanding of SPARK-25454, I don't think it is a blocker issue; it
might be a corner case, so personally I don't want to block the release of
2.3.2 because of this issue. The release has already been delayed for a long time.

Thanks
Saisai


On Wed, Sep 19, 2018 at 2:58 PM Marco Gaido  wrote:

> Sorry, I am -1 because of SPARK-25454 which is a regression from 2.2.
>
> On Wed, Sep 19, 2018 at 03:45 Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> +1.
>>
>> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserve` on OpenJDK(1.8.0_181)/CentOS 7.5.
>>
>> I hit the following test case failure once during testing, but it's not
>> persistent.
>>
>> KafkaContinuousSourceSuite
>> ...
>> subscribing topic by name from earliest offsets (failOnDataLoss:
>> false) *** FAILED ***
>>
>> Thank you, Saisai.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
>> wrote:
>>
>>> +1 from my own side.
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Tue, Sep 18, 2018 at 9:34 AM Wenchen Fan  wrote:
>>>
>>>> +1. All the blocker issues are all resolved in 2.3.2 AFAIK.
>>>>
>>>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>>>
>>>>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>>>>> build from source with most profiles passed for me.
>>>>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>>>>> wrote:
>>>>> >
>>>>> > Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.3.2.
>>>>> >
>>>>> > The vote is open until September 21 PST and passes if a majority +1
>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>> >
>>>>> > [ ] +1 Release this package as Apache Spark 2.3.2
>>>>> > [ ] -1 Do not release this package because ...
>>>>> >
>>>>> > To learn more about Apache Spark, please see
>>>>> http://spark.apache.org/
>>>>> >
>>>>> > The tag to be voted on is v2.3.2-rc6 (commit
>>>>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>>>>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>>>>> >
>>>>> > The release files, including signatures, digests, etc. can be found
>>>>> at:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>>>>> >
>>>>> > Signatures used for Spark RCs can be found in this file:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>> >
>>>>> > The staging repository for this release can be found at:
>>>>> >
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>>>>> >
>>>>> > The documentation corresponding to this release can be found at:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>>>>> >
>>>>> > The list of bug fixes going into 2.3.2 can be found at the following
>>>>> URL:
>>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>>>> >
>>>>> >
>>>>> > FAQ
>>>>> >
>>>>> > =
>>>>> > How can I help test this release?
>>>>> > =
>>>>> >
>>>>> > If you are a Spark user, you can help us test this release by taking
>>>>> > an existing Spark workload and running on this release candidate,
>>>>> then
>>>>> > reporting any regressions.
>>>>> >
>>>>> > If you're working in PySpark you can set up a virtual env and install
>>>>> > the current RC and see if anything important breaks, in the
>>>>> Java/Scala
>>>>> > you can add the staging repository to your projects resolvers and
>>>>> test
>>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>>> > you don't end up building with an out-of-date RC going forward).
>>>>> >
>>>>> > ===
>>>>> > What should happen to JIRA tickets still targeting 2.3.2?
>>>>> > ===
>>>>> >
>>>>> > The current list of open tickets targeted at 2.3.2 can be found at:
>>>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>>>> "Target Version/s" = 2.3.2
>>>>> >
>>>>> > Committers should look at those and triage. Extremely important bug
>>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>>> > be worked on immediately. Everything else please retarget to an
>>>>> > appropriate release.
>>>>> >
>>>>> > ==
>>>>> > But my bug isn't fixed?
>>>>> > ==
>>>>> >
>>>>> > In order to make timely releases, we will typically not hold the
>>>>> > release unless the bug in question is a regression from the previous
>>>>> > release. That being said, if there is something which is a regression
>>>>> > that has not been correctly targeted please ping me or a committer to
>>>>> > help target the issue.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>


***UNCHECKED*** Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Saisai Shao
Hi Marco,

From my understanding of SPARK-25454, I don't think it is a blocker issue; it
might be a corner case, so personally I don't want to block the release of
2.3.2 because of this issue. The release has already been delayed for a long time.

Marco Gaido  于2018年9月19日周三 下午2:58写道:

> Sorry, I am -1 because of SPARK-25454 which is a regression from 2.2.
>
> On Wed, Sep 19, 2018 at 03:45 Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> +1.
>>
>> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>> -Phive-thriftserve` on OpenJDK(1.8.0_181)/CentOS 7.5.
>>
>> I hit the following test case failure once during testing, but it's not
>> persistent.
>>
>> KafkaContinuousSourceSuite
>> ...
>> subscribing topic by name from earliest offsets (failOnDataLoss:
>> false) *** FAILED ***
>>
>> Thank you, Saisai.
>>
>> Bests,
>> Dongjoon.
>>
>> On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
>> wrote:
>>
>>> +1 from my own side.
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Tue, Sep 18, 2018 at 9:34 AM Wenchen Fan  wrote:
>>>
>>>> +1. All the blocker issues are all resolved in 2.3.2 AFAIK.
>>>>
>>>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>>>>
>>>>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>>>>> build from source with most profiles passed for me.
>>>>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>>>>> wrote:
>>>>> >
>>>>> > Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.3.2.
>>>>> >
>>>>> > The vote is open until September 21 PST and passes if a majority +1
>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>> >
>>>>> > [ ] +1 Release this package as Apache Spark 2.3.2
>>>>> > [ ] -1 Do not release this package because ...
>>>>> >
>>>>> > To learn more about Apache Spark, please see
>>>>> http://spark.apache.org/
>>>>> >
>>>>> > The tag to be voted on is v2.3.2-rc6 (commit
>>>>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>>>>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>>>>> >
>>>>> > The release files, including signatures, digests, etc. can be found
>>>>> at:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>>>>> >
>>>>> > Signatures used for Spark RCs can be found in this file:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>> >
>>>>> > The staging repository for this release can be found at:
>>>>> >
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1286/
>>>>> >
>>>>> > The documentation corresponding to this release can be found at:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>>>>> >
>>>>> > The list of bug fixes going into 2.3.2 can be found at the following
>>>>> URL:
>>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>>>> >
>>>>> >
>>>>> > FAQ
>>>>> >
>>>>> > =
>>>>> > How can I help test this release?
>>>>> > =
>>>>> >
>>>>> > If you are a Spark user, you can help us test this release by taking
>>>>> > an existing Spark workload and running on this release candidate,
>>>>> then
>>>>> > reporting any regressions.
>>>>> >
>>>>> > If you're working in PySpark you can set up a virtual env and install
>>>>> > the current RC and see if anything important breaks, in the
>>>>> Java/Scala
>>>>> > you can add the staging repository to your projects resolvers and
>>>>> test
>>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>>> > you don't end up building with an out-of-date RC going forward).
>>>>> >
>>>>> > ===
>>>>> > What should happen to JIRA tickets still targeting 2.3.2?
>>>>> > ===
>>>>> >
>>>>> > The current list of open tickets targeted at 2.3.2 can be found at:
>>>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>>>> "Target Version/s" = 2.3.2
>>>>> >
>>>>> > Committers should look at those and triage. Extremely important bug
>>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>>> > be worked on immediately. Everything else please retarget to an
>>>>> > appropriate release.
>>>>> >
>>>>> > ==
>>>>> > But my bug isn't fixed?
>>>>> > ==
>>>>> >
>>>>> > In order to make timely releases, we will typically not hold the
>>>>> > release unless the bug in question is a regression from the previous
>>>>> > release. That being said, if there is something which is a regression
>>>>> > that has not been correctly targeted please ping me or a committer to
>>>>> > help target the issue.
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>


Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Saisai Shao
+1 from my own side.

Thanks
Saisai

On Tue, Sep 18, 2018 at 9:34 AM Wenchen Fan  wrote:

> +1. All the blocker issues are all resolved in 2.3.2 AFAIK.
>
> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen  wrote:
>
>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>> build from source with most profiles passed for me.
>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.3.2.
>> >
>> > The vote is open until September 21 PST and passes if a majority +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.3.2
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.3.2-rc6 (commit
>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1286/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/
>> >
>> > The list of bug fixes going into 2.3.2 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12343289
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.3.2?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.3.2 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.3.2
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-17 Thread Saisai Shao
Hi Wenchen,

I think you need to set SPHINXPYTHON to python3 before building the docs,
to work around the doc issue (
https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
).

Here are the notes from the release page:


>- Ensure you have Python 3 having Sphinx installed, and SPHINXPYTHON 
> environment
>variable is set to indicate your Python 3 executable (see SPARK-24530).
>
>
On Mon, Sep 17, 2018 at 10:48 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until September 20 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc1 (commit
> 1220ab8a0738b5f67dc522df5e3e77ffc83d207a):
> https://github.com/apache/spark/tree/v2.4.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1285/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc1-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/2.4.0
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


[VOTE] SPARK 2.3.2 (RC6)

2018-09-17 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version
2.3.2.

The vote is open until September 21 PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.2-rc6 (commit
02b510728c31b70e6035ad541bfcdc2b59dcd79a):
https://github.com/apache/spark/tree/v2.3.2-rc6

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1286/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-docs/

The list of bug fixes going into 2.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343289


FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.2?
===

The current list of open tickets targeted at 2.3.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.3.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-09-06 Thread Saisai Shao
Hi,

PMC members asked me to hold a bit while they're dealing with some other
things. Please wait a bit.

Thanks
Saisai


On Thu, Sep 6, 2018 at 4:27 PM zzc <441586...@qq.com> wrote:

> Hi Saisai:
>   Spark 2.4 was cut; is there any new progress on 2.3.2?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Persisting driver logs in yarn client mode (SPARK-25118)

2018-08-21 Thread Saisai Shao
One issue I can think of is that "moving the driver log" at application end
is quite time-consuming, which will significantly delay shutdown. We have
already suffered from this kind of "rename" problem with event logs on object
stores, and moving the driver log would make the problem worse.

For a vanilla Spark-on-YARN client application, I think the user could
redirect the console output to a log file and provide both the driver log
and the YARN application log to the customers; this does not seem like a big
overhead.

Just my two cents.

Thanks
Saisai

Ankur Gupta  于2018年8月22日周三 上午5:19写道:

> Hi all,
>
> I want to highlight a problem that we face here at Cloudera and start a
> discussion on how to go about solving it.
>
> *Problem Statement:*
> Our customers reach out to us when they face problems in their Spark
> Applications. Those problems can be related to Spark, environment issues,
> their own code or something else altogether. A lot of times these customers
> run their Spark Applications in Yarn Client mode, which as we all know,
> uses a ConsoleAppender to print logs to the console. These customers
> usually send their Yarn logs to us to troubleshoot. As you may have
> figured, these logs do not contain the driver logs, which makes it difficult
> for us to troubleshoot the issue. In that scenario our customers end up running
> the application again, piping the output to a log file or using a local log
> appender and then sending over that file.
>
> I believe there are other users in the community who face a similar
> problem, where the central team managing Spark clusters has difficulty
> helping the end users because they ran their application in a shell or in
> yarn client mode (I am not sure what the equivalent is in Mesos).
>
> Additionally, there may be teams who want to capture all these logs so
> that they can be analyzed at some later point in time and the fact that
> driver logs are not a part of Yarn Logs causes them to capture only partial
> logs or makes it difficult to capture all the logs.
>
> *Proposed Solution:*
> One "low touch" approach will be to create an ApplicationListener which
> listens for Application Start and Application End events. On Application
> Start, this listener will append a Log Appender which writes to a local or
> remote (eg:hdfs) log file in an application specific directory and moves
> this to Yarn's Remote Application Dir (or equivalent Mesos Dir) on
> application end. This way the logs will be available as part of Yarn Logs.
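
A rough sketch of the listener idea described above could look like the
following. It assumes log4j 1.x (which Spark 2.x bundles) and a
Hadoop-accessible target directory; the class name, layout pattern, and paths
are illustrative placeholders rather than anything from the actual proposal.

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.log4j.{FileAppender, Logger, PatternLayout}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerApplicationStart}

// Hypothetical sketch: capture driver logs in a local file and copy it to a
// remote directory (e.g. the YARN remote application dir) when the app ends.
class DriverLogListener(localPath: String, remoteDir: String) extends SparkListener {

  private val appender = new FileAppender(
    new PatternLayout("%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"), localPath, true)

  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    // Start mirroring everything the driver logs into the local file.
    Logger.getRootLogger.addAppender(appender)
  }

  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
    // Detach the appender and copy the file out; this copy is the step that
    // can be slow on object stores, as noted at the top of this reply.
    Logger.getRootLogger.removeAppender(appender)
    appender.close()
    val fs = FileSystem.get(new URI(remoteDir), new Configuration())
    fs.copyFromLocalFile(new Path(localPath), new Path(remoteDir))
  }
}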
>
> I am also interested in hearing about other ideas that the community may
> have about this. Or if someone has already solved this problem, then I
> would like them to contribute their solution to the community.
>
> Thanks,
> Ankur
>


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Saisai Shao
There's still another one SPARK-25114.

I will wait for several days in case some other blocks jumped.

Thanks
Saisai



Wenchen Fan  于2018年8月15日周三 上午10:19写道:

> SPARK-25051 is resolved, can we start a new RC?
>
> SPARK-16406 is an improvement, generally we should not backport.
>
> On Wed, Aug 15, 2018 at 5:16 AM Sean Owen  wrote:
>
>> (We wouldn't consider lack of an improvement to block a maintenance
>> release. It's reasonable to raise this elsewhere as a big nice to have on
>> 2.3.x in general)
>>
>> On Tue, Aug 14, 2018, 4:13 PM antonkulaga  wrote:
>>
>>> -1 as https://issues.apache.org/jira/browse/SPARK-16406 does not seem
>>> to be
>>> back-ported to 2.3.1 and it causes a lot of pain
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


[VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version
2.3.2.

The vote is open until August 20 PST and passes if a majority +1 PMC votes
are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.2-rc5 (commit
4dc82259d81102e0cb48f4cb2e8075f80d899ac4):
https://github.com/apache/spark/tree/v2.3.2-rc5

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc5-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1281/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc5-docs/

The list of bug fixes going into 2.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343289

Note. RC4 was cancelled because of one blocking issue SPARK-25084 during
release preparation.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install
the current RC to see if anything important breaks. In Java/Scala, you
can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.2?
===

The current list of open tickets targeted at 2.3.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.3.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-08-06 Thread Saisai Shao
Yes, there'll be an RC4, still waiting for the fix of one issue.

Yuval Itzchakov  于2018年8月6日周一 下午6:10写道:

> Are there any plans to create an RC4? There's an important Kafka Source
> leak
> fix I've merged back to the 2.3 branch.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-29 Thread Saisai Shao
Sure, I will do the next RC. I'm still waiting for a CVE fix; if it can be
done in the next two days, I will also include that one.

Xiao Li  于2018年7月28日周六 上午12:05写道:

> The following blocker/important fixes have been merged to Spark 2.3 branch:
>
> https://issues.apache.org/jira/browse/SPARK-24927
> https://issues.apache.org/jira/browse/SPARK-24867
> https://issues.apache.org/jira/browse/SPARK-24891
>
> *Saisai*, could you start the next RC?
>
> Thanks,
>
> Xiao
>
>
> 2018-07-20 14:21 GMT-07:00 Tom Graves :
>
>> fyi, I merged in a couple jira that were critical (and I thought would be
>> good to include in the next release) that if we spin another RC will get
>> included, we should update the jira SPARK-24755
>> <https://github.com/apache/spark/commit/d0280ab818391fd11662647459f1e9e683b2bc8e>
>>  and SPARK-24677
>> <https://github.com/apache/spark/commit/7be70e29dd92de36dbb30ce39623d588f48e4cac>,
>> if anyone disagrees we could back those out but I think they would be good
>> to include.
>>
>> Tom
>>
>> On Thursday, July 19, 2018, 8:13:23 PM CDT, Saisai Shao <
>> sai.sai.s...@gmail.com> wrote:
>>
>>
>> Sure, I can wait for this and create another RC then.
>>
>> Thanks,
>> Saisai
>>
>> Xiao Li  于2018年7月20日周五 上午9:11写道:
>>
>> Yes. https://issues.apache.org/jira/browse/SPARK-24867 is the one I
>> created. The PR has been created. Since this is not rare, let us merge it
>> to 2.3.2?
>>
>> Reynold's PR is to get rid of AnalysisBarrier. That is better than
>> multiple patches we added for AnalysisBarrier after 2.3.0 release. We can
>> target it to 2.4.
>>
>> Thanks,
>>
>> Xiao
>>
>> 2018-07-19 17:48 GMT-07:00 Saisai Shao :
>>
>> I see, thanks Reynold.
>>
>> Reynold Xin  于2018年7月20日周五 上午8:46写道:
>>
>> Looking at the list of pull requests it looks like this is the ticket:
>> https://issues.apache.org/jira/browse/SPARK-24867
>>
>>
>>
>> On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin  wrote:
>>
>> I don't think my ticket should block this release. It's a big general
>> refactoring.
>>
>> Xiao do you have a ticket for the bug you found?
>>
>>
>> On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao 
>> wrote:
>>
>> Hi Xiao,
>>
>> Are you referring to this JIRA (
>> https://issues.apache.org/jira/browse/SPARK-24865)?
>>
>> Xiao Li  于2018年7月20日周五 上午2:41写道:
>>
>> dfWithUDF.cache()
>> dfWithUDF.write.saveAsTable("t")
>> dfWithUDF.write.saveAsTable("t1")
>>
>>
>> Cached data is not being used. It causes a big performance regression.
>>
>>
>>
>>
>> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>>
>> What regression are you referring to here? A -1 vote really needs a
>> rationale.
>>
>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:
>>
>> I would first vote -1.
>>
>> I might find another regression caused by the analysis barrier. Will keep
>> you posted.
>>
>>
>>
>>
>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Saisai Shao
Sure, I can wait for this and create another RC then.

Thanks,
Saisai

Xiao Li  于2018年7月20日周五 上午9:11写道:

> Yes. https://issues.apache.org/jira/browse/SPARK-24867 is the one I
> created. The PR has been created. Since this is not rare, let us merge it
> to 2.3.2?
>
> Reynold's PR is to get rid of AnalysisBarrier. That is better than multiple
> patches we added for AnalysisBarrier after 2.3.0 release. We can target it
> to 2.4.
>
> Thanks,
>
> Xiao
>
> 2018-07-19 17:48 GMT-07:00 Saisai Shao :
>
>> I see, thanks Reynold.
>>
>> Reynold Xin  于2018年7月20日周五 上午8:46写道:
>>
>>> Looking at the list of pull requests it looks like this is the ticket:
>>> https://issues.apache.org/jira/browse/SPARK-24867
>>>
>>>
>>>
>>> On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin  wrote:
>>>
>>>> I don't think my ticket should block this release. It's a big general
>>>> refactoring.
>>>>
>>>> Xiao do you have a ticket for the bug you found?
>>>>
>>>>
>>>> On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao 
>>>> wrote:
>>>>
>>>>> Hi Xiao,
>>>>>
>>>>> Are you referring to this JIRA (
>>>>> https://issues.apache.org/jira/browse/SPARK-24865)?
>>>>>
>>>>> Xiao Li  于2018年7月20日周五 上午2:41写道:
>>>>>
>>>>>> dfWithUDF.cache()
>>>>>> dfWithUDF.write.saveAsTable("t")
>>>>>> dfWithUDF.write.saveAsTable("t1")
>>>>>>
>>>>>>
>>>>>> Cached data is not being used. It causes a big performance
>>>>>> regression.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>>>>>>
>>>>>>> What regression are you referring to here? A -1 vote really needs a
>>>>>>> rationale.
>>>>>>>
>>>>>>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I would first vote -1.
>>>>>>>>
>>>>>>>> I might find another regression caused by the analysis barrier.
>>>>>>>> Will keep you posted.
>>>>>>>>
>>>>>>>>
>>>>>>
>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Saisai Shao
I see, thanks Reynold.

Reynold Xin  于2018年7月20日周五 上午8:46写道:

> Looking at the list of pull requests it looks like this is the ticket:
> https://issues.apache.org/jira/browse/SPARK-24867
>
>
>
> On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin  wrote:
>
>> I don't think my ticket should block this release. It's a big general
>> refactoring.
>>
>> Xiao do you have a ticket for the bug you found?
>>
>>
>> On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao 
>> wrote:
>>
>>> Hi Xiao,
>>>
>>> Are you referring to this JIRA (
>>> https://issues.apache.org/jira/browse/SPARK-24865)?
>>>
>>> Xiao Li  于2018年7月20日周五 上午2:41写道:
>>>
>>>> dfWithUDF.cache()
>>>> dfWithUDF.write.saveAsTable("t")
>>>> dfWithUDF.write.saveAsTable("t1")
>>>>
>>>>
>>>> Cached data is not being used. It causes a big performance regression.
>>>>
>>>>
>>>>
>>>>
>>>> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>>>>
>>>>> What regression are you referring to here? A -1 vote really needs a
>>>>> rationale.
>>>>>
>>>>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:
>>>>>
>>>>>> I would first vote -1.
>>>>>>
>>>>>> I might find another regression caused by the analysis barrier. Will
>>>>>> keep you posted.
>>>>>>
>>>>>>
>>>>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Saisai Shao
Hi Xiao,

Are you referring to this JIRA (
https://issues.apache.org/jira/browse/SPARK-24865)?

Xiao Li  于2018年7月20日周五 上午2:41写道:

> dfWithUDF.cache()
> dfWithUDF.write.saveAsTable("t")
> dfWithUDF.write.saveAsTable("t1")
>
>
> Cached data is not being used. It causes a big performance regression.
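
Since the quoted snippet does not show how dfWithUDF is built, a minimal
self-contained sketch of the pattern under discussion might look like the
following; the UDF, schema, and table names are assumptions for illustration,
not taken from the JIRA.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical reconstruction of the reported pattern: cache a DataFrame that
// contains a UDF column, then write it to two tables. The regression being
// discussed is that the second write recomputes the plan instead of reading
// the cached data.
val spark = SparkSession.builder().appName("cached-udf-sketch").getOrCreate()
val plusOne = udf((i: Long) => i + 1)

val dfWithUDF = spark.range(1000000).toDF("id")
  .withColumn("id_plus_one", plusOne(col("id")))

dfWithUDF.cache()
dfWithUDF.write.saveAsTable("t")   // first write materializes the cache
dfWithUDF.write.saveAsTable("t1")  // expected to reuse the cached data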
>
>
>
>
> 2018-07-19 11:32 GMT-07:00 Sean Owen :
>
>> What regression are you referring to here? A -1 vote really needs a
>> rationale.
>>
>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:
>>
>>> I would first vote -1.
>>>
>>> I might find another regression caused by the analysis barrier. Will
>>> keep you posted.
>>>
>>>
>


Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-16 Thread Saisai Shao
I will put my +1 on this RC.

For the test failure fix, I will include it if there's another RC.

Sean Owen  于2018年7月16日周一 下午10:47写道:

> OK, hm, will try to get to the bottom of it. But if others can build this
> module successfully, I give a +1 . The test failure is inevitable here and
> should not block release.
>
> On Sun, Jul 15, 2018 at 9:39 PM Saisai Shao 
> wrote:
>
>> Hi Sean,
>>
>> I just did a clean build with mvn/sbt on 2.3.2, I didn't meet the errors
>> you pasted here. I'm not sure how it happens.
>>
>> Sean Owen  于2018年7月16日周一 上午6:30写道:
>>
>>> Looks good to me, with the following caveats.
>>>
>>> First see the discussion on
>>> https://issues.apache.org/jira/browse/SPARK-24813 ; the
>>> flaky HiveExternalCatalogVersionsSuite will probably fail all the time
>>> right now. That's not a regression and is a test-only issue, so don't think
>>> it must block the release. However if this fix holds up, and we need
>>> another RC, worth pulling in for sure.
>>>
>>> Also is anyone seeing this while building and testing the Spark SQL +
>>> Kafka module? I see this error even after a clean rebuild. I sort of get
>>> what the error is saying but can't figure out why it would only happen at
>>> test/runtime. Haven't seen it before.
>>>
>>> [error] missing or invalid dependency detected while loading class file
>>> 'MetricsSystem.class'.
>>>
>>> [error] Could not access term eclipse in package org,
>>>
>>> [error] because it (or its dependencies) are missing. Check your build
>>> definition for
>>>
>>> [error] missing or conflicting dependencies. (Re-run with
>>> `-Ylog-classpath` to see the problematic classpath.)
>>>
>>> [error] A full rebuild may help if 'MetricsSystem.class' was compiled
>>> against an incompatible version of org.
>>>
>>> [error] missing or invalid dependency detected while loading class file
>>> 'MetricsSystem.class'.
>>>
>>> [error] Could not access term jetty in value org.eclipse,
>>>
>>> [error] because it (or its dependencies) are missing. Check your build
>>> definition for
>>>
>>> [error] missing or conflicting dependencies. (Re-run with
>>> `-Ylog-classpath` to see the problematic classpath.)
>>>
>>> [error] A full rebuild may help if 'MetricsSystem.class' was compiled
>>> against an incompatible version of org.eclipse
>>>
>>> On Sun, Jul 15, 2018 at 3:09 AM Saisai Shao 
>>> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.3.2.
>>>>
>>>> The vote is open until July 20 PST and passes if a majority +1 PMC
>>>> votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.3.2
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.3.2-rc3
>>>> (commit b3726dadcf2997f20231873ec6e057dba433ae64):
>>>> https://github.com/apache/spark/tree/v2.3.2-rc3
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc3-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1278/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc3-docs/
>>>>
>>>> The list of bug fixes going into 2.3.2 can be found at the following
>>>> URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>>>
>>>> Note. RC2 was cancelled because of one blocking issue SPARK-24781
>>>> during release preparation.
>>>>
>>>> FAQ
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>>
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-15 Thread Saisai Shao
Hi Sean,

I just did a clean build with mvn/sbt on 2.3.2 and didn't hit the errors
you pasted here. I'm not sure how they happen.

Sean Owen  于2018年7月16日周一 上午6:30写道:

> Looks good to me, with the following caveats.
>
> First see the discussion on
> https://issues.apache.org/jira/browse/SPARK-24813 ; the
> flaky HiveExternalCatalogVersionsSuite will probably fail all the time
> right now. That's not a regression and is a test-only issue, so don't think
> it must block the release. However if this fix holds up, and we need
> another RC, worth pulling in for sure.
>
> Also is anyone seeing this while building and testing the Spark SQL +
> Kafka module? I see this error even after a clean rebuild. I sort of get
> what the error is saying but can't figure out why it would only happen at
> test/runtime. Haven't seen it before.
>
> [error] missing or invalid dependency detected while loading class file
> 'MetricsSystem.class'.
>
> [error] Could not access term eclipse in package org,
>
> [error] because it (or its dependencies) are missing. Check your build
> definition for
>
> [error] missing or conflicting dependencies. (Re-run with
> `-Ylog-classpath` to see the problematic classpath.)
>
> [error] A full rebuild may help if 'MetricsSystem.class' was compiled
> against an incompatible version of org.
>
> [error] missing or invalid dependency detected while loading class file
> 'MetricsSystem.class'.
>
> [error] Could not access term jetty in value org.eclipse,
>
> [error] because it (or its dependencies) are missing. Check your build
> definition for
>
> [error] missing or conflicting dependencies. (Re-run with
> `-Ylog-classpath` to see the problematic classpath.)
>
> [error] A full rebuild may help if 'MetricsSystem.class' was compiled
> against an incompatible version of org.eclipse
>
> On Sun, Jul 15, 2018 at 3:09 AM Saisai Shao 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.3.2.
>>
>> The vote is open until July 20 PST and passes if a majority +1 PMC votes
>> are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.3.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.3.2-rc3
>> (commit b3726dadcf2997f20231873ec6e057dba433ae64):
>> https://github.com/apache/spark/tree/v2.3.2-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1278/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc3-docs/
>>
>> The list of bug fixes going into 2.3.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>
>> Note. RC2 was cancelled because of one blocking issue SPARK-24781 during
>> release preparation.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.3.2?
>> ===
>>
>> The current list of open tickets targeted at 2.3.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.3.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>


[VOTE] SPARK 2.3.2 (RC3)

2018-07-15 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version
2.3.2.

The vote is open until July 20 PST and passes if a majority +1 PMC votes
are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.2-rc3
(commit b3726dadcf2997f20231873ec6e057dba433ae64):
https://github.com/apache/spark/tree/v2.3.2-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1278/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc3-docs/

The list of bug fixes going into 2.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343289

Note. RC2 was cancelled because of one blocking issue SPARK-24781 during
release preparation.

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install
the current RC to see if anything important breaks. In Java/Scala, you
can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.2?
===

The current list of open tickets targeted at 2.3.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.3.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-11 Thread Saisai Shao
Hi Sean,

The docs for RC1 are not usable because of a Sphinx issue. They should be
rebuilt with Python 3 to avoid it. Also, there's one more blocking issue in
SQL, so I will wait for that before cutting a new RC.

Sean Owen  于2018年7月12日周四 上午9:05写道:

> I guess my question is just whether the Python docs are usable or not in
> this RC. They looked reasonable to me but I don't know enough to know what
> the issue was. If the result is usable, then there's no problem here, even
> if something could be fixed/improved later.
>
> On Sun, Jul 8, 2018 at 7:25 PM Saisai Shao  wrote:
>
>> Hi Sean,
>>
>> SPARK-24530 is not included in this RC1 release. Actually I'm so familiar
>> with this issue so still using python2 to generate docs.
>>
>> In the JIRA it mentioned that python3 with sphinx could workaround this
>> issue. @Hyukjin Kwon  would you please help to
>> clarify?
>>
>> Thanks
>> Saisai
>>
>>
>> Xiao Li  于2018年7月9日周一 上午1:59写道:
>>
>>> Three business days might be too short. Let us open the vote until the
>>> end of this Friday (July 13th)?
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> 2018-07-08 10:15 GMT-07:00 Sean Owen :
>>>
>>>> Just checking that the doc issue in
>>>> https://issues.apache.org/jira/browse/SPARK-24530 is worked around in
>>>> this release?
>>>>
>>>> This was pointed out as an example of a broken doc:
>>>>
>>>> https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>>>
>>>> Here it is in 2.3.2 RC1:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>>>
>>>> It wasn't immediately obvious to me whether this addressed the issue
>>>> that was identified or not.
>>>>
>>>>
>>>> Otherwise nothing is open for 2.3.2, sigs and license look good, tests
>>>> pass as last time, etc.
>>>>
>>>> +1
>>>>
>>>> On Sun, Jul 8, 2018 at 3:30 AM Saisai Shao 
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.3.2.
>>>>>
>>>>> The vote is open until July 11th PST and passes if a majority +1 PMC
>>>>> votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.3.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.3.2-rc1
>>>>> (commit 4df06b45160241dbb331153efbb25703f913c192):
>>>>> https://github.com/apache/spark/tree/v2.3.2-rc1
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1277/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/
>>>>>
>>>>> The list of bug fixes going into 2.3.2 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>>>>
>>>>> PS. This is my first time to do release, please help to check if
>>>>> everything is landing correctly. Thanks ^-^
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything impor

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Thanks @Hyukjin Kwon. Yes, I'm using Python 2 to build the
docs; it looks like Python 2 with Sphinx has issues.

What is still pending for this PR (
https://github.com/apache/spark/pull/21659)? I'm planning to cut RC2 once
it is merged; do you have an ETA for this PR?

Hyukjin Kwon  于2018年7月9日周一 上午9:06写道:

> Seems Python 2's Sphinx was used -
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
> and the SPARK-24530 issue exists in the RC. It's kind of tricky to manually
> verify whether Python 3 was used, given my few tries locally.
>
> I think the fix against SPARK-24530 is technically not merged yet;
> however, I don't think this blocks the release like the previous release. I
> think we could proceed in parallel.
> I will probably make progress on
> https://github.com/apache/spark/pull/21659, and fix the release doc too.
>
>
> 2018년 7월 9일 (월) 오전 8:25, Saisai Shao 님이 작성:
>
>> Hi Sean,
>>
>> SPARK-24530 is not included in this RC1 release. Actually I'm so familiar
>> with this issue so still using python2 to generate docs.
>>
>> In the JIRA it mentioned that python3 with sphinx could workaround this
>> issue. @Hyukjin Kwon  would you please help to
>> clarify?
>>
>> Thanks
>> Saisai
>>
>>
>> Xiao Li  于2018年7月9日周一 上午1:59写道:
>>
>>> Three business days might be too short. Let us open the vote until the
>>> end of this Friday (July 13th)?
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> 2018-07-08 10:15 GMT-07:00 Sean Owen :
>>>
>>>> Just checking that the doc issue in
>>>> https://issues.apache.org/jira/browse/SPARK-24530 is worked around in
>>>> this release?
>>>>
>>>> This was pointed out as an example of a broken doc:
>>>>
>>>> https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>>>
>>>> Here it is in 2.3.2 RC1:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>>>
>>>> It wasn't immediately obvious to me whether this addressed the issue
>>>> that was identified or not.
>>>>
>>>>
>>>> Otherwise nothing is open for 2.3.2, sigs and license look good, tests
>>>> pass as last time, etc.
>>>>
>>>> +1
>>>>
>>>> On Sun, Jul 8, 2018 at 3:30 AM Saisai Shao 
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.3.2.
>>>>>
>>>>> The vote is open until July 11th PST and passes if a majority +1 PMC
>>>>> votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.3.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.3.2-rc1
>>>>> (commit 4df06b45160241dbb331153efbb25703f913c192):
>>>>> https://github.com/apache/spark/tree/v2.3.2-rc1
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1277/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/
>>>>>
>>>>> The list of bug fixes going into 2.3.2 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>>>>
>>>>> PS. This is my first time to do release, please help to check if
>>>>> everything is landing correctly. Thanks ^-^
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Hi Sean,

SPARK-24530 is not included in this RC1 release. Actually, I'm not so
familiar with this issue, so I'm still using Python 2 to generate the docs.

In the JIRA it mentioned that python3 with sphinx could workaround this
issue. @Hyukjin Kwon  would you please help to clarify?

Thanks
Saisai


Xiao Li  于2018年7月9日周一 上午1:59写道:

> Three business days might be too short. Let us open the vote until the end
> of this Friday (July 13th)?
>
> Cheers,
>
> Xiao
>
> 2018-07-08 10:15 GMT-07:00 Sean Owen :
>
>> Just checking that the doc issue in
>> https://issues.apache.org/jira/browse/SPARK-24530 is worked around in
>> this release?
>>
>> This was pointed out as an example of a broken doc:
>>
>> https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>
>> Here it is in 2.3.2 RC1:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>
>> It wasn't immediately obvious to me whether this addressed the issue that
>> was identified or not.
>>
>>
>> Otherwise nothing is open for 2.3.2, sigs and license look good, tests
>> pass as last time, etc.
>>
>> +1
>>
>> On Sun, Jul 8, 2018 at 3:30 AM Saisai Shao 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.3.2.
>>>
>>> The vote is open until July 11th PST and passes if a majority +1 PMC
>>> votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.2-rc1
>>> (commit 4df06b45160241dbb331153efbb25703f913c192):
>>> https://github.com/apache/spark/tree/v2.3.2-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1277/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/
>>>
>>> The list of bug fixes going into 2.3.2 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>>>
>>> PS. This is my first time to do release, please help to check if
>>> everything is landing correctly. Thanks ^-^
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.2?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.3.2 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 2.3.2
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>
>


[VOTE] SPARK 2.3.2 (RC1)

2018-07-08 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version
2.3.2.

The vote is open until July 11th PST and passes if a majority +1 PMC votes
are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.2-rc1
(commit 4df06b45160241dbb331153efbb25703f913c192):
https://github.com/apache/spark/tree/v2.3.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1277/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/

The list of bug fixes going into 2.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343289

PS. This is my first time doing a release, so please help check that
everything has landed correctly. Thanks ^-^

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install
the current RC to see if anything important breaks. In Java/Scala, you
can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.3.2?
===

The current list of open tickets targeted at 2.3.2 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.3.2

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: Time for 2.3.2?

2018-07-03 Thread Saisai Shao
FYI, we currently have one blocking issue (
https://issues.apache.org/jira/browse/SPARK-24535); I will start the release
after it is fixed.

Also, please let me know if there are other blockers or fixes that need to
land in the 2.3.2 release.

Thanks
Saisai

Saisai Shao  于2018年7月2日周一 下午1:16写道:

> I will start preparing the release.
>
> Thanks
>
> John Zhuge  于2018年6月30日周六 上午10:31写道:
>
>> +1  Looking forward to the critical fixes in 2.3.2.
>>
>> On Thu, Jun 28, 2018 at 9:37 AM Ryan Blue 
>> wrote:
>>
>>> +1
>>>
>>> On Thu, Jun 28, 2018 at 9:34 AM Xiao Li  wrote:
>>>
>>>> +1. Thanks, Saisai!
>>>>
>>>> The impact of SPARK-24495 is large. We should release Spark 2.3.2 ASAP.
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>> 2018-06-27 23:28 GMT-07:00 Takeshi Yamamuro :
>>>>
>>>>> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>>>>>
>>>>> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Wenchen Fan 于2018年6月28日 周四下午2:06写道:
>>>>>>
>>>>>>> Hi Saisai, that's great! please go ahead!
>>>>>>>
>>>>>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1, like mentioned by Marcelo, these issues seems quite severe.
>>>>>>>>
>>>>>>>> I can work on the release if short of hands :).
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Jerry
>>>>>>>>
>>>>>>>>
>>>>>>>> Marcelo Vanzin  于2018年6月28日周四
>>>>>>>> 上午11:40写道:
>>>>>>>>
>>>>>>>>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get
>>>>>>>>> fixes
>>>>>>>>> for those out.
>>>>>>>>>
>>>>>>>>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>>>>>>>>
>>>>>>>>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>>>>>>>>> wrote:
>>>>>>>>> > Hi all,
>>>>>>>>> >
>>>>>>>>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>>>>>>>>> discovered
>>>>>>>>> > and fixed some critical issues afterward.
>>>>>>>>> >
>>>>>>>>> > SPARK-24495: SortMergeJoin may produce wrong result.
>>>>>>>>> > This is a serious correctness bug, and is easy to hit: have
>>>>>>>>> duplicated join
>>>>>>>>> > key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a =
>>>>>>>>> t2.c`, and the
>>>>>>>>> > join is a sort merge join. This bug is only present in Spark 2.3.
>>>>>>>>> >
>>>>>>>>> > SPARK-24588: stream-stream join may produce wrong result
>>>>>>>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>>>>>>>> stream-stream
>>>>>>>>> > join. Users can hit this bug if one of the join side is
>>>>>>>>> partitioned by a
>>>>>>>>> > subset of the join keys.
>>>>>>>>> >
>>>>>>>>> > SPARK-24552: Task attempt numbers are reused when stages are
>>>>>>>>> retried
>>>>>>>>> > This is a long-standing bug in the output committer that may
>>>>>>>>> introduce data
>>>>>>>>> > corruption.
>>>>>>>>> >
>>>>>>>>> > SPARK-24542: UDFXPath allow users to pass carefully crafted
>>>>>>>>> XML to
>>>>>>>>> > access arbitrary files
>>>>>>>>> > This is a potential security issue if users build access control
>>>>>>>>> module upon
>>>>>>>>> > Spark.
>>>>>>>>> >
>>>>>>>>> > I think we need a Spark 2.3.2 to address these issues(especially
>>>>>>>>> the
>>>>>>>>> > correctness bugs) ASAP. Any thoughts?
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> > Wenchen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Marcelo
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -
>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>> --
>>> John Zhuge
>>>
>>


Re: Time for 2.3.2?

2018-07-01 Thread Saisai Shao
I will start preparing the release.

Thanks

John Zhuge  于2018年6月30日周六 上午10:31写道:

> +1  Looking forward to the critical fixes in 2.3.2.
>
> On Thu, Jun 28, 2018 at 9:37 AM Ryan Blue 
> wrote:
>
>> +1
>>
>> On Thu, Jun 28, 2018 at 9:34 AM Xiao Li  wrote:
>>
>>> +1. Thanks, Saisai!
>>>
>>> The impact of SPARK-24495 is large. We should release Spark 2.3.2 ASAP.
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> 2018-06-27 23:28 GMT-07:00 Takeshi Yamamuro :
>>>
>>>> +1, I heard some Spark users have skipped v2.3.1 because of these bugs.
>>>>
>>>> On Thu, Jun 28, 2018 at 3:09 PM Xingbo Jiang 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Wenchen Fan 于2018年6月28日 周四下午2:06写道:
>>>>>
>>>>>> Hi Saisai, that's great! please go ahead!
>>>>>>
>>>>>> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
>>>>>> wrote:
>>>>>>
>>>>>>> +1, like mentioned by Marcelo, these issues seems quite severe.
>>>>>>>
>>>>>>> I can work on the release if short of hands :).
>>>>>>>
>>>>>>> Thanks
>>>>>>> Jerry
>>>>>>>
>>>>>>>
>>>>>>> Marcelo Vanzin  于2018年6月28日周四
>>>>>>> 上午11:40写道:
>>>>>>>
>>>>>>>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get
>>>>>>>> fixes
>>>>>>>> for those out.
>>>>>>>>
>>>>>>>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>>>>>>>
>>>>>>>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>>>>>>>> wrote:
>>>>>>>> > Hi all,
>>>>>>>> >
>>>>>>>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>>>>>>>> discovered
>>>>>>>> > and fixed some critical issues afterward.
>>>>>>>> >
>>>>>>>> > SPARK-24495: SortMergeJoin may produce wrong result.
>>>>>>>> > This is a serious correctness bug, and is easy to hit: have
>>>>>>>> duplicated join
>>>>>>>> > key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a =
>>>>>>>> t2.c`, and the
>>>>>>>> > join is a sort merge join. This bug is only present in Spark 2.3.
>>>>>>>> >
>>>>>>>> > SPARK-24588: stream-stream join may produce wrong result
>>>>>>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>>>>>>> stream-stream
>>>>>>>> > join. Users can hit this bug if one of the join side is
>>>>>>>> partitioned by a
>>>>>>>> > subset of the join keys.
>>>>>>>> >
>>>>>>>> > SPARK-24552: Task attempt numbers are reused when stages are
>>>>>>>> retried
>>>>>>>> > This is a long-standing bug in the output committer that may
>>>>>>>> introduce data
>>>>>>>> > corruption.
>>>>>>>> >
>>>>>>>> > SPARK-24542: UDFXPath allow users to pass carefully crafted
>>>>>>>> XML to
>>>>>>>> > access arbitrary files
>>>>>>>> > This is a potential security issue if users build access control
>>>>>>>> module upon
>>>>>>>> > Spark.
>>>>>>>> >
>>>>>>>> > I think we need a Spark 2.3.2 to address these issues(especially
>>>>>>>> the
>>>>>>>> > correctness bugs) ASAP. Any thoughts?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Wenchen
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Marcelo
>>>>>>>>
>>>>>>>>
>>>>>>>> -
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>> --
>> John Zhuge
>>
>


Re: Time for 2.3.2?

2018-06-27 Thread Saisai Shao
+1, as mentioned by Marcelo, these issues seem quite severe.

I can work on the release if we're short of hands :).

Thanks
Jerry


Marcelo Vanzin  于2018年6月28日周四 上午11:40写道:

> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
> for those out.
>
> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>
> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan  wrote:
> > Hi all,
> >
> > Spark 2.3.1 was released just a while ago, but unfortunately we
> discovered
> > and fixed some critical issues afterward.
> >
> > SPARK-24495: SortMergeJoin may produce wrong result.
> > This is a serious correctness bug, and is easy to hit: have duplicated
> join
> > key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`, and
> the
> > join is a sort merge join. This bug is only present in Spark 2.3.
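
A hypothetical sketch of the join shape being described (table and column
names are illustrative, not taken from the JIRA; it assumes an active
SparkSession named spark, e.g. in spark-shell) could look like this:

// The same left-side key t1.a appears in two equality conditions, the shape
// reported to trigger wrong results when a sort-merge join is chosen.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // avoid broadcast joins
val joined = spark.sql(
  """SELECT *
    |FROM t1 JOIN t2
    |  ON t1.a = t2.b AND t1.a = t2.c""".stripMargin)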
> >
> > SPARK-24588: stream-stream join may produce wrong result
> > This is a correctness bug in a new feature of Spark 2.3: the
> stream-stream
> > join. Users can hit this bug if one of the join side is partitioned by a
> > subset of the join keys.
> >
> > SPARK-24552: Task attempt numbers are reused when stages are retried
> > This is a long-standing bug in the output committer that may introduce
> data
> > corruption.
> >
> > SPARK-24542: UDFXPath allow users to pass carefully crafted XML to
> > access arbitrary files
> > This is a potential security issue if users build access control module
> upon
> > Spark.
> >
> > I think we need a Spark 2.3.2 to address these issues(especially the
> > correctness bugs) ASAP. Any thoughts?
> >
> > Thanks,
> > Wenchen
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread Saisai Shao
+1, checked new py4j related changes.

Marcelo Vanzin  于2018年5月17日周四 上午5:41写道:

> This is actually in 2.3, jira is just missing the version.
>
> https://github.com/apache/spark/pull/20765
>
> On Wed, May 16, 2018 at 2:14 PM, kant kodali  wrote:
> > I am not sure how SPARK-23406 is a new feature. since streaming joins are
> > already part of SPARK 2.3.0. The self joins didn't work because of a bug
> and
> > it is fixed but I can understand if it touches some other code paths.
> >
> > On Wed, May 16, 2018 at 3:22 AM, Marco Gaido 
> wrote:
> >>
> >> I'd be against having a new feature in a minor maintenance release. I
> >> think such a release should contain only bugfixes.
> >>
> >> 2018-05-16 12:11 GMT+02:00 kant kodali :
> >>>
> >>> Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of
> >>> 2.3.1?
> >>>
> >>> On Tue, May 15, 2018 at 2:07 PM, Marcelo Vanzin 
> >>> wrote:
> 
>  Bummer. People should still feel welcome to test the existing RC so we
>  can rule out other issues.
> 
>  On Tue, May 15, 2018 at 2:04 PM, Xiao Li 
> wrote:
>  > -1
>  >
>  > We have a correctness bug fix that was merged after 2.3 RC1. It
> would
>  > be
>  > nice to have that in Spark 2.3.1 release.
>  >
>  > https://issues.apache.org/jira/browse/SPARK-24259
>  >
>  > Xiao
>  >
>  >
>  > 2018-05-15 14:00 GMT-07:00 Marcelo Vanzin :
>  >>
>  >> Please vote on releasing the following candidate as Apache Spark
>  >> version
>  >> 2.3.1.
>  >>
>  >> The vote is open until Friday, May 18, at 21:00 UTC and passes if
>  >> a majority of at least 3 +1 PMC votes are cast.
>  >>
>  >> [ ] +1 Release this package as Apache Spark 2.3.1
>  >> [ ] -1 Do not release this package because ...
>  >>
>  >> To learn more about Apache Spark, please see
> http://spark.apache.org/
>  >>
>  >> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
>  >> https://github.com/apache/spark/tree/v2.3.0-rc1
>  >>
>  >> The release files, including signatures, digests, etc. can be found
>  >> at:
>  >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
>  >>
>  >> Signatures used for Spark RCs can be found in this file:
>  >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>  >>
>  >> The staging repository for this release can be found at:
>  >>
>  >>
> https://repository.apache.org/content/repositories/orgapachespark-1269/
>  >>
>  >> The documentation corresponding to this release can be found at:
>  >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
>  >>
>  >> The list of bug fixes going into 2.3.1 can be found at the
> following
>  >> URL:
>  >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>  >>
>  >> FAQ
>  >>
>  >> =
>  >> How can I help test this release?
>  >> =
>  >>
>  >> If you are a Spark user, you can help us test this release by
> taking
>  >> an existing Spark workload and running on this release candidate,
>  >> then
>  >> reporting any regressions.
>  >>
>  >> If you're working in PySpark you can set up a virtual env and
> install
>  >> the current RC and see if anything important breaks, in the
>  >> Java/Scala
>  >> you can add the staging repository to your projects resolvers and
>  >> test
>  >> with the RC (make sure to clean up the artifact cache before/after
> so
>  >> you don't end up building with a out of date RC going forward).
>  >>
>  >> ===
>  >> What should happen to JIRA tickets still targeting 2.3.1?
>  >> ===
>  >>
>  >> The current list of open tickets targeted at 2.3.1 can be found at:
>  >> https://s.apache.org/Q3Uo
>  >>
>  >> Committers should look at those and triage. Extremely important bug
>  >> fixes, documentation, and API tweaks that impact compatibility
> should
>  >> be worked on immediately. Everything else please retarget to an
>  >> appropriate release.
>  >>
>  >> ==
>  >> But my bug isn't fixed?
>  >> ==
>  >>
>  >> In order to make timely releases, we will typically not hold the
>  >> release unless the bug in question is a regression from the
> previous
>  >> release. That being said, if there is something which is a
> regression
>  >> that has not been correctly targeted please ping me or a committer
> to
>  >> help target the issue.
>  >>
>  >>
>  >> --
>  >> Marcelo
>  >>
>  >>
> -
>  >> To 

Re: Hadoop 3 support

2018-04-02 Thread Saisai Shao
Yes, the main blocking issue is that the Hive version used in Spark
(1.2.1.spark) doesn't support running on Hadoop 3. Hive checks the Hadoop
version at runtime [1]. Besides this, I think some POM changes should be
enough to support Hadoop 3.

If we want to use the Hadoop 3 shaded client jars, then the POM requires a
lot of changes, but this is not necessary.


[1]
https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144
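
As a simplified sketch (not Hive's actual code) of the kind of runtime check
the ShimLoader linked above performs, something like the following reads the
Hadoop version and rejects major versions it does not recognize, which is why
the bundled Hive 1.2.1.spark fails on Hadoop 3:

import org.apache.hadoop.util.VersionInfo

// Illustration only: Hive's real ShimLoader maps known major versions to shim
// classes; the key point is that an unrecognized major version (e.g. 3) is rejected.
def hadoopMajorVersion(): Int = {
  val version = VersionInfo.getVersion // e.g. "2.7.3" or "3.1.0"
  version.split("\\.").headOption.map(_.toInt) match {
    case Some(v @ (1 | 2)) => v // the only majors the bundled Hive shims know about
    case _ => throw new IllegalArgumentException(
      s"Unrecognized Hadoop major version number: $version")
  }
}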

2018-04-03 4:57 GMT+08:00 Marcelo Vanzin :

> Saisai filed SPARK-23534, but the main blocking issue is really
> SPARK-18673.
>
>
> On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin  wrote:
> > Does anybody know what needs to be done in order for Spark to support
> Hadoop
> > 3?
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Saisai Shao
Congrats, Zhenhua!

2018-04-02 16:57 GMT+08:00 Takeshi Yamamuro :

> Congrats, Zhenhua!
>
> On Mon, Apr 2, 2018 at 4:13 PM, Ted Yu  wrote:
>
>> Congratulations, Zhenhua
>>
>>  Original message 
>> From: 雨中漫步 <601450...@qq.com>
>> Date: 4/1/18 11:30 PM (GMT-08:00)
>> To: Yuanjian Li , Wenchen Fan <
>> cloud0...@gmail.com>
>> Cc: dev 
>> Subject: Re: Welcome Zhenhua Wang as a Spark committer
>>
>> Congratulations Zhenhua Wang
>>
>>
>> -- Original Message --
>> *From:* "Yuanjian Li";
>> *Sent:* Monday, April 2, 2018, 2:26 PM
>> *To:* "Wenchen Fan";
>> *Cc:* "Spark dev list";
>> *Subject:* Re: Welcome Zhenhua Wang as a Spark committer
>>
>> Congratulations Zhenhua!!
>>
>> 2018-04-02 13:28 GMT+08:00 Wenchen Fan :
>>
>>> Hi all,
>>>
>>> The Spark PMC recently added Zhenhua Wang as a committer on the project.
>>> Zhenhua is the major contributor of the CBO project, and has been
>>> contributing across several areas of Spark for a while, focusing especially
>>> on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>>>
>>> Wenchen
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Welcoming some new committers

2018-03-02 Thread Saisai Shao
Congrats to everyone!

Thanks
Jerry

2018-03-03 15:30 GMT+08:00 Liang-Chi Hsieh :

>
> Congrats to everyone!
>
>
> Kazuaki Ishizaki wrote
> > Congratulations to everyone!
> >
> > Kazuaki Ishizaki
> >
> >
> >
> > From:   Takeshi Yamamuro 
>
> > linguin.m.s@
>
> > 
> > To: Spark dev list 
>
> > dev@.apache
>
> > 
> > Date:   2018/03/03 10:45
> > Subject:Re: Welcoming some new committers
> >
> >
> >
> > Congrats, all!
> >
> > On Sat, Mar 3, 2018 at 10:34 AM, Takuya UESHIN 
>
> > ueshin@
>
> > 
> > wrote:
> > Congratulations and welcome!
> >
> > On Sat, Mar 3, 2018 at 10:21 AM, Xingbo Jiang 
>
> > jiangxb1987@
>
> > 
> > wrote:
> > Congratulations to everyone!
> >
> > 2018-03-03 8:51 GMT+08:00 Ilan Filonenko 
>
> > if56@
>
> > :
> > Congrats to everyone! :)
> >
> > On Fri, Mar 2, 2018 at 7:34 PM Felix Cheung 
>
> > felixcheung_m@
>
> > 
> > wrote:
> > Congrats and welcome!
> >
> >
> > From: Dongjoon Hyun 
>
> > dongjoon.hyun@
>
> > 
> > Sent: Friday, March 2, 2018 4:27:10 PM
> > To: Spark dev list
> > Subject: Re: Welcoming some new committers
> >
> > Congrats to all!
> >
> > Bests,
> > Dongjoon.
> >
> > On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan 
>
> > cloud0fan@
>
> >  wrote:
> > Congratulations to everyone and welcome!
> >
> > On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger 
>
> > cody@
>
> >  wrote:
> > Congrats to the new committers, and I appreciate the vote of confidence.
> >
> > On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia 
>
> > matei.zaharia@
>
> > 
> > wrote:
> >> Hi everyone,
> >>
> >> The Spark PMC has recently voted to add several new committers to the
> > project, based on their contributions to Spark 2.3 and other past work:
> >>
> >> - Anirudh Ramanathan (contributor to Kubernetes support)
> >> - Bryan Cutler (contributor to PySpark and Arrow support)
> >> - Cody Koeninger (contributor to streaming and Kafka support)
> >> - Erik Erlandson (contributor to Kubernetes support)
> >> - Matt Cheah (contributor to Kubernetes support and other parts of
> > Spark)
> >> - Seth Hendrickson (contributor to MLlib and PySpark)
> >>
> >> Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth as
> > committers!
> >>
> >> Matei
> >> -
> >> To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >>
> >
> > -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
> >
> >
> >
> >
> >
> >
> >
> > --
> > Takuya UESHIN
> > Tokyo, Japan
> >
> > http://twitter.com/ueshin
> >
> >
> >
> > --
> > ---
> > Takeshi Yamamuro
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Does anyone know how to build spark with scala12.4?

2017-11-28 Thread Saisai Shao
I see, thanks for your quick response.

Best regards,
Jerry

2017-11-29 10:45 GMT+08:00 Sean Owen <so...@cloudera.com>:

> Until the 2.12 build passes tests, no. There is still a real outstanding
> issue with the closure cleaner and serialization of closures as Java 8
> lambdas. I haven't cracked it, and don't think it's simple, but not
> insurmountable.
>
> The funny thing is most stuff appears to just work without cleaning said
> lambdas, because they don't generally capture references in the same way.
> So it may be reasonable to advertise 2.12 support as experimental and for
> people willing to make their own build. That's why I wanted it in good
> enough shape that the scala-2.12 profile produces something basically
> functional.
>
> On Tue, Nov 28, 2017 at 8:43 PM Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
>> Hi Sean,
>>
>> Two questions about Scala 2.12 for release artifacts.
>>
>> Are we planning to ship 2.12 artifacts for Spark 2.3 release? If not,
>> will we only ship 2.11 artifacts?
>>
>> Thanks
>> Jerry
>>
>> 2017-11-28 21:51 GMT+08:00 Sean Owen <so...@cloudera.com>:
>>
>>> The Scala 2.12 profile mostly works, but not all tests pass. Use
>>> -Pscala-2.12 on the command line to build.
>>>
>>> On Tue, Nov 28, 2017 at 5:36 AM Ofir Manor <ofir.ma...@equalum.io>
>>> wrote:
>>>
>>>> Hi,
>>>> as far as I know, Spark does not support Scala 2.12.
>>>> There is on-going work to make refactor / fix Spark source code to
>>>> support Scala 2.12 - look for multiple emails on this list in the last
>>>> months from Sean Owen on his progress.
>>>> Once Spark supports Scala 2.12, I think the next target would be JDK 9
>>>> support.
>>>>
>>>> Ofir Manor
>>>>
>>>> Co-Founder & CTO | Equalum
>>>>
>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>
>>>> On Tue, Nov 28, 2017 at 9:20 AM, Zhang, Liyun <liyun.zh...@intel.com>
>>>> wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>>   Does anyone know how to build spark with scala12.4? I want to test
>>>>> whether spark can work on jdk9 or not.  Scala12.4 supports jdk9.  Does
>>>>> anyone try to build spark with scala 12.4 or compile successfully with
>>>>> jdk9.Appreciate to get some feedback from you.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best Regards
>>>>>
>>>>> Kelly Zhang/Zhang,Liyun
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>


Re: Does anyone know how to build spark with scala12.4?

2017-11-28 Thread Saisai Shao
Hi Sean,

Two questions about Scala 2.12 for release artifacts.

Are we planning to ship 2.12 artifacts for Spark 2.3 release? If not, will
we only ship 2.11 artifacts?

Thanks
Jerry

2017-11-28 21:51 GMT+08:00 Sean Owen :

> The Scala 2.12 profile mostly works, but not all tests pass. Use
> -Pscala-2.12 on the command line to build.
>
> On Tue, Nov 28, 2017 at 5:36 AM Ofir Manor  wrote:
>
>> Hi,
>> as far as I know, Spark does not support Scala 2.12.
>> There is on-going work to make refactor / fix Spark source code to
>> support Scala 2.12 - look for multiple emails on this list in the last
>> months from Sean Owen on his progress.
>> Once Spark supports Scala 2.12, I think the next target would be JDK 9
>> support.
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Tue, Nov 28, 2017 at 9:20 AM, Zhang, Liyun 
>> wrote:
>>
>>> Hi all:
>>>
>>>   Does anyone know how to build spark with scala12.4? I want to test
>>> whether spark can work on jdk9 or not.  Scala12.4 supports jdk9.  Does
>>> anyone try to build spark with scala 12.4 or compile successfully with
>>> jdk9.Appreciate to get some feedback from you.
>>>
>>>
>>>
>>>
>>>
>>> Best Regards
>>>
>>> Kelly Zhang/Zhang,Liyun
>>>
>>>
>>>
>>
>>


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Saisai Shao
+1, looking forward to more design details of this feature.

Thanks
Jerry

On Wed, Nov 8, 2017 at 6:40 AM, Shixiong(Ryan) Zhu 
wrote:

> +1
>
> On Tue, Nov 7, 2017 at 1:34 PM, Joseph Bradley 
> wrote:
>
>> +1
>>
>> On Mon, Nov 6, 2017 at 5:11 PM, Michael Armbrust 
>> wrote:
>>
>>> +1
>>>
>>> On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li  wrote:
>>>
 +1

 2017-11-04 11:00 GMT-07:00 Burak Yavuz :

> +1
>
> On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan 
> wrote:
>
>> +1
>>
>> On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu > > wrote:
>>
>>> +1.
>>>
>>> On Sat, Nov 4, 2017 at 8:04 AM, Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>>
 +1 from me too.

 Matei

 > On Nov 3, 2017, at 4:59 PM, Wenchen Fan 
 wrote:
 >
 > +1.
 >
 > I think this architecture makes a lot of sense to let executors
 talk to source/sink directly, and bring very low latency.
 >
 > On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen 
 wrote:
 > +0 simply because I don't feel I know enough to have an opinion.
 I have no reason to doubt the change though, from a skim through the 
 doc.
 >
 >
 > On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin 
 wrote:
 > Earlier I sent out a discussion thread for CP in Structured
 Streaming:
 >
 > https://issues.apache.org/jira/browse/SPARK-20928
 >
 > It is meant to be a very small, surgical change to Structured
 Streaming to enable ultra-low latency. This is great timing because we 
 are
 also designing and implementing data source API v2. If designed 
 properly,
 we can have the same data source API working for both streaming and 
 batch.
 >
 >
 > Following the SPIP process, I'm putting this SPIP up for a vote.
 >
 > +1: Let's go ahead and design / implement the SPIP.
 > +0: Don't really care.
 > -1: I do not think this is a good idea for the following reasons.
 >
 >
 >


 
 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783 <(224)%20436-0783>
>> Greater Chicago
>>
>
>

>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] 
>>
>
>


Re: Moving Scala 2.12 forward one step

2017-08-31 Thread Saisai Shao
Hi Sean,

Do we have a planned target version for Scala 2.12 support? Several other
projects, such as Zeppelin and Livy, which rely on the Spark REPL, also
require changes to support Scala 2.12.

Thanks
Jerry

On Thu, Aug 31, 2017 at 5:55 PM, Sean Owen  wrote:

> No, this doesn't let Spark build and run on 2.12. It makes changes that
> will be required though, the ones that are really no loss to the current
> 2.11 build.
>
> On Thu, Aug 31, 2017, 10:48 Denis Bolshakov 
> wrote:
>
>> Hello,
>>
>> Sounds amazing. Is there any improvements in benchmarks?
>>
>>
>> On 31 August 2017 at 12:25, Sean Owen  wrote:
>>
>>> Calling attention to the question of Scala 2.12 again for moment. I'd
>>> like to make a modest step towards support. Have a look again, if you
>>> would, at SPARK-14280:
>>>
>>> https://github.com/apache/spark/pull/18645
>>>
>>> This is a lot of the change for 2.12 that doesn't break 2.11, and really
>>> doesn't add any complexity. It's mostly dependency updates and clarifying
>>> some code. Other items like dealing with Kafka 0.8 support, the 2.12 REPL,
>>> etc, are not  here.
>>>
>>> So, this still doesn't result in a working 2.12 build but it's most of
>>> the miscellany that will be required.
>>>
>>> I'd like to merge it but wanted to flag it for feedback as it's not
>>> trivial.
>>>
>>
>>
>>
>> --
>> //with Best Regards
>> --Denis Bolshakov
>> e-mail: bolshakov.de...@gmail.com
>>
>


Re: Spark 2.1.x client with 2.2.0 cluster

2017-08-10 Thread Saisai Shao
As I remember, using a Spark 2.1 driver to communicate with Spark 2.2
executors will throw some RPC exceptions (I don't remember the details of the
exception).

On Thu, Aug 10, 2017 at 4:23 PM, Ted Yu  wrote:

> Hi,
> Has anyone used Spark 2.1.x client with Spark 2.2.0 cluster ?
>
> If so, is there any compatibility issue observed ?
>
> Thanks
>


Re: Spark History Server does not redirect to Yarn aggregated logs for container logs

2017-06-08 Thread Saisai Shao
Yes, currently if the log is aggregated, accessing it through the UI does not
work. You can create a JIRA to improve this if you would like to.
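In the meantime, the aggregated logs can be pulled with the YARN CLI instead
of the UI (a sketch; the application id below is only illustrative, derived
from the container id in the report that follows):

    yarn logs -applicationId application_1496881617682_0003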

On Thu, Jun 8, 2017 at 1:43 PM, ckhari4u  wrote:

> Hey Guys,
>
> I am hitting the below issue when trying to access the STDOUT/STDERR logs
> in
> Spark History Server for the executors of a Spark application executed in
> Yarn mode. I have enabled Yarn log aggregation.
>
> Repro Steps:
>
> 1) Run the spark-shell in yarn client mode. Or run Pi job in Yarn mode.
> 2) Once the job is completed, (in the case of spark shell, exit after doing
> some simple operations), try to access the STDOUT or STDERR logs of the
> application from the Executors tab in the Spark History Server UI.
> 3) If yarn log aggregation is enabled, then logs won't be available in node
> manager's log location. But history Server is trying to access the logs
> from
> the nodemanager's log
> location({yarn.nodemanager.log-dirs}/application_${appid) giving below
> error
> in the UI:
>
>
>
> Failed redirect for container_e31_1496881617682_0003_01_02
> ResourceManager
> RM Home
> NodeManager
> Tools
> Failed while trying to construct the redirect url to the log server. Log
> Server url may not be configured
> java.lang.Exception: Unknown container. Container either has not started or
> has already completed or doesn't belong to this node at all.
>
>
> Either Spark History Server should be able to read from the aggregated logs
> and display the logs in the UI or it should give a graceful message. As of
> now its redirecting to the NM webpage and trying to fetch the logs from the
> node managers local location.
>
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-History-
> Server-does-not-redirect-to-Yarn-aggregated-logs-for-
> container-logs-tp21706.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: How about the fetch the shuffle data in one same machine?

2017-05-10 Thread Saisai Shao
There is a JIRA about this (https://issues.apache.org/jira/browse/SPARK-6521).
Currently the Spark shuffle fetch still goes through Netty even when the two
executors are on the same node, but according to the test on that JIRA, the
performance is close whether or not the network is bypassed. From my
understanding, the kernel will not push data out to the NIC for a loopback
connection (please correct me if I'm wrong).

On Wed, May 10, 2017 at 5:53 PM, raintung li  wrote:

> Hi all,
>
> Now Spark only think the executorId same that fetch local file, but for
> same IP different ExecutorId will fetch using network that actually it can
> be fetch in the local Or Loopback.
>
> Apparently fetch the local file that it is fast that can use the LVS
> cache.
>
> How do you think?
>
> Regards
> -Raintung
>


Re: Spark 2.0 and Yarn

2016-08-29 Thread Saisai Shao
This archive contains all the jars required by the Spark runtime. You could
zip all the jars under $SPARK_HOME/jars, upload the archive to HDFS, and then
configure spark.yarn.archive with the path of the archive on HDFS.
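A minimal sketch of that flow (all paths and file names here are only
illustrative):

    # build an archive from the local Spark jars and push it to HDFS
    jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
    hdfs dfs -mkdir -p /spark/jars
    hdfs dfs -put spark-libs.jar /spark/jars/

    # spark-defaults.conf
    spark.yarn.archive    hdfs:///spark/jars/spark-libs.jar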

On Sun, Aug 28, 2016 at 9:59 PM, Srikanth Sampath  wrote:

> Hi,
> With SPARK-11157, the big fat assembly jar build was removed.
>
> Has anyone used spark.yarn.archive - the alternative provided and
> successfully deployed Spark on a Yarn cluster.  If so, what does the
> archive
> contain.  What should be the minimal set.  Any suggestion is greatly
> appreciated.
>
> Thanks,
> -Srikanth
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-2-0-and-Yarn-tp18748.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-03 Thread Saisai Shao
Using the dominant resource calculator instead of the default resource
calculator will give you the vcore count you expect. Basically, by default
YARN does not honor CPU cores as a resource, so you will always see vcores =
1 no matter how many cores you set in Spark.
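For reference, the switch is the resource calculator setting in
capacity-scheduler.xml (a sketch, assuming the capacity scheduler is in use;
the ResourceManager needs a restart after the change):

    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>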

On Wed, Aug 3, 2016 at 12:11 PM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

> Hi All,
>
> I am trying to run a spark job using yarn, and i specify --executor-cores
> value as 20.
> But when i go check the "nodes of the cluster" page in
> http://hostname:8088/cluster/nodes then i see 4 containers getting
> created on each of the node in cluster.
>
> But can only see 1 vcore getting assigned for each containier, even when i
> specify --executor-cores 20 while submitting job using spark-submit.
>
> yarn-site.xml
> <property>
>   <name>yarn.scheduler.maximum-allocation-mb</name>
>   <value>6</value>
> </property>
> <property>
>   <name>yarn.scheduler.minimum-allocation-vcores</name>
>   <value>1</value>
> </property>
> <property>
>   <name>yarn.scheduler.maximum-allocation-vcores</name>
>   <value>40</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <value>7</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource.cpu-vcores</name>
>   <value>20</value>
> </property>
>
>
> Did anyone face the same issue??
>
> Regards,
> Satyajit.
>


Re: Issue with Spark Streaming UI

2016-05-24 Thread Saisai Shao
I think it is by design that FileInputDStream doesn't report input info:
FileInputDStream has no event/record concept (it is file based), so it is
hard to define how to report the input info correctly.

Currently, input info reporting is supported for all receiver-based
InputDStreams and for DirectKafkaInputDStream.

On Tue, May 24, 2016 at 2:42 PM, Sachin Janani 
wrote:

> Hi,
> I'm trying to run a simple spark streaming application with File Streaming
> and its working properly but when I try to monitor the number of events in
> the Streaming Ui it shows that as 0.Is this a issue and are there any plans
> to fix this.Attached is the screenshot of what it shows on the UI.
>
>
>
> Regards,
> Sachin Janani
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Saisai Shao
Hi Alex,

From my understanding, the community is shifting its effort from RDD-based
APIs to Dataset/DataFrame-based ones, so to me it is not so necessary to add
a new RDD-based API, as I mentioned before. Also, for the problem of too many
partitions, there are many other ways to handle it (one example is sketched
below).
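For instance (a rough sketch; the input path and target partition count are
only illustrative):

    // read the small files as usual, then merge partitions without a shuffle
    val lines = sc.textFile("hdfs:///data/many-small-files/*")
    val compacted = lines.coalesce(20000)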

Of course it is just my own thought.

Thanks
Saisai

On Fri, May 20, 2016 at 1:15 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Saisai, Reynold,
>
> Thank you for your replies.
> I also think that many variation of textFile() methods might be confusing
> for users. Better to have just one good textFile() implementation.
>
> Do you think sc.textFile() should use CombineTextInputFormat instead
> of TextInputFormat?
>
> CombineTextInputFormat allows users to control number of partitions in
> RDD (control split size)
> It's useful for real workloads (e.g. 100 folders, 200,000 files, all files
> are different size, e.g. 100KB - 500MB, total 4TB)
>
> if we use current implementation of sc.textFile() it will generate RDD
> with 250,000+ partitions (one partition for each small file, several
> partitions for big files).
>
> Using CombineTextInputFormat allows us to control number of partitions and
> split size by setting mapreduce.input.fileinputformat.split.maxsize
> property. e.g. if we set it to 256MB spark will generate RDD with ~20,000
> partitions.
>
> It's better to have RDD with 20,000 partitions by 256MB than RDD with
> 250,000+ partition all different sizes from 100KB to 128MB
>
> So, I see only advantages if sc.textFile() starts using CombineTextInputFormat
> instead of TextInputFormat
>
> Alex
>
> On Thu, May 19, 2016 at 8:30 PM, Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
>> From my understanding I think newAPIHadoopFile or hadoopFIle is generic
>> enough for you to support any InputFormat you wanted. IMO it is not so
>> necessary to add a new API for this.
>>
>> On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Spark users might not know about CombineTextInputFormat. They probably
>>> think that sc.textFile already implements the best way to read text files.
>>>
>>> I think CombineTextInputFormat can replace regular TextInputFormat in
>>> most of the cases.
>>> Maybe Spark 2.0 can use CombineTextInputFormat in sc.textFile ?
>>> On May 19, 2016 2:43 AM, "Reynold Xin" <r...@databricks.com> wrote:
>>>
>>>> Users would be able to run this already with the 3 lines of code you
>>>> supplied right? In general there are a lot of methods already on
>>>> SparkContext and we lean towards the more conservative side in introducing
>>>> new API variants.
>>>>
>>>> Note that this is something we are doing automatically in Spark SQL for
>>>> file sources (Dataset/DataFrame).
>>>>
>>>>
>>>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>>>> apivova...@gmail.com> wrote:
>>>>
>>>>> Hello Everyone
>>>>>
>>>>> Do you think it would be useful to add combinedTextFile method (which
>>>>> uses CombineTextInputFormat) to SparkContext?
>>>>>
>>>>> It allows one task to read data from multiple text files and control
>>>>> number of RDD partitions by setting
>>>>> mapreduce.input.fileinputformat.split.maxsize
>>>>>
>>>>>
>>>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>>>> val conf = sc.hadoopConfiguration
>>>>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>>>> classOf[LongWritable], classOf[Text], conf).
>>>>>   map(pair => pair._2.toString).setName(path)
>>>>>   }
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>>
>>
>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Saisai Shao
From my understanding, newAPIHadoopFile or hadoopFile is generic enough for
you to support any InputFormat you want. IMO it is not so necessary to add a
new API for this.

On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov 
wrote:

> Spark users might not know about CombineTextInputFormat. They probably
> think that sc.textFile already implements the best way to read text files.
>
> I think CombineTextInputFormat can replace regular TextInputFormat in most
> of the cases.
> Maybe Spark 2.0 can use CombineTextInputFormat in sc.textFile ?
> On May 19, 2016 2:43 AM, "Reynold Xin"  wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied right? In general there are a lot of methods already on
>> SparkContext and we lean towards the more conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and control
>>> number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize
>>>
>>>
>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>> val conf = sc.hadoopConfiguration
>>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>> classOf[LongWritable], classOf[Text], conf).
>>>   map(pair => pair._2.toString).setName(path)
>>>   }
>>>
>>>
>>> Alex
>>>
>>
>>


Re: HDFS as Shuffle Service

2016-04-26 Thread Saisai Shao
Quite curious about the benefits of using HDFS as the shuffle service; also,
what's the problem with the current shuffle service?


Thanks
Saisai

On Wed, Apr 27, 2016 at 4:31 AM, Timothy Chen  wrote:

> Are you suggesting to have shuffle service persist and fetch data with
> hdfs, or skip shuffle service altogether and just write to hdfs?
>
> Tim
>
>
> > On Apr 26, 2016, at 11:20 AM, Michael Gummelt 
> wrote:
> >
> > Has there been any thought or work on this (or any other networked file
> system)?  It would be valuable to support dynamic allocation without
> depending on the shuffle service.
> >
> > --
> > Michael Gummelt
> > Software Engineer
> > Mesosphere
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: RFC: Remove "HBaseTest" from examples?

2016-04-20 Thread Saisai Shao
+1. HBaseTest in the Spark examples is quite old and obsolete, and the HBase
connector in the HBase repo has evolved a lot; it would be better to point
users to that rather than to the Spark example. So it is good to remove it.

Thanks
Saisai

On Wed, Apr 20, 2016 at 1:41 AM, Josh Rosen 
wrote:

> +1; I think that it's preferable for code examples, especially third-party
> integration examples, to live outside of Spark.
>
> On Tue, Apr 19, 2016 at 10:29 AM Reynold Xin  wrote:
>
>> Yea in general I feel examples that bring in a large amount of
>> dependencies should be outside Spark.
>>
>>
>> On Tue, Apr 19, 2016 at 10:15 AM, Marcelo Vanzin 
>> wrote:
>>
>>> Hey all,
>>>
>>> Two reasons why I think we should remove that from the examples:
>>>
>>> - HBase now has Spark integration in its own repo, so that really
>>> should be the template for how to use HBase from Spark, making that
>>> example less useful, even misleading.
>>>
>>> - It brings up a lot of extra dependencies that make the size of the
>>> Spark distribution grow.
>>>
>>> Any reason why we shouldn't drop that example?
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>


Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Saisai Shao
>>>By the way, some people noted that closing PRs may discourage
contributors. I think our open PR count alone is very discouraging. Under
what circumstances would you feel encouraged to open a PR against a project
that has hundreds of open PRs, some from many, many months ago
<https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-asc>
?

I think the original meaning of "discouraging contributors" is closing PRs
without specific technical reasons, or just for lack of bandwidth. These PRs
may not be so important to committers/maintainers, but for an individual
contributor, especially someone new to open source, a simple fix to a famous
project means a lot. We could also consider other solutions, such as setting
a higher bar beforehand to reduce the number of PRs.

Thanks
Jerry



On Tue, Apr 19, 2016 at 11:46 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Relevant: https://github.com/databricks/spark-pr-dashboard/issues/1
>
> A lot of this was discussed a while back when the PR Dashboard was first
> introduced, and several times before and after that as well. (e.g. August
> 2014
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Handling-stale-PRs-td8015.html>
> )
>
> If there is not enough momentum to build the tooling that people are
> discussing here, then perhaps Reynold's suggestion is the most practical
> one that is likely to see the light of day.
>
> I think asking committers to be more active in commenting on PRs is
> theoretically the correct thing to do, but impractical. I'm not a
> committer, but I would guess that most of them are already way
> overcommitted (ha!) and asking them to do more just won't yield results.
>
> We've had several instances in the past where we all tried to rally
> <https://mail-archives.apache.org/mod_mbox/spark-dev/201412.mbox/%3ccaohmdzer4cg_wxgktoxsg8s34krqezygjfzdoymgu9vhyjb...@mail.gmail.com%3E>
> and be more proactive about giving feedback, closing PRs, and nudging
> contributors who have gone silent. My observation is that the level of
> energy required to "properly" curate PR activity in that way is simply not
> sustainable. People can do it for a few weeks and then things revert to the
> way they are now.
>
> Perhaps the missing link that would make this sustainable is better
> tooling. If you think so and can sling some Javascript, you might want to
> contribute to the PR Dashboard <https://spark-prs.appspot.com/>.
>
> Perhaps the missing link is something else: A different PR review process;
> more committers; a higher barrier to contributing; a combination thereof;
> etc...
>
> Also relevant: http://danluu.com/discourage-oss/
>
> By the way, some people noted that closing PRs may discourage
> contributors. I think our open PR count alone is very discouraging. Under
> what circumstances would you feel encouraged to open a PR against a project
> that has hundreds of open PRs, some from many, many months ago
> <https://github.com/apache/spark/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-asc>
> ?
>
> Nick
>
>
> On Mon, Apr 18, 2016 at 10:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> During the months of November / December, the 30 day period should be
>> relaxed.
>>
>> Some people(at least in US) may take extended vacation during that time.
>>
>> For Chinese developers, Spring Festival would bear similar circumstance.
>>
>> On Mon, Apr 18, 2016 at 7:25 PM, Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> I also think this might not have to be closed only because it is
>>> inactive.
>>>
>>>
>>> How about closing issues after 30 days when a committer's comment is
>>> added at the last without responses from the author?
>>>
>>>
>>> IMHO, If the committers are not sure whether the patch would be useful,
>>> then I think they should leave some comments why they are not sure, not
>>> just ignoring.
>>>
>>> Or, simply they could ask the author to prove that the patch is useful
>>> or safe with some references and tests.
>>>
>>>
>>> I think it might be nicer than that users are supposed to keep pinging.
>>> **Personally**, apparently, I am sometimes a bit worried if pinging
>>> multiple times can be a bit annoying.
>>>
>>>
>>>
>>> 2016-04-19 9:56 GMT+09:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>
>>>> It would be better to have a specific technical reason why this PR
>>>> should be closed, either the implementation is not good or the problem is
>>>> not valid, or something else. That will actually help the contributor to
>

Re: auto closing pull requests that have been inactive > 30 days?

2016-04-18 Thread Saisai Shao
It would be better to have a specific technical reason why a PR should be
closed: either the implementation is not good, or the problem is not valid,
or something else. That will actually help the contributor shape their code
and reopen the PR. Otherwise, a reason like "feel free to reopen for
so-and-so reason" is actually discouraging and no different from directly
closing the PR.

Just my two cents.

Thanks
Jerry


On Tue, Apr 19, 2016 at 4:52 AM, Sean Busbey  wrote:

> Having a PR closed, especially if due to committers not having hte
> bandwidth to check on things, will be very discouraging to new folks.
> Doubly so for those inexperienced with opensource. Even if the message
> says "feel free to reopen for so-and-so reason", new folks who lack
> confidence are going to see reopening as "pestering" and busy folks
> are going to see it as a clear indication that their work is not even
> valuable enough for a human to give a reason for closing. In either
> case, the cost of reopening is substantially higher than that button
> press.
>
> How about we start by keeping a report of "at-risk" PRs that have been
> stale for 30 days to make it easier for committers to look at the prs
> that have been long inactive?
>
> On Mon, Apr 18, 2016 at 2:52 PM, Reynold Xin  wrote:
> > The cost of "reopen" is close to zero, because it is just clicking a
> button.
> > I think you were referring to the cost of closing the pull request, and
> you
> > are assuming people look at the pull requests that have been inactive
> for a
> > long time. That seems equally likely (or unlikely) as committers looking
> at
> > the recently closed pull requests.
> >
> > In either case, most pull requests are scanned through by us when they
> are
> > first open, and if they are important enough, usually they get merged
> > quickly or a target version is set in JIRA. We can definitely improve
> that
> > by making it more explicit.
> >
> >
> >
> > On Mon, Apr 18, 2016 at 12:46 PM, Ted Yu  wrote:
> >>
> >> From committers' perspective, would they look at closed PRs ?
> >>
> >> If not, the cost is not close to zero.
> >> Meaning, some potentially useful PRs would never see the light of day.
> >>
> >> My two cents.
> >>
> >> On Mon, Apr 18, 2016 at 12:43 PM, Reynold Xin 
> wrote:
> >>>
> >>> Part of it is how difficult it is to automate this. We can build a
> >>> perfect engine with a lot of rules that understand everything. But the
> more
> >>> complicated rules we need, the more unlikely for any of these to
> happen. So
> >>> I'd rather do this and create a nice enough message to tell
> contributors
> >>> sometimes mistake happen but the cost to reopen is approximately zero
> (i.e.
> >>> click a button on the pull request).
> >>>
> >>>
> >>> On Mon, Apr 18, 2016 at 12:41 PM, Ted Yu  wrote:
> 
>  bq. close the ones where they don't respond for a week
> 
>  Does this imply that the script understands response from human ?
> 
>  Meaning, would the script use some regex which signifies that the
>  contributor is willing to close the PR ?
> 
>  If the contributor is willing to close, why wouldn't he / she do it
>  him/herself ?
> 
>  On Mon, Apr 18, 2016 at 12:33 PM, Holden Karau 
>  wrote:
> >
> > Personally I'd rather err on the side of keeping PRs open, but I
> > understand wanting to keep the open PRs limited to ones which have a
> > reasonable chance of being merged.
> >
> > What about if we filtered for non-mergeable PRs or instead left a
> > comment asking the author to respond if they are still available to
> move the
> > PR forward - and close the ones where they don't respond for a week?
> >
> > Just a suggestion.
> > On Monday, April 18, 2016, Ted Yu  wrote:
> >>
> >> I had one PR which got merged after 3 months.
> >>
> >> If the inactivity was due to contributor, I think it can be closed
> >> after 30 days.
> >> But if the inactivity was due to lack of review, the PR should be
> kept
> >> open.
> >>
> >> On Mon, Apr 18, 2016 at 12:17 PM, Cody Koeninger <
> c...@koeninger.org>
> >> wrote:
> >>>
> >>> For what it's worth, I have definitely had PRs that sat inactive
> for
> >>> more than 30 days due to committers not having time to look at
> them,
> >>> but did eventually end up successfully being merged.
> >>>
> >>> I guess if this just ends up being a committer ping and reopening
> the
> >>> PR, it's fine, but I don't know if it really addresses the
> underlying
> >>> issue.
> >>>
> >>> On Mon, Apr 18, 2016 at 2:02 PM, Reynold Xin 
> >>> wrote:
> >>> > We have hit a new high in open pull requests: 469 today. While we
> >>> > can
> >>> > certainly get more review bandwidth, 

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
So I think a ramdisk is a simple way to try.

Besides, I think Reynold's suggestion is quite valid: with such a high-end
machine, putting everything in memory might not improve performance as much
as assumed, since the bottleneck will shift to things like memory bandwidth,
NUMA, and CPU efficiency (serialization/deserialization, data processing,
...). Code design should also take such a usage scenario into account to use
resources more efficiently.

Thanks
Saisai

On Sat, Apr 2, 2016 at 7:27 AM, Michael Slavitch <slavi...@gmail.com> wrote:

> Yes we see it on final write.  Our preference is to eliminate this.
>
>
> On Fri, Apr 1, 2016, 7:25 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
>> Hi Michael, shuffle data (mapper output) have to be materialized into
>> disk finally, no matter how large memory you have, it is the design purpose
>> of Spark. In you scenario, since you have a big memory, shuffle spill
>> should not happen frequently, most of the disk IO you see might be final
>> shuffle file write.
>>
>> So if you want to avoid this disk IO, you could use ramdisk as Reynold
>> suggested. If you want to avoid FS overhead of ramdisk, you could try to
>> hack a new shuffle implementation, since shuffle framework is pluggable.
>>
>>
>> On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch <slavi...@gmail.com>
>> wrote:
>>
>>> As I mentioned earlier this flag is now ignored.
>>>
>>>
>>> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch <slavi...@gmail.com>
>>> wrote:
>>>
>>>> Shuffling a 1tb set of keys and values (aka sort by key)  results in
>>>> about 500gb of io to disk if compression is enabled. Is there any way to
>>>> eliminate shuffling causing io?
>>>>
>>>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> Michael - I'm not sure if you actually read my email, but spill has
>>>>> nothing to do with the shuffle files on disk. It was for the partitioning
>>>>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>>>>> memory when data doesn't fit in memory.
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch <slavi...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> RAMdisk is a fine interim step but there is a lot of layers
>>>>>> eliminated by keeping things in memory unless there is need for 
>>>>>> spillover.
>>>>>>   At one time there was support for turning off spilling.  That was
>>>>>> eliminated.  Why?
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan <mri...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think Reynold's suggestion of using ram disk would be a good way to
>>>>>>> test if these are the bottlenecks or something else is.
>>>>>>> For most practical purposes, pointing local dir to ramdisk should
>>>>>>> effectively give you 'similar' performance as shuffling from memory.
>>>>>>>
>>>>>>> Are there concerns with taking that approach to test ? (I dont see
>>>>>>> any, but I am not sure if I missed something).
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch <slavi...@gmail.com>
>>>>>>> wrote:
>>>>>>> > I totally disagree that it’s not a problem.
>>>>>>> >
>>>>>>> > - Network fetch throughput on 40G Ethernet exceeds the throughput
>>>>>>> of NVME
>>>>>>> > drives.
>>>>>>> > - What Spark is depending on is Linux’s IO cache as an effective
>>>>>>> buffer pool
>>>>>>> > This is fine for small jobs but not for jobs with datasets in the
>>>>>>> TB/node
>>>>>>> > range.
>>>>>>> > - On larger jobs flushing the cache causes Linux to block.
>>>>>>> > - On a modern 56-hyperthread 2-socket host the latency caused by
>>>>>>> multiple
>>>>>>> > executors writing out to disk increases greatly.
>>>>>>> &

Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Saisai Shao
Hi Michael, shuffle data (the mapper output) has to be materialized to disk
eventually, no matter how much memory you have; that is by design in Spark.
In your scenario, since you have a lot of memory, shuffle spill should not
happen frequently; most of the disk IO you see is probably the final shuffle
file write.

So if you want to avoid this disk IO, you could use a ramdisk as Reynold
suggested. If you want to avoid the FS overhead of a ramdisk, you could try
to hack up a new shuffle implementation, since the shuffle framework is
pluggable.
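A rough sketch of the ramdisk route (the mount point and size are only
illustrative, and on YARN the local dirs come from
yarn.nodemanager.local-dirs instead):

    # mount a tmpfs on every node and point Spark's local dirs at it
    sudo mount -t tmpfs -o size=512g tmpfs /mnt/spark-ramdisk

    # spark-defaults.conf
    spark.local.dir    /mnt/spark-ramdisk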


On Sat, Apr 2, 2016 at 6:48 AM, Michael Slavitch  wrote:

> As I mentioned earlier this flag is now ignored.
>
>
> On Fri, Apr 1, 2016, 6:39 PM Michael Slavitch  wrote:
>
>> Shuffling a 1tb set of keys and values (aka sort by key)  results in
>> about 500gb of io to disk if compression is enabled. Is there any way to
>> eliminate shuffling causing io?
>>
>> On Fri, Apr 1, 2016, 6:32 PM Reynold Xin  wrote:
>>
>>> Michael - I'm not sure if you actually read my email, but spill has
>>> nothing to do with the shuffle files on disk. It was for the partitioning
>>> (i.e. sorting) process. If that flag is off, Spark will just run out of
>>> memory when data doesn't fit in memory.
>>>
>>>
>>> On Fri, Apr 1, 2016 at 3:28 PM, Michael Slavitch 
>>> wrote:
>>>
 RAMdisk is a fine interim step but there is a lot of layers eliminated
 by keeping things in memory unless there is need for spillover.   At one
 time there was support for turning off spilling.  That was eliminated.
 Why?


 On Fri, Apr 1, 2016, 6:05 PM Mridul Muralidharan 
 wrote:

> I think Reynold's suggestion of using ram disk would be a good way to
> test if these are the bottlenecks or something else is.
> For most practical purposes, pointing local dir to ramdisk should
> effectively give you 'similar' performance as shuffling from memory.
>
> Are there concerns with taking that approach to test ? (I dont see
> any, but I am not sure if I missed something).
>
>
> Regards,
> Mridul
>
>
>
>
> On Fri, Apr 1, 2016 at 2:10 PM, Michael Slavitch 
> wrote:
> > I totally disagree that it’s not a problem.
> >
> > - Network fetch throughput on 40G Ethernet exceeds the throughput of
> NVME
> > drives.
> > - What Spark is depending on is Linux’s IO cache as an effective
> buffer pool
> > This is fine for small jobs but not for jobs with datasets in the
> TB/node
> > range.
> > - On larger jobs flushing the cache causes Linux to block.
> > - On a modern 56-hyperthread 2-socket host the latency caused by
> multiple
> > executors writing out to disk increases greatly.
> >
> > I thought the whole point of Spark was in-memory computing?  It’s in
> fact
> > in-memory for some things but  use spark.local.dir as a buffer pool
> of
> > others.
> >
> > Hence, the performance of  Spark is gated by the performance of
> > spark.local.dir, even on large memory systems.
> >
> > "Currently it is not possible to not write shuffle files to disk.”
> >
> > What changes >would< make it possible?
> >
> > The only one that seems possible is to clone the shuffle service and
> make it
> > in-memory.
> >
> >
> >
> >
> >
> > On Apr 1, 2016, at 4:57 PM, Reynold Xin  wrote:
> >
> > spark.shuffle.spill actually has nothing to do with whether we write
> shuffle
> > files to disk. Currently it is not possible to not write shuffle
> files to
> > disk, and typically it is not a problem because the network fetch
> throughput
> > is lower than what disks can sustain. In most cases, especially with
> SSDs,
> > there is little difference between putting all of those in memory
> and on
> > disk.
> >
> > However, it is becoming more common to run Spark on a few number of
> beefy
> > nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
> improving
> > performance for those. Meantime, you can setup local ramdisks on
> each node
> > for shuffle writes.
> >
> >
> >
> > On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <
> slavi...@gmail.com>
> > wrote:
> >>
> >> Hello;
> >>
> >> I’m working on spark with very large memory systems (2TB+) and
> notice that
> >> Spark spills to disk in shuffle.  Is there a way to force spark to
> stay in
> >> memory when doing shuffle operations?   The goal is to keep the
> shuffle data
> >> either in the heap or in off-heap memory (in 1.6.x) and never touch
> the IO
> >> subsystem.  I am willing to have the job fail if it runs out of RAM.
> >>
> >> spark.shuffle.spill true  is deprecated in 1.6 and does 

Re: Dynamic allocation availability on standalone mode. Misleading doc.

2016-03-07 Thread Saisai Shao
Yes, we need to fix the documentation.

On Tue, Mar 8, 2016 at 9:07 AM, Mark Hamstra 
wrote:

> Yes, it works in standalone mode.
>
> On Mon, Mar 7, 2016 at 4:25 PM, Eugene Morozov  > wrote:
>
>> Hi, the feature looks like the one I'd like to use, but there are two
>> different descriptions in the docs of whether it's available.
>>
>> I'm on a standalone deployment mode and here:
>> http://spark.apache.org/docs/latest/configuration.html it's specified
>> the feature is available only for YARN, but here:
>> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>> it says it's available on all coarse-grained cluster managers including
>> standalone.
>>
>> So, is the feature available in standalone mode?
>>
>> Thank you.
>> --
>> Be well!
>> Jean Morozov
>>
>
>


Re: sbt publish-local fails with 2.0.0-SNAPSHOT

2016-02-01 Thread Saisai Shao
I think it is due to our recent change that overrides the external resolvers
in the sbt build profile. I just created a JIRA (
https://issues.apache.org/jira/browse/SPARK-13109) to track this.


On Mon, Feb 1, 2016 at 3:01 PM, Mike Hynes <91m...@gmail.com> wrote:

> Hi devs,
>
> I used to be able to do some local development from the upstream
> master branch and run the publish-local command in an sbt shell to
> publish the modified jars to the local ~/.ivy2 repository.
>
> I relied on this behaviour, since I could write other local packages
> that had my local 1.X.0-SNAPSHOT dependencies in the build.sbt file,
> such that I could run distributed tests from outside the spark source.
>
> However, having just pulled from the upstream master on
> 2.0.0-SNAPSHOT, I can *not* run publish-local with sbt, with the
> following error messages:
>
> [...]
> java.lang.RuntimeException: Undefined resolver 'local'
> at scala.sys.package$.error(package.scala:27)
> at sbt.IvyActions$$anonfun$publish$1.apply(IvyActions.scala:120)
> at sbt.IvyActions$$anonfun$publish$1.apply(IvyActions.scala:117)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:155)
> at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:132)
> at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
> at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
> at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
> at
> xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
> at
> xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
> at xsbt.boot.Using$.withResource(Using.scala:10)
> at xsbt.boot.Using$.apply(Using.scala:9)
> at
> xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
> at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
> at xsbt.boot.Locks$.apply0(Locks.scala:31)
> at xsbt.boot.Locks$.apply(Locks.scala:28)
> at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
> at sbt.IvySbt.withIvy(Ivy.scala:127)
> at sbt.IvySbt.withIvy(Ivy.scala:124)
> at sbt.IvySbt$Module.withModule(Ivy.scala:155)
> at sbt.IvyActions$.publish(IvyActions.scala:117)
> at sbt.Classpaths$$anonfun$publishTask$1.apply(Defaults.scala:1298)
> at sbt.Classpaths$$anonfun$publishTask$1.apply(Defaults.scala:1297)
> at scala.Function3$$anonfun$tupled$1.apply(Function3.scala:35)
> at scala.Function3$$anonfun$tupled$1.apply(Function3.scala:34)
> at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
> at
> sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
> at sbt.std.Transform$$anon$4.work(System.scala:63)
> at
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
> at
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
> at sbt.Execute.work(Execute.scala:235)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
> at
> sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
> at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> [...]
> [error] (spark/*:publishLocal) Undefined resolver 'local'
> [error] (hive/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-kafka-assembly/*:publishLocal) Undefined resolver
> 'local'
> [error] (unsafe/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-twitter/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-flume/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-kafka/*:publishLocal) Undefined resolver 'local'
> [error] (catalyst/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-akka/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-flume-sink/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-zeromq/*:publishLocal) Undefined resolver 'local'
> [error] (test-tags/*:publishLocal) Undefined resolver 'local'
> [error] (launcher/*:publishLocal) Undefined resolver 'local'
> [error] (network-shuffle/*:publishLocal) Undefined resolver 'local'
> [error] (streaming-mqtt-assembly/*:publishLocal) Undefined resolver 'local'
> [error] (assembly/*:publishLocal) 

Re: spark with label nodes in yarn

2015-12-15 Thread Saisai Shao
SPARK-6470 only supports node label expressions for executors.
SPARK-7173 supports a node label expression for the AM (it will be in 1.6).

If you want to schedule your whole application through label expressions, you
have to configure both the AM and the executor label expressions. If you only
want to schedule executors through label expressions, the executor
configuration is enough, but you have to make sure your cluster has some
nodes with no label.

You can refer to this document (
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/configuring_node_labels.html
).
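For reference, the two Spark-side settings look like this (a sketch;
"my_label" is just an illustrative label name, and the AM setting needs 1.6+):

    # executors (SPARK-6470)
    spark.yarn.executor.nodeLabelExpression    my_label
    # application master (SPARK-7173, 1.6+)
    spark.yarn.am.nodeLabelExpression          my_label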

Thanks
Saisai


On Tue, Dec 15, 2015 at 5:55 PM, 张志强(旺轩)  wrote:

> Hi Ted,
>
>
>
> Thanks for your quick response, but I think the link you gave it to me is
> more advanced feature.
>
> Yes, I noticed SPARK-6470(https://issues.apache.org/jira/browse/SPARK-6470)
>
>
> And I just tried for this feature with spark 1.5.0, what happened to me
> was I was blocked to get the YARN containers by setting
> spark.yarn.executor.nodeLabelExpression property. My question,
> https://issues.apache.org/jira/browse/SPARK-7173 will fix this?
>
>
>
> Thanks
>
> Allen
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* December 15, 2015, 17:39
> *To:* 张志强(旺轩)
> *Cc:* dev@spark.apache.org
> *Subject:* Re: spark with label nodes in yarn
>
>
>
> Please take a look at:
>
> https://issues.apache.org/jira/browse/SPARK-7173
>
>
>
> Cheers
>
>
> On Dec 15, 2015, at 1:23 AM, 张志强(旺轩)  wrote:
>
> Hi all,
>
>
>
> Has anyone tried label based scheduling via spark on yarn? I’ve tried
> that, it didn’t work, spark 1.4.1 + apache hadoop 2.6.0
>
>
>
> Any feedbacks are welcome.
>
>
>
> Thanks
>
> Allen
>
>


Re: spark with label nodes in yarn

2015-12-15 Thread Saisai Shao
Yes, of course, capacity scheduler also needs to be configured.
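A minimal sketch of the capacity-scheduler.xml side, assuming a single
"default" queue and an illustrative label named "my_label" (the exact set of
properties depends on your queue hierarchy; see the node-labels document
linked earlier in this thread):

    <property>
      <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
      <value>my_label</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.default.accessible-node-labels.my_label.capacity</name>
      <value>100</value>
    </property>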

On Wed, Dec 16, 2015 at 10:41 AM, 张志强(旺轩) <zzq98...@alibaba-inc.com> wrote:

> one more question , do I have to configure label for my capacity
> scheduler? is this mandatory?
>
>
>
> *From:* AllenZ [mailto:zzq98...@alibaba-inc.com]
> *Sent:* December 16, 2015, 9:21
> *To:* 'Ted Yu'
> *Cc:* 'Saisai Shao'; 'dev'
> *Subject:* Re: spark with label nodes in yarn
>
>
>
> Oops...
>
>
>
> I do use spark 1.5.0 and apache hadoop 2.6.0 (spark 1.4.1 + apache hadoop
> 2.6.0 is a typo), sorry
>
>
>
> Thanks,
>
> Allen
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* December 15, 2015, 22:59
> *To:* 张志强(旺轩)
> *Cc:* Saisai Shao; dev
> *Subject:* Re: spark with label nodes in yarn
>
>
>
> Please upgrade to Spark 1.5.x
>
>
>
> 1.4.1 didn't support node label feature.
>
>
>
> Cheers
>
>
>
> On Tue, Dec 15, 2015 at 2:20 AM, 张志强(旺轩) <zzq98...@alibaba-inc.com> wrote:
>
> Hi SaiSai,
>
>
>
> OK, it make sense to me , what I need is just to schedule the executors,
> AND I leave one nodemanager at least with no any labels.
>
>
>
> It’s weird to me that YARN page shows my application is running, but
> actually it is still waiting for its executor
>
>
>
> See the attached.
>
>
>
> Thanks,
>
> Allen
>
>
>
> *From:* Saisai Shao [mailto:sai.sai.s...@gmail.com]
> *Sent:* December 15, 2015, 18:07
> *To:* 张志强(旺轩)
> *Cc:* Ted Yu; dev
>
> *Subject:* Re: spark with label nodes in yarn
>
>
>
> SPARK-6470 only supports node label expression for executors.
>
> SPARK-7173 supports node label expression for AM (will be in 1.6).
>
>
>
> If you want to schedule your whole application through label expression,
> you have to configure both am and executor label expression. If you only
> want to schedule executors through label expression, the executor
> configuration is enough, but you have to make sure your cluster has some
> nodes with no label.
>
>
>
> You can refer to this document (
> http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/configuring_node_labels.html
> ).
>
>
>
> Thanks
>
> Saisai
>
>
>
>
>
> On Tue, Dec 15, 2015 at 5:55 PM, 张志强(旺轩) <zzq98...@alibaba-inc.com> wrote:
>
> Hi Ted,
>
>
>
> Thanks for your quick response, but I think the link you gave it to me is
> more advanced feature.
>
> Yes, I noticed SPARK-6470(https://issues.apache.org/jira/browse/SPARK-6470)
>
>
> And I just tried for this feature with spark 1.5.0, what happened to me
> was I was blocked to get the YARN containers by setting
> spark.yarn.executor.nodeLabelExpression property. My question,
> https://issues.apache.org/jira/browse/SPARK-7173 will fix this?
>
>
>
> Thanks
>
> Allen
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* December 15, 2015, 17:39
> *To:* 张志强(旺轩)
> *Cc:* dev@spark.apache.org
> *Subject:* Re: spark with label nodes in yarn
>
>
>
> Please take a look at:
>
> https://issues.apache.org/jira/browse/SPARK-7173
>
>
>
> Cheers
>
>
> On Dec 15, 2015, at 1:23 AM, 张志强(旺轩) <zzq98...@alibaba-inc.com> wrote:
>
> Hi all,
>
>
>
> Has anyone tried label based scheduling via spark on yarn? I’ve tried
> that, it didn’t work, spark 1.4.1 + apache hadoop 2.6.0
>
>
>
> Any feedbacks are welcome.
>
>
>
> Thanks
>
> Allen
>
>
>
>
>


Re: tests blocked at "don't call ssc.stop in listener"

2015-11-26 Thread Saisai Shao
Might be related to this JIRA (
https://issues.apache.org/jira/browse/SPARK-11761), not very sure about it.

On Fri, Nov 27, 2015 at 10:22 AM, Nan Zhu  wrote:

> Hi, all
>
> Anyone noticed that some of the tests just blocked at the test case “don't
> call ssc.stop in listener” in StreamingListenerSuite?
>
> Examples:
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46766/console
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46776/console
>
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46774/console
>
>
> I originally found it in my own PR, and I thought it is a bug introduced
> by me….but later I found that the tests for the PRs on different things
> also blocked at the same point …
>
> Just filed a JIRA https://issues.apache.org/jira/browse/SPARK-12021
>
>
> Best,
>
> --
> Nan Zhu
> http://codingcat.me
>
>


Re: Dropping support for earlier Hadoop versions in Spark 2.0?

2015-11-20 Thread Saisai Shao
+1.

Hadoop 2.6 would be a good choice, with many features added (such as support
for long-running services and label-based scheduling). Currently there is a
lot of reflection code to support multiple versions of YARN, so upgrading to
a newer version will really ease the pain :).

Thanks
Saisai

On Fri, Nov 20, 2015 at 3:58 PM, Jean-Baptiste Onofré 
wrote:

> +1
>
> Regards
> JB
>
>
> On 11/19/2015 11:14 PM, Reynold Xin wrote:
>
>> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
>> think everybody is for that.
>>
>> https://issues.apache.org/jira/browse/SPARK-11807
>>
>> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That
>> is to say, keep only Hadoop 2.6 and greater.
>>
>> What are the community's thoughts on that?
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Saisai Shao
Kafka now has built-in support for managing offset metadata itself, besides
ZK; it is easy to use and to switch to from the current ZK-based
implementation. I think the question here is whether we should manage offsets
at the Spark Streaming level or leave that to the user.

If you want to manage offsets at the user level, with Spark offering a
convenient API, I think Cody's patch (
https://issues.apache.org/jira/browse/SPARK-10963) could satisfy your needs.

If you would like Spark Streaming to manage offsets for you (transparently to
the user), I had a PR for that before, but the community is inclined to leave
this to the user level.
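For the user-level route, the usual pattern with the direct stream looks
roughly like this (a sketch; directKafkaStream and whatever store you persist
the offsets to are up to the application):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directKafkaStream.foreachRDD { rdd =>
      // each KafkaRDD carries the offset ranges it read
      val offsetRanges: Array[OffsetRange] =
        rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the RDD ...
      // then persist the ranges yourself (ZK, Kafka's offset API, a DB, ...)
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
      }
    }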

On Tue, Nov 17, 2015 at 9:27 AM, Nick Evans  wrote:

> The only dependancy on Zookeeper I see is here:
> https://github.com/apache/spark/blob/1c5475f1401d2233f4c61f213d1e2c2ee9673067/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L244-L247
>
> If that's the only line that depends on Zookeeper, we could probably try
> to implement an abstract offset manager that could be switched out in
> favour of the new offset management system, yes? I
> know kafka.consumer.Consumer currently depends on Zookeeper, but I'm
> guessing this library will eventually be updated to use the new method.
>
> On Mon, Nov 16, 2015 at 5:28 PM, Cody Koeninger 
> wrote:
>
>> There are already private methods in the code for interacting with
>> Kafka's offset management api.
>>
>> There's a jira for making those methods public, but TD has been reluctant
>> to merge it
>>
>> https://issues.apache.org/jira/browse/SPARK-10963
>>
>> I think adding any ZK specific behavior to spark is a bad idea, since ZK
>> may no longer be the preferred storage location for Kafka offsets within
>> the next year.
>>
>>
>>
>> On Mon, Nov 16, 2015 at 9:53 AM, Nick Evans  wrote:
>>
>>> I really like the Streaming receiverless API for Kafka streaming jobs,
>>> but I'm finding the manual offset management adds a fair bit of complexity.
>>> I'm sure that others feel the same way, so I'm proposing that we add the
>>> ability to have consumer offsets managed via an easy-to-use API. This would
>>> be done similarly to how it is done in the receiver API.
>>>
>>> I haven't written any code yet, but I've looked at the current version
>>> of the codebase and have an idea of how it could be done.
>>>
>>> To keep the size of the pull requests small, I propose that the
>>> following distinct features are added in order:
>>>
>>>1. If a group ID is set in the Kafka params, and also if fromOffsets
>>>is not passed in to createDirectStream, then attempt to resume from the
>>>remembered offsets for that group ID.
>>>2. Add a method on KafkaRDDs that commits the offsets for that
>>>KafkaRDD to Zookeeper.
>>>3. Update the Python API with any necessary changes.
>>>
>>> My goal is to not break the existing API while adding the new
>>> functionality.
>>>
>>> One point that I'm not sure of is regarding the first point. I'm not
>>> sure whether it's a better idea to set the group ID as mentioned through
>>> Kafka params, or to define a new overload of createDirectStream that
>>> expects the group ID in place of the fromOffsets param. I think the latter
>>> is a cleaner interface, but I'm not sure whether adding a new param is a
>>> good idea.
>>>
>>> If anyone has any feedback on this general approach, I'd be very
>>> grateful. I'm going to open a JIRA in the next couple days and begin
>>> working on the first point, but I think comments from the community would
>>> be very helpful on building a good API here.
>>>
>>>
>>
>
>
> --
> *Nick Evans* 
> P. (613) 793-5565
> LinkedIn  | Website 
>
>


Re: Spark driver reducing total executors count even when Dynamic Allocation is disabled.

2015-10-20 Thread Saisai Shao
Hi Prakhar,

Now I understand your problem: you expected that an executor killed through
the heartbeat mechanism would be launched again, but it is not. I think this
problem is fixed in Spark 1.5; you could check this JIRA:
https://issues.apache.org/jira/browse/SPARK-8119

Thanks
Saisai

On Tuesday, October 20, 2015, prakhar jauhari <prak...@gmail.com> wrote:

> Thanks sai for the input,
>
> So the problem is : i start my job with some fixed number of executors,
> but when a host running my executors goes unreachable, driver reduces the
> total number of executors. And never increases it.
>
> I have a repro for the issue, attaching logs:
>  Running spark job is configured for 2 executors, dynamic allocation
> not enabled !!!
>
> AM starts requesting the 2 executors:
> 15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster
> 15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor
> containers, each with 1 cores and 1408 MB memory including 384 MB overhead
> 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: <memory:1408, vCores:1>)
> 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: <memory:1408, vCores:1>)
> 15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter
> thread - sleep time : 5000
>
> Executors launched:
> 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for :
> DN-2:58739
> 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for :
> DN-1:44591
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container
> container_1444841612643_0014_01_02 for on host DN-2
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler,
> executorHostname: DN-2
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container
> container_1444841612643_0014_01_03 for on host DN-1
> 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler,
> executorHostname: DN-1
>
> Now my AM and executor 1 are running on DN-2, and DN-1 has executor 2
> running on it. To reproduce the issue I removed the IP from DN-1 until it
> was timed out by Spark.
> 15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number
> of 1 executor(s).
> 15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill
> executor(s) 2.
>
>
> So the driver has reduced the total number of executors to 1, and even
> when the DN comes back up and rejoins the cluster, this count is not
> increased. If executor 1 had been running on a separate DN (not the same
> as the AM's DN) and that DN went unreachable, the driver would reduce the
> total number of executors to 0 and the job would hang forever. And this is
> with dynamic allocation not enabled. My cluster has other DNs available;
> the AM should request replacements for the killed executors from YARN and
> get them on other DNs.
>
> Regards,
> Prakhar
>
>
> On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
>> This is a deliberate kill request from the heartbeat mechanism and has
>> nothing to do with dynamic allocation. Because you're running in YARN
>> mode, "supportDynamicAllocation" will be true, but there is actually no
>> relation to dynamic allocation.
>>
>> From my understanding, "doRequestTotalExecutors" syncs the current total
>> executor number with the AM; the AM will try to cancel some pending
>> container requests when the expected executor count is lower. The actual
>> container killing command is issued by "doRequestTotalExecutors".
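>>
>> As a side note for tuning, here is a minimal sketch of the settings that
>> usually govern when the heartbeat mechanism declares an executor lost.
>> The property names exist in recent 1.x releases, but defaults and
>> accepted formats vary by version, so treat the values as illustrative
>> only:
>>
>>    import org.apache.spark.SparkConf
>>
>>    val conf = new SparkConf()
>>      .setAppName("heartbeat-tuning-sketch")
>>      // How often executors send heartbeats to the driver (illustrative value).
>>      .set("spark.executor.heartbeatInterval", "10000")
>>      // How long the driver waits without a heartbeat before treating the
>>      // executor as lost (illustrative value, in seconds here).
>>      .set("spark.network.timeout", "120")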
>>
>> I'm not sure what your actual problem is. Is this behavior unexpected?
>>
>> Thanks
>> Saisai
>>
>>
>> On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari <prak...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> Thanks in advance. I ran into a situation where the Spark driver reduced
>>> the total executor count for my job even with dynamic allocation
>>> disabled, which caused the job to hang forever.
>>>
>>> Setup:
>>> Spark-1.3.1 on hadoop-yarn-2.4.0 cluster.
>>> All servers in cluster running Linux version 2.6.32.
>>> Job in yarn-client mode.
>>>
>>> Scenario:
>>> 1. Application running with required number of executors.
>>> 2. One of the DNs loses connectivity and is timed out.

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-12 Thread Saisai Shao
The Scala KafkaRDD uses a trait to handle this, but that approach is not as
easy or straightforward in Python, where we would need a specific API for
it. I'm not sure whether there is any simple workaround; maybe we should
think carefully about it.
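
For reference, the trait in question on the Scala side is HasOffsetRanges,
and the documented pattern looks roughly like the sketch below
(directKafkaStream stands for the DStream returned by
KafkaUtils.createDirectStream):

   import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

   var offsetRanges = Array.empty[OffsetRange]

   directKafkaStream.foreachRDD { rdd =>
     // The underlying KafkaRDD mixes in HasOffsetRanges, so the ranges can
     // be recovered by casting.
     offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
     offsetRanges.foreach { o =>
       println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
     }
   }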

2015-06-12 13:59 GMT+08:00 Amit Ramesh a...@yelp.com:


 Thanks, Jerry. That's what I suspected based on the code I looked at. Any
 pointers on what is needed to build in this support would be great. This is
 critical to the project we are currently working on.

 Thanks!


 On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 OK, I get it. I think the current Python-based Kafka direct API does not
 provide an equivalent of the Scala one; maybe we should figure out how to
 add this to the Python API as well.

 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:


 Hi Jerry,

 Take a look at this example:
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2

 The offsets are needed because, as RDDs get generated within Spark, the
 offsets move further along. With direct Kafka mode the current offsets are
 no longer persisted in ZooKeeper but rather within Spark itself. If you
 want to be able to use ZooKeeper-based monitoring tools to keep track of
 progress, then this is needed.

 In my specific case we need to persist Kafka offsets externally so that
 we can continue from where we left off after a code deployment. In other
 words, we need exactly-once processing guarantees across code deployments.
 Spark does not support any state persistence across deployments so this is
 something we need to handle on our own.
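
 A rough sketch of that pattern, in Scala since that is where the API exists
 today (saveOffsetsToExternalStore is a hypothetical helper standing in for
 whatever external store is used, and directKafkaStream is the DStream from
 KafkaUtils.createDirectStream):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Hypothetical helper: persist the offsets, ideally in the same
    // transaction as the batch output to get exactly-once semantics.
    def saveOffsetsToExternalStore(ranges: Array[OffsetRange]): Unit = ???

    directKafkaStream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Process/write the batch first...
      rdd.foreachPartition { iter => iter.foreach(record => () /* write output */) }
      // ...then record how far we got.
      saveOffsetsToExternalStore(ranges)
    }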

 Hope that helps. Let me know if not.

 Thanks!
 Amit


 On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Hi,

 What do you mean by getting the offsets from the RDD? From my
 understanding, the offsetRange is a parameter you passed to KafkaRDD, so
 why do you want to get back the values you set earlier?

 Thanks
 Jerry

 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:


 Congratulations on the release of 1.4!

 I have been trying out the direct Kafka support in python but haven't
 been able to figure out how to get the offsets from the RDD. Looks like 
 the
 documentation is yet to be updated to include Python examples (
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
 I am specifically looking for the equivalent of
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
 I tried digging through the python code but could not find anything
 related. Any pointers would be greatly appreciated.

 Thanks!
 Amit








Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
OK, I get it. I think the current Python-based Kafka direct API does not
provide an equivalent of the Scala one; maybe we should figure out how to
add this to the Python API as well.

2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com:


 Hi Jerry,

 Take a look at this example:
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2

 The offsets are needed because, as RDDs get generated within Spark, the
 offsets move further along. With direct Kafka mode the current offsets are
 no longer persisted in ZooKeeper but rather within Spark itself. If you
 want to be able to use ZooKeeper-based monitoring tools to keep track of
 progress, then this is needed.

 In my specific case we need to persist Kafka offsets externally so that we
 can continue from where we left off after a code deployment. In other
 words, we need exactly-once processing guarantees across code deployments.
 Spark does not support any state persistence across deployments so this is
 something we need to handle on our own.

 Hope that helps. Let me know if not.

 Thanks!
 Amit


 On Thu, Jun 11, 2015 at 10:02 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Hi,

 What do you mean by getting the offsets from the RDD? From my
 understanding, the offsetRange is a parameter you passed to KafkaRDD, so
 why do you want to get back the values you set earlier?

 Thanks
 Jerry

 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:


 Congratulations on the release of 1.4!

 I have been trying out the direct Kafka support in python but haven't
 been able to figure out how to get the offsets from the RDD. Looks like the
 documentation is yet to be updated to include Python examples (
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html).
 I am specifically looking for the equivalent of
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
 I tried digging through the python code but could not find anything
 related. Any pointers would be greatly appreciated.

 Thanks!
 Amit






Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
Hi,

What do you mean by getting the offsets from the RDD? From my
understanding, the offsetRange is a parameter you passed to KafkaRDD, so
why do you want to get back the values you set earlier?

Thanks
Jerry

2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com:


 Congratulations on the release of 1.4!

 I have been trying out the direct Kafka support in python but haven't been
 able to figure out how to get the offsets from the RDD. Looks like the
 documentation is yet to be updated to include Python examples (
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html). I
 am specifically looking for the equivalent of
 https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2.
 I tried digging through the python code but could not find anything
 related. Any pointers would be greatly appreciated.

 Thanks!
 Amit




Re: python/run-tests fails at spark master branch

2015-04-22 Thread Saisai Shao
Hi Hrishikesh,

It seems the behavior of kafka-assembly is a little different when using
Maven versus sbt: the assembly jar name and location differ when using `mvn
package`. This is actually a bug; I'm fixing it now.

Thanks
Jerry


2015-04-22 13:37 GMT+08:00 Hrishikesh Subramonian 
hrishikesh.subramon...@flytxt.com:

  Hi,

 The *python/run-tests* script executes successfully after I run the
 *'build/sbt assembly'* command. But the tests fail if I run it after the
 *'mvn -Dskiptests clean package'* command. Why does it work with *sbt
 assembly* and not with *mvn package*?

 --
 Hrishikesh

 On Wednesday 22 April 2015 07:38 AM, Saisai Shao wrote:

 Hi Hrishikesh,

  We now add a Kafka unit test for Python which relies on the Kafka assembly
  jar, so you need to run `sbt assembly` or `mvn package` first to get an
  assembly jar.



 2015-04-22 1:15 GMT+08:00 Marcelo Vanzin van...@cloudera.com:

 On Tue, Apr 21, 2015 at 1:30 AM, Hrishikesh Subramonian
 hrishikesh.subramon...@flytxt.com wrote:

  Run streaming tests ...
  Failed to find Spark Streaming Kafka assembly jar in
  /home/xyz/spark/external/kafka-assembly
  You need to build Spark with  'build/sbt assembly/assembly
  streaming-kafka-assembly/assembly' or 'build/mvn package' before running
  this program
 
 
  Is anybody facing the same problem?

 Have you built the assemblies before running the tests? (mvn package
 -DskipTests, or sbt assembly)


 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: python/run-tests fails at spark master branch

2015-04-21 Thread Saisai Shao
Hi Hrishikesh,

We now add a Kafka unit test for Python which relies on the Kafka assembly
jar, so you need to run `sbt assembly` or `mvn package` first to get an
assembly jar.



2015-04-22 1:15 GMT+08:00 Marcelo Vanzin van...@cloudera.com:

 On Tue, Apr 21, 2015 at 1:30 AM, Hrishikesh Subramonian
 hrishikesh.subramon...@flytxt.com wrote:

  Run streaming tests ...
  Failed to find Spark Streaming Kafka assembly jar in
  /home/xyz/spark/external/kafka-assembly
  You need to build Spark with  'build/sbt assembly/assembly
  streaming-kafka-assembly/assembly' or 'build/mvn package' before running
  this program
 
 
  Is anybody facing the same problem?

 Have you built the assemblies before running the tests? (mvn package
 -DskipTests, or sbt assembly)


 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Understanding shuffle file name conflicts

2015-03-25 Thread Saisai Shao
Yes, as Josh said, when an application is started, Spark creates a unique
application-wide folder for its temporary files. Jobs in this application
will have unique shuffle ids with unique file names, so shuffle stages
within an app will not run into name conflicts.

Also, shuffle files from different applications are separated by the
application folders, so name conflicts cannot happen across applications.

Maybe you changed some parts of this code while working on the patch.

Thanks
Jerry


2015-03-25 14:22 GMT+08:00 Josh Rosen rosenvi...@gmail.com:

 Which version of Spark are you using?  What do you mean when you say that
 you used a hardcoded location for shuffle files?

 If you look at the current DiskBlockManager code, it looks like it will
 create a per-application subdirectory in each of the local root directories.

 Here's the call to create a subdirectory in each root dir:
 https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L126

 This call to Utils.createDirectory() should result in a fresh subdirectory
 being created for just this application (note the use of random UUIDs, plus
 the check to ensure that the directory doesn't already exist):

 https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/util/Utils.scala#L273

 So, although the filenames for shuffle files are not globally unique,
 their full paths should be unique due to these unique per-application
 subdirectories.  Have you observed an instance where this isn't the case?

 - Josh

 On Tue, Mar 24, 2015 at 11:04 PM, Kannan Rajah kra...@maprtech.com
 wrote:

 Saisai,
 This is not the case when I use spark-submit to run 2 jobs, one after
 another. The shuffle id remains the same.


 --
 Kannan

 On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

  Hi Kannan,
 
  As far as I know, the shuffle id in ShuffleDependency is incremented, so
  even if you run the same job twice, the shuffle dependency as well as the
  shuffle id is different. The shuffle file name, which is composed of
  (shuffleId+mapId+reduceId), will therefore change, so as far as I know
  there's no name conflict even in the same directory.
 
  Thanks
  Jerry
 
 
  2015-03-25 1:56 GMT+08:00 Kannan Rajah kra...@maprtech.com:
 
  I am working on SPARK-1529. I ran into an issue with my change, where
 the
  same shuffle file was being reused across 2 jobs. Please note this only
  happens when I use a hard coded location to use for shuffle files, say
  /tmp. It does not happen with normal code path that uses
  DiskBlockManager
  to pick different directories for each run. So I want to understand how
  DiskBlockManager guarantees that such a conflict will never happen.
 
  Let's say the shuffle block id has a value of shuffle_0_0_0. So the
 data
  file name is shuffle_0_0_0.data and index file name is
  shuffle_0_0_0.index.
  If I run a spark job twice, one after another, these files get created
  under different directories because of the hashing logic in
  DiskBlockManager. But the hash is based off the file name, so how are
 we
  sure that there won't be a conflict ever?
 
  --
  Kannan
 
 
 





Re: Understanding shuffle file name conflicts

2015-03-25 Thread Saisai Shao
DiskBlockManager doesn't need to know the app id; all it needs to do is
create a folder with a unique (UUID-based) name and then put all the
shuffle files into it.

You can see the code in DiskBlockManager below; it creates a set of unique
folders when it is initialized, and these folders are app specific:

private[spark] val localDirs: Array[File] = createLocalDirs(conf)

The UUID is for creating an app-specific folder, and each shuffle file is
hashed by its shuffle block id, which is deterministic, via getFile as you
mentioned.
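
To make the layout concrete, here is a simplified sketch of how a
getFile-style lookup hashes a shuffle file name into one of those
app-specific folders (illustrative only, not the exact Spark code):

   import java.io.File

   // localDirs are the UUID-named, app-specific root folders created when
   // the DiskBlockManager is initialized.
   def resolveFile(localDirs: Array[File],
                   subDirsPerLocalDir: Int,
                   filename: String): File = {
     val hash = filename.hashCode & Integer.MAX_VALUE
     val dirId = hash % localDirs.length
     val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
     new File(new File(localDirs(dirId), "%02x".format(subDirId)), filename)
   }

   // e.g. resolveFile(localDirs, 64, "shuffle_0_0_0.data")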



2015-03-25 15:03 GMT+08:00 Kannan Rajah kra...@maprtech.com:

 Josh & Saisai,
 When I say I am using a hardcoded location for shuffle files, I mean that
 I am not using DiskBlockManager.getFile API because that uses the
 directories created locally on the node. But for my use case, I need to
 look at creating those shuffle files on HDFS.

 I will take a closer look at this, but I have a couple of questions. From
 what I understand, the DiskBlockManager code does not know about any
 application ID. It seems to pick up the top root temp dir location from
 SparkConf and then creates a bunch of sub dirs under it. When a shuffle
 file needs to be created using the getFile API, it hashes the file name to
 one of the existing dirs. At this point, I don't see any app-specific
 directory. Can you point out what I am missing here? The getFile API does
 not involve the random UUIDs; the random UUID generation happens inside
 createTempShuffleBlock, and that is invoked only from ExternalSorter. On
 the other hand, DiskBlockManager.getFile is used to create the shuffle
 index and data file.


 --
 Kannan

 On Tue, Mar 24, 2015 at 11:56 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

 Yes, as Josh said, when an application is started, Spark creates a unique
 application-wide folder for its temporary files. Jobs in this application
 will have unique shuffle ids with unique file names, so shuffle stages
 within an app will not run into name conflicts.

 Also, shuffle files from different applications are separated by the
 application folders, so name conflicts cannot happen across applications.

 Maybe you changed some parts of this code while working on the patch.

 Thanks
 Jerry


 2015-03-25 14:22 GMT+08:00 Josh Rosen rosenvi...@gmail.com:

 Which version of Spark are you using?  What do you mean when you say
 that you used a hardcoded location for shuffle files?

 If you look at the current DiskBlockManager code, it looks like it will
 create a per-application subdirectory in each of the local root directories.

 Here's the call to create a subdirectory in each root dir:
 https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L126

 This call to Utils.createDirectory() should result in a fresh
 subdirectory being created for just this application (note the use of
 random UUIDs, plus the check to ensure that the directory doesn't already
 exist):

 https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/util/Utils.scala#L273

 So, although the filenames for shuffle files are not globally unique,
 their full paths should be unique due to these unique per-application
 subdirectories.  Have you observed an instance where this isn't the case?

 - Josh

 On Tue, Mar 24, 2015 at 11:04 PM, Kannan Rajah kra...@maprtech.com
 wrote:

 Saisai,
 This is not the case when I use spark-submit to run 2 jobs, one after
 another. The shuffle id remains the same.


 --
 Kannan

 On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao sai.sai.s...@gmail.com
 wrote:

  Hi Kannan,
 
   As far as I know, the shuffle id in ShuffleDependency is incremented, so
   even if you run the same job twice, the shuffle dependency as well as
   the shuffle id is different. The shuffle file name, which is composed of
   (shuffleId+mapId+reduceId), will therefore change, so as far as I know
   there's no name conflict even in the same directory.
 
  Thanks
  Jerry
 
 
  2015-03-25 1:56 GMT+08:00 Kannan Rajah kra...@maprtech.com:
 
  I am working on SPARK-1529. I ran into an issue with my change,
 where the
  same shuffle file was being reused across 2 jobs. Please note this
 only
  happens when I use a hard coded location to use for shuffle files,
 say
  /tmp. It does not happen with normal code path that uses
  DiskBlockManager
  to pick different directories for each run. So I want to understand
 how
  DiskBlockManager guarantees that such a conflict will never happen.
 
  Let's say the shuffle block id has a value of shuffle_0_0_0. So the
 data
  file name is shuffle_0_0_0.data and index file name is
  shuffle_0_0_0.index.
  If I run a spark job twice, one after another, these files get
 created
  under different directories because of the hashing logic in
  DiskBlockManager. But the hash is based off the file name, so how
 are we
  sure that there won't be a conflict ever?
 
  --
  Kannan
 
 
 







Re: Understanding shuffle file name conflicts

2015-03-24 Thread Saisai Shao
Hi Kannan,

As far as I know, the shuffle id in ShuffleDependency is incremented, so
even if you run the same job twice, the shuffle dependency as well as the
shuffle id is different. The shuffle file name, which is composed of
(shuffleId+mapId+reduceId), will therefore change, so as far as I know
there's no name conflict even in the same directory.
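
For illustration, the names in question follow this shape (a sketch of the
naming scheme only, not the actual Spark classes):

   // shuffle_<shuffleId>_<mapId>_<reduceId>.data (and .index)
   def shuffleDataFileName(shuffleId: Int, mapId: Int, reduceId: Int): String =
     s"shuffle_${shuffleId}_${mapId}_${reduceId}.data"

   // The first shuffle of an application: shuffle_0_0_0.data. Running the
   // same job again inside the same application produces a new shuffleId,
   // hence a different name.
   shuffleDataFileName(0, 0, 0)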

Thanks
Jerry


2015-03-25 1:56 GMT+08:00 Kannan Rajah kra...@maprtech.com:

 I am working on SPARK-1529. I ran into an issue with my change, where the
 same shuffle file was being reused across 2 jobs. Please note this only
 happens when I use a hard coded location to use for shuffle files, say
 /tmp. It does not happen with normal code path that uses DiskBlockManager
 to pick different directories for each run. So I want to understand how
 DiskBlockManager guarantees that such a conflict will never happen.

 Let's say the shuffle block id has a value of shuffle_0_0_0. So the data
 file name is shuffle_0_0_0.data and index file name is shuffle_0_0_0.index.
 If I run a spark job twice, one after another, these files get created
 under different directories because of the hashing logic in
 DiskBlockManager. But the hash is based off the file name, so how are we
 sure that there won't be a conflict ever?

 --
 Kannan