Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-09 Thread Nan Zhu
just curious what happened on google’s spark operator?

On Thu, Nov 9, 2023 at 19:12 Ilan Filonenko  wrote:

> +1
>
> On Thu, Nov 9, 2023 at 7:43 PM Ryan Blue  wrote:
>
>> +1
>>
>> On Thu, Nov 9, 2023 at 4:23 PM Hussein Awala  wrote:
>>
>>> +1 for creating an official Kubernetes operator for Apache Spark
>>>
>>> On Fri, Nov 10, 2023 at 12:38 AM huaxin gao 
>>> wrote:
>>>
 +1

>>>
 On Thu, Nov 9, 2023 at 3:14 PM DB Tsai  wrote:

> +1
>
> To be completely transparent, I am employed in the same department as
> Zhou at Apple.
>
> I support this proposal, given that we have witnessed community adoption
> following the release of the Flink Kubernetes operator, which streamlined
> Flink deployment on Kubernetes.
>
> A well-maintained official Spark Kubernetes operator is essential for
> our Spark community as well.
>
> DB Tsai  |  https://www.dbtsai.com/
> 
>  |  PGP 42E5B25A8F7A82C1
>
> On Nov 9, 2023, at 12:05 PM, Zhou Jiang 
> wrote:
>
> Hi Spark community,
> I'm reaching out to initiate a conversation about the possibility of
> developing a Java-based Kubernetes operator for Apache Spark. Following 
> the
> operator pattern (
> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
> ),
> Spark users may manage applications and related components seamlessly 
> using
> native tools like kubectl. The primary goal is to simplify the Spark user
> experience on Kubernetes, minimizing the learning curve and operational
> complexities, thereby enabling users to focus on Spark application
> development.
> Although there are several open-source Spark on Kubernetes operators
> available, none of them are officially integrated into the Apache Spark
> project. As a result, these operators may lack active support and
> development for new features. Within this proposal, our aim is to 
> introduce
> a Java-based Spark operator as an integral component of the Apache Spark
> project. This solution has been employed internally at Apple for multiple
> years, operating millions of executors in real production environments. 
> The
> use of Java in this solution is intended to accommodate a wider user and
> contributor audience, especially those who are familiar with Scala.
> Ideally, this operator should have its dedicated repository, similar
> to Spark Connect Golang or Spark Docker, allowing it to maintain a loose
> connection with the Spark release cycle. This model is also followed by 
> the
> Apache Flink Kubernetes operator.
> We believe that this project holds the potential to evolve into a
> thriving community project over the long run. A comparison can be drawn
> with the Flink Kubernetes Operator: Apple has open-sourced internal Flink
> Kubernetes operator, making it a part of the Apache Flink project (
> https://github.com/apache/flink-kubernetes-operator
> ).
> This move has gained wide industry adoption and contributions from the
> community. In a mere year, the Flink operator has garnered more than 600
> stars and has attracted contributions from over 80 contributors. This
> showcases the level of community interest and collaborative momentum that
> can be achieved in similar scenarios.
> More details can be found in the SPIP doc: Spark Kubernetes Operator
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
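The operator pattern referenced above boils down to a reconcile loop: observe the actual state of the cluster, compare it with the desired state declared in a custom resource, and emit actions that converge the two. As a purely illustrative sketch (the state model and action names here are hypothetical, not the proposed operator's API):

```python
# Toy reconcile loop for a Spark-application-like custom resource.
# `desired` mirrors what a user would declare (e.g. via kubectl apply);
# `observed` mirrors what the cluster currently reports.
def reconcile(desired, observed):
    """Return the list of actions needed to converge observed state to desired."""
    actions = []
    if observed.get("driver") != "Running":
        actions.append("create-driver-pod")
    missing = desired["executors"] - observed.get("executors", 0)
    if missing > 0:
        actions.append(f"create-{missing}-executor-pods")
    return actions
```

In a real operator the desired state would come from a SparkApplication-style custom resource and the actions would be Kubernetes API calls, but the loop structure is the same.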
> 

Re: ASF policy violation and Scala version issues

2023-06-07 Thread Nan Zhu
 for EMR, I think they show 3.1.2-amazon in Spark UI, no?


On Wed, Jun 7, 2023 at 11:30 Grisha Weintraub 
wrote:

> Hi,
>
> I am not taking sides here, but just for fairness, I think it should be
> noted that AWS EMR does exactly the same thing.
> We choose the EMR version (e.g., 6.4.0) and it has an associated Spark
> version (e.g., 3.1.2).
> The Spark version here is not the original Apache version but AWS Spark
> distribution.
>
> On Wed, Jun 7, 2023 at 8:24 PM Dongjoon Hyun 
> wrote:
>
>> I disagree with you in several ways.
>>
>> The following is not a *minor* change like the given examples
>> (alterations to the start-up and shutdown scripts, configuration files,
>> file layout etc.).
>>
>> > The change you cite meets the 4th point, minor change, made for
>> integration reasons.
>>
>> The following is also wrong. Apache Spark 3.4.0 was never in such a state
>> after the 3.4.0 tag creation. The Apache Spark community didn't allow the
>> Scala-reverting patches in either the `master` branch or `branch-3.4`.
>>
>> > There is no known technical objection; this was after all at one point
>> the state of Apache Spark.
>>
>> Is the following your main point? So, you are selling a box "including
>> Harry Potter by J. K. Rowling whose main character is Barry instead of
>> Harry", but it's okay because you didn't sell the book itself? And, as a
>> cloud-vendor, you borrowed the box instead of selling it like private
>> libraries?
>>
>> > There is no standalone distribution of Apache Spark anywhere here.
>>
>> We are not asking for a big thing. Why are you so reluctant to say you are
>> not "Apache Spark 3.4.0" by simply saying "Apache Spark 3.4.0-databricks"?
>> What is the marketing reason here?
>>
>> Dongjoon.
>>
>>
>> On Wed, Jun 7, 2023 at 9:27 AM Sean Owen  wrote:
>>
>>> Hi Dongjoon, I think this conversation is not advancing anymore. I
>>> personally consider the matter closed unless you can find other support or
>>> respond with more specifics. While this perhaps should be on private@,
>>> I think it's not wrong as an instructive discussion on dev@.
>>>
>>> I don't believe you've made a clear argument about the problem, or how
>>> it relates specifically to policy. Nevertheless I will show you my logic.
>>>
>>> You are asserting that a vendor cannot call a product Apache Spark 3.4.0
>>> if it omits a patch updating a Scala maintenance version. This difference
>>> has no known impact on usage, as far as I can tell.
>>>
>>> Let's see what policy requires:
>>>
>>> 1/ All source code changes must meet at least one of the acceptable
>>> changes criteria set out below:
>>> - The change has been accepted by the relevant Apache project community for
>>> inclusion in a future release. Note that the process used to accept changes
>>> and how that acceptance is documented varies between projects.
>>> - A change is a fix for an undisclosed security issue; and the fix is
>>> not publicly disclosed as a security fix; and the Apache project has been
>>> notified of both the issue and the proposed fix; and the PMC has rejected
>>> neither the vulnerability report nor the proposed fix.
>>> - A change is a fix for a bug; and the Apache project has been notified
>>> of both the bug and the proposed fix; and the PMC has rejected neither the
>>> bug report nor the proposed fix.
>>> - Minor changes (e.g. alterations to the start-up and shutdown scripts,
>>> configuration files, file layout etc.) to integrate with the target
>>> platform providing the Apache project has not objected to those changes.
>>>
>>> The change you cite meets the 4th point, minor change, made for
>>> integration reasons. There is no known technical objection; this was after
>>> all at one point the state of Apache Spark.
>>>
>>>
>>> 2/ A version number must be used that both clearly differentiates it
>>> from an Apache Software Foundation release and clearly identifies the
>>> Apache Software Foundation version on which the software is based.
>>>
>>> Keep in mind the product here is not "Apache Spark", but the "Databricks
>>> Runtime 13.1 (including Apache Spark 3.4.0)". That is, there is far more
>>> than a version number differentiating this product from Apache Spark. There
>>> is no standalone distribution of Apache Spark anywhere here. I believe that
>>> easily matches the intent.
>>>
>>>
>>> 3/ The documentation must clearly identify the Apache Software
>>> Foundation version on which the software is based.
>>>
>>> Clearly, yes.
>>>
>>>
>>> 4/ The end user expects that the distribution channel will back-port
>>> fixes. It is not necessary to back-port all fixes. Selection of fixes to
>>> back-port must be consistent with the update policy of that distribution
>>> channel.
>>>
>>> I think this is safe to say too. Indeed this explicitly contemplates not
>>> back-porting a change.
>>>
>>>
>>> Backing up, you can see from this document that the spirit of it is:
>>> don't include changes in your own Apache Foo x.y that aren't wanted by the
>>> project, and still 

Re: Spark 2.4.5 release for Parquet and Avro dependency updates?

2019-11-22 Thread Nan Zhu
I am not sure if it is a good practice to have breaking changes in
dependencies for maintenance releases

On Fri, Nov 22, 2019 at 8:56 AM Michael Heuer  wrote:

> Hello,
>
> Avro 1.8.2 to 1.9.1 is a binary incompatible update, and it appears that
> Parquet 1.10.1 to 1.11 will be a runtime-incompatible update (see thread on
> dev@parquet
> 
> ).
>
> Might there be any desire to cut a Spark 2.4.5 release so that users can
> pick up these changes independently of all the other changes in Spark 3.0?
>
> Thank you in advance,
>
>michael
>


Re: Time to cut an Apache 2.4.1 release?

2019-02-12 Thread Nan Zhu
just filed a JIRA in https://issues.apache.org/jira/browse/SPARK-26862
this issue only happens in 2.4.0 but not in 2.3.2

Would anyone help to look into that?



On Tue, Feb 12, 2019 at 10:41 AM DB Tsai  wrote:

> Great. I'll prepare the release for voting. Thanks!
>
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
>
> > On Feb 12, 2019, at 4:11 AM, Wenchen Fan  wrote:
> >
> > +1 for 2.4.1
> >
> > On Tue, Feb 12, 2019 at 7:55 PM Hyukjin Kwon 
> wrote:
> > +1 for 2.4.1
> >
> > On Tue, Feb 12, 2019 at 4:56 PM, Dongjin Lee wrote:
> > > SPARK-23539 is a non-trivial improvement, so probably would not be
> back-ported to 2.4.x.
> >
> > Got it. It seems reasonable.
> >
> > Committers:
> >
> > Please don't omit SPARK-23539 from 2.5.0. Kafka community needs this
> feature.
> >
> > Thanks,
> > Dongjin
> >
> > On Tue, Feb 12, 2019 at 1:50 PM Takeshi Yamamuro 
> wrote:
> > +1, too.
> > branch-2.4 accumulates too many commits..:
> >
> https://github.com/apache/spark/compare/0a4c03f7d084f1d2aa48673b99f3b9496893ce8d...af3c7111efd22907976fc8bbd7810fe3cfd92092
> >
> > On Tue, Feb 12, 2019 at 12:36 PM Dongjoon Hyun 
> wrote:
> > Thank you, DB.
> >
> > +1, Yes. It's time for preparing 2.4.1 release.
> >
> > Bests,
> > Dongjoon.
> >
> > On 2019/02/12 03:16:05, Sean Owen  wrote:
> > > I support a 2.4.1 release now, yes.
> > >
> > > SPARK-23539 is a non-trivial improvement, so probably would not be
> > > back-ported to 2.4.x. SPARK-26154 does look like a bug whose fix could
> > > be back-ported, but that's a big change. I wouldn't hold up 2.4.1 for
> > > it, but it could go in if otherwise ready.
> > >
> > >
> > > On Mon, Feb 11, 2019 at 5:20 PM Dongjin Lee 
> wrote:
> > > >
> > > > Hi DB,
> > > >
> > > > Could you add SPARK-23539[^1] into 2.4.1? I opened the PR[^2] a
> little while ago, but it has not been included in 2.3.0 nor received enough review.
> > > >
> > > > Thanks,
> > > > Dongjin
> > > >
> > > > [^1]: https://issues.apache.org/jira/browse/SPARK-23539
> > > > [^2]: https://github.com/apache/spark/pull/22282
> > > >
> > > > On Tue, Feb 12, 2019 at 6:28 AM Jungtaek Lim 
> wrote:
> > > >>
> > > >> Given SPARK-26154 [1] is a correctness issue and PR [2] is
> submitted, I hope it can be reviewed and included within Spark 2.4.1 -
> otherwise it will be a long-live correctness issue.
> > > >>
> > > >> Thanks,
> > > >> Jungtaek Lim (HeartSaVioR)
> > > >>
> > > >> 1. https://issues.apache.org/jira/browse/SPARK-26154
> > > >> 2. https://github.com/apache/spark/pull/23634
> > > >>
> > > >>
> > > >>> On Tue, Feb 12, 2019 at 6:17 AM, DB Tsai wrote:
> > > >>>
> > > >>> Hello all,
> > > >>>
> > > >>> I am preparing to cut a new Apache 2.4.1 release as there are many
> bugs and correctness issues fixed in branch-2.4.
> > > >>>
> > > >>> The list of addressed issues are
> https://issues.apache.org/jira/browse/SPARK-26583?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.4.1%20order%20by%20updated%20DESC
> > > >>>
> > > >>> Let me know if you have any concern or any PR you would like to
> get in.
> > > >>>
> > > >>> Thanks!
> > > >>>
> > > >>>
> -
> > > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >>>
> > > >
> > > >
> > > > --
> > > > Dongjin Lee
> > > >
> > > > A hitchhiker in the mathematical world.
> > > >
> > > > github: github.com/dongjinleekr
> > > > linkedin: kr.linkedin.com/in/dongjinleekr
> > > > speakerdeck: speakerdeck.com/dongjin
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> >
> > --
> > ---
> > Takeshi Yamamuro
> >
> >
> > --
> > Dongjin Lee
> >
> > A hitchhiker in the mathematical world.
> >
> > github: github.com/dongjinleekr
> > linkedin: kr.linkedin.com/in/dongjinleekr
> > speakerdeck: speakerdeck.com/dongjin
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Nan Zhu
.how I skipped the last part

On Tue, May 8, 2018 at 11:16 AM, Reynold Xin <r...@databricks.com> wrote:

> Yes, Nan, totally agree. To be on the same page, that's exactly what I
> wrote wasn't it?
>
> On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>
>> besides that, one of the things which is needed by multiple frameworks is
>> to schedule tasks in a single wave
>>
>> i.e.
>>
>> if some frameworks like xgboost/mxnet require 50 parallel workers, Spark
>> is desired to provide a capability to ensure that either we run 50 tasks at
>> once, or we should quit the complete application/job after some timeout
>> period
>>
>> Best,
>>
>> Nan
>>
>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> I think that's what Xiangrui was referring to. Instead of retrying a
>>> single task, retry the entire stage, and the entire stage of tasks need to
>>> be scheduled all at once.
>>>
>>>
>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
>>>>
>>>>>
>>>>>>- Fault tolerance and execution model: Spark assumes fine-grained
>>>>>>task recovery, i.e. if something fails, only that task is rerun. This
>>>>>>doesn’t match the execution model of distributed ML/DL frameworks 
>>>>>> that are
>>>>>>typically MPI-based, and rerunning a single task would lead to the 
>>>>>> entire
>>>>>>system hanging. A whole stage needs to be re-run.
>>>>>>
>>>>>> This is not only useful for integrating with 3rd-party frameworks,
>>>>> but also useful for scaling MLlib algorithms. One of my earliest attempts
>>>>> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>>>>> with some compromised solutions. With the new execution model, we can set
>>>>> up a hybrid cluster and do all-reduce properly.
>>>>>
>>>>>
>>>> Is there a particular new execution model you are referring to or do we
>>>> plan to investigate a new execution model ?  For the MPI-like model, we
>>>> also need gang scheduling (i.e. schedule all tasks at once or none of them)
>>>> and I don't think we have support for that in the scheduler right now.
>>>>
>>>>>
>>>>>> --
>>>>>
>>>>> Xiangrui Meng
>>>>>
>>>>> Software Engineer
>>>>>
>>>>> Databricks Inc. [image: http://databricks.com]
>>>>> <http://databricks.com/>
>>>>>
>>>>
>>>>
>>


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Nan Zhu
besides that, one of the things which is needed by multiple frameworks is
to schedule tasks in a single wave

i.e.

if some frameworks like xgboost/mxnet require 50 parallel workers, Spark
is desired to provide a capability to ensure that either we run 50 tasks at
once, or we should quit the complete application/job after some timeout
period
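The "single wave" requirement can be sketched as an all-or-nothing launch with a timeout: the scheduler either finds enough free slots to start every task at once, or gives up and fails the job. A minimal sketch (the helper name and slot model are hypothetical, not Spark's scheduler API):

```python
import time

def launch_wave(required_slots, free_slots, timeout_s, poll_s=0.01):
    """Return True once `required_slots` tasks can start simultaneously,
    False if the timeout expires first (the caller should then fail the job)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if free_slots() >= required_slots:
            return True  # launch the whole wave at once
        time.sleep(poll_s)
    return False  # never run a partial wave
```

The key property is that no task starts until all of them can, which is what MPI-style frameworks such as xgboost/mxnet need to avoid hanging.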

Best,

Nan

On Tue, May 8, 2018 at 11:10 AM, Reynold Xin  wrote:

> I think that's what Xiangrui was referring to. Instead of retrying a
> single task, retry the entire stage, and the entire stage of tasks need to
> be scheduled all at once.
>
>
> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>>
>>>
- Fault tolerance and execution model: Spark assumes fine-grained
task recovery, i.e. if something fails, only that task is rerun. This
doesn’t match the execution model of distributed ML/DL frameworks that 
 are
typically MPI-based, and rerunning a single task would lead to the 
 entire
system hanging. A whole stage needs to be re-run.

 This is not only useful for integrating with 3rd-party frameworks, but
>>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>> ). But we ended up
>>> with some compromised solutions. With the new execution model, we can set
>>> up a hybrid cluster and do all-reduce properly.
>>>
>>>
>> Is there a particular new execution model you are referring to or do we
>> plan to investigate a new execution model ?  For the MPI-like model, we
>> also need gang scheduling (i.e. schedule all tasks at once or none of them)
>> and I don't think we have support for that in the scheduler right now.
>>
>>>
 --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc. [image: http://databricks.com] 
>>>
>>
>>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Nan Zhu
+1  (non-binding), tested with internal workloads and benchmarks

On Mon, Feb 26, 2018 at 12:09 PM, Michael Armbrust 
wrote:

> +1 all our pipelines have been running the RC for several days now.
>
> On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun 
> wrote:
>
>> +1 (non-binding).
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li  wrote:
>>>
 +1 (binding) in Spark SQL, Core and PySpark.

 Xiao

 2018-02-24 14:49 GMT-08:00 Ricardo Almeida <
 ricardo.alme...@actnowib.com>:

> +1 (non-binding)
>
> same as previous RC
>
> On 24 February 2018 at 11:10, Hyukjin Kwon 
> wrote:
>
>> +1
>>
>> 2018-02-24 16:57 GMT+09:00 Bryan Cutler :
>>
>>> +1
>>> Tests passed and additionally ran Arrow related tests and did some
>>> perf checks with python 2.7.14
>>>
>>> On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau 
>>> wrote:
>>>
 Note: given the state of Jenkins I'd love to see Bryan Cutler or
 someone with Arrow experience sign off on this release.

 On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian 
 wrote:

> +1 (binding)
>
> Passed all the tests, looks good.
>
> Cheng
>
> On 2/23/18 15:00, Holden Karau wrote:
>
> +1 (binding)
> PySpark artifacts install in a fresh Py3 virtual env
>
> On Feb 23, 2018 7:55 AM, "Denny Lee" 
> wrote:
>
>> +1 (non-binding)
>>
>> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
>> joshgoldsboroughs...@gmail.com> wrote:
>>
>>> New to testing out Spark RCs for the community but I was able to
>>> run some of the basic unit tests without error so for what it's 
>>> worth, I'm
>>> a +1.
>>>
>>> On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <
>>> samee...@apache.org> wrote:
>>>
 Please vote on releasing the following candidate as Apache
 Spark version 2.3.0. The vote is open until Tuesday February 27, 
 2018 at
 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 votes 
 are cast.


 [ ] +1 Release this package as Apache Spark 2.3.0

 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see
 https://spark.apache.org/

 The tag to be voted on is v2.3.0-rc5:
 https://github.com/apache/spark/tree/v2.3.0-rc5
 (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)

 List of JIRA tickets resolved in this release can be found
 here: https://issues.apache.org/jira
 /projects/SPARK/versions/12339551

 The release files, including signatures, digests, etc. can be
 found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/

 Release artifacts are signed with the following key:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapache
 spark-1266/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs
 /_site/index.html


 FAQ

 ===
 What are the unresolved issues targeted for 2.3.0?
 ===

 Please see https://s.apache.org/oXKi. At the time of writing,
 there are currently no known release blockers.

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by
 taking an existing Spark workload and running on this release 
 candidate,
 then reporting any regressions.

 If you're working in PySpark you can set up a virtual env and
 install the current RC and see if anything important breaks, in the
 Java/Scala you can add the staging repository to your projects 
 resolvers
 and test with the RC (make sure to clean up the 

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
nvm

On Tue, Jan 9, 2018 at 9:42 AM, Nan Zhu <zhunanmcg...@gmail.com> wrote:

> Hi, all
>
> Out of curiosity, I just found a bunch of Palantir releases under
> org.apache.spark in maven central (https://mvnrepository.com/
> artifact/org.apache.spark/spark-core_2.11)?
>
> Is it on purpose?
>
> Best,
>
> Nan
>
>
>


Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
Hi, all

Out of curiosity, I just found a bunch of Palantir releases under
org.apache.spark in maven central (
https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11)?

Is it on purpose?

Best,

Nan


Request for review of SPARK-22599

2017-11-29 Thread Nan Zhu
Hi, all

When running perf tests for Spark, we found that enabling the table cache does
not bring the expected speedup compared to cloud storage + Parquet in many
scenarios. We identified that the performance cost comes from the fact that
the current InMemoryRelation/InMemoryTableScanExec will traverse the
complete cached table even for highly selective queries. Compared to
Parquet, which utilizes the file footer to skip the unnecessary parts of the
file, execution against the cached table is slower.
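For illustration, the footer-style skipping contrasted here can be modeled with per-batch min/max statistics on the cached data: a selective predicate then prunes whole batches without touching their rows. This is a toy model, not the actual InMemoryTableScanExec change:

```python
# Toy model of stats-based batch pruning for a cached table: each cached
# batch keeps min/max statistics for a column, and a selective predicate
# skips whole batches whose stats rule them out.
batches = [
    {"min": 0,  "max": 9,  "rows": list(range(0, 10))},
    {"min": 10, "max": 19, "rows": list(range(10, 20))},
    {"min": 20, "max": 29, "rows": list(range(20, 30))},
]

def scan_greater_than(batches, threshold):
    """Evaluate `value > threshold`, pruning batches whose max rules them out."""
    out, batches_scanned = [], 0
    for b in batches:
        if b["max"] <= threshold:
            continue  # stats prove no row qualifies; skip the whole batch
        batches_scanned += 1
        out.extend(r for r in b["rows"] if r > threshold)
    return out, batches_scanned
```

A highly selective predicate touches one batch instead of three, which is the kind of saving the proposed optimization targets.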

We have filed JIRA in https://issues.apache.org/jira/browse/SPARK-22599 and
have the corresponding PR in https://github.com/apache/spark/pull/19810
(design doc:
https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing,
which is also linked in JIRA/PR)

Our performance evaluation suggests a speedup of up to 41% compared to the
current implementation (
https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing
)

Please share your thoughts to help us improve the optimization for
in-memory table scanning in Spark.

Best,

Nan


Re: Outstanding Spark 2.1.1 issues

2017-03-20 Thread Nan Zhu
I think https://issues.apache.org/jira/browse/SPARK-19280 should be a
blocker

Best,

Nan

On Mon, Mar 20, 2017 at 8:18 PM, Felix Cheung 
wrote:

> I've been scrubbing R and think we are tracking 2 issues
>
> https://issues.apache.org/jira/browse/SPARK-19237
>
> https://issues.apache.org/jira/browse/SPARK-19925
>
>
>
>
> --
> *From:* holden.ka...@gmail.com  on behalf of
> Holden Karau 
> *Sent:* Monday, March 20, 2017 3:12:35 PM
> *To:* dev@spark.apache.org
> *Subject:* Outstanding Spark 2.1.1 issues
>
> Hi Spark Developers!
>
> As we start working on the Spark 2.1.1 release I've been looking at our
> outstanding issues still targeted for it. I've tried to break it down by
> component so that people in charge of each component can take a quick look
> and see if any of these things can/should be re-targeted to 2.2 or 2.1.2 &
> the overall list is pretty short (only 9 items - 5 if we only look at
> explicitly tagged) :)
>
> If you're working on something for Spark 2.1.1 and it doesn't show up in
> this list please speak up now :) We have a lot of issues (including "in
> progress") that are listed as impacting 2.1.0, but they aren't targeted for
> 2.1.1 - if there is something you are working on there which should be
> targeted for 2.1.1 please let us know so it doesn't slip through the cracks.
>
> The query string I used for looking at the 2.1.1 open issues is:
>
> ((affectedVersion = 2.1.1 AND cf[12310320] is Empty) OR fixVersion = 2.1.1
> OR cf[12310320] = "2.1.1") AND project = spark AND resolution = Unresolved
> ORDER BY priority DESC
>
> None of the open issues appear to be a regression from 2.1.0, but those
> seem more likely to show up during the RC process (thanks in advance to
> everyone testing their workloads :)) & generally none of them seem to be
>
> (Note: the cfs are for Target Version/s field)
>
> Critical Issues:
>  SQL:
>   SPARK-19690  - Join
> a streaming DataFrame with a batch DataFrame may not work - PR
> https://github.com/apache/spark/pull/17052 (review in progress by
> zsxwing, currently failing Jenkins)*
>
> Major Issues:
>  SQL:
>   SPARK-19035  - rand()
> function in case when cause failed - no outstanding PR (consensus on JIRA
> seems to be leaning towards it being a real issue but not necessarily
> everyone agrees just yet - maybe we should slip this?)*
>  Deploy:
>   SPARK-19522 
>  - --executor-memory flag doesn't work in local-cluster mode -
> https://github.com/apache/spark/pull/16975 (review in progress by vanzin,
> but PR currently stalled waiting on response) *
>  Core:
>   SPARK-20025  - Driver
> fail over will not work, if SPARK_LOCAL* env is set. -
> https://github.com/apache/spark/pull/17357 (waiting on review) *
>  PySpark:
>  SPARK-19955  - Update
> run-tests to support conda [ Part of Dropping 2.6 support -- which we
> shouldn't do in a minor release -- but also fixes pip installability tests
> to run in Jenkins ]-  PR failing Jenkins (I need to poke this some more,
> but seems like 2.7 support works but some other issues. Maybe slip to 2.2?)
>
> Minor issues:
>  Tests:
>   SPARK-19612  - Tests
> failing with timeout - No PR per-se but it seems unrelated to the 2.1.1
> > release. It's not targeted for 2.1.1 but listed as affecting 2.1.1 - I'd
> consider explicitly targeting this for 2.2?
>  PySpark:
>   SPARK-19570  - Allow
> to disable hive in pyspark shell - https://github.com/apache/sp
> ark/pull/16906 PR exists but its difficult to add automated tests for
> this (although if SPARK-19955
>  gets in would make
> testing this easier) - no reviewers yet. Possible re-target?*
>  Structured Streaming:
>   SPARK-19613  - Flaky
> > test: StateStoreRDDSuite.versioning and immutability - It's not targeted
> for 2.1.1 but listed as affecting 2.1.1 - I'd consider explicitly targeting
> this for 2.2?
>  ML:
>   SPARK-19759 
>  - ALSModel.predict on Dataframes : potential optimization by not using
> blas - No PR consider re-targeting unless someone has a PR waiting in the
> wings?
>
> Explicitly targeted issues are marked with a *, the remaining issues are
> listed as impacting 2.1.1 and don't have a specific target version set.
>
> Since 2.1.1 continues the 2.1.0 branch, looking at 2.1.0 shows 1 open
> blocker in SQL( SPARK-19983
>  ),
>
> Query string is:
>
> affectedVersion = 2.1.0 AND cf[12310320] is EMPTY AND project = 

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Nan Zhu
Congratulations!

On Tue, Jan 24, 2017 at 4:50 PM, Hyukjin Kwon  wrote:

> Congratuation!!
>
> 2017-01-25 9:22 GMT+09:00 Takeshi Yamamuro :
>
>> Congrats!
>>
>> // maropu
>>
>> On Wed, Jan 25, 2017 at 9:20 AM, Kousuke Saruta <
>> saru...@oss.nttdata.co.jp> wrote:
>>
>>> Congrats, Burak and Holden!
>>>
>>> - Kousuke
>>>
>>> On 2017/01/25 6:36, Herman van Hövell tot Westerflier wrote:
>>>
>>> Congrats!
>>>
>>> On Tue, Jan 24, 2017 at 10:20 PM, Felix Cheung <
>>> felixcheun...@hotmail.com> wrote:
>>>
 Congrats and welcome!!


 --
 *From:* Reynold Xin 
 *Sent:* Tuesday, January 24, 2017 10:13:16 AM
 *To:* dev@spark.apache.org
 *Cc:* Burak Yavuz; Holden Karau
 *Subject:* welcoming Burak and Holden as committers

 Hi all,

 Burak and Holden have recently been elected as Apache Spark committers.

 Burak has been very active in a large number of areas in Spark,
 including linear algebra, stats/maths functions in DataFrames, Python/R
 APIs for DataFrames, dstream, and most recently Structured Streaming.

 Holden has been a long time Spark contributor and evangelist. She has
 written a few books on Spark, as well as frequent contributions to the
 Python API to improve its usability and performance.

 Please join me in welcoming the two!



>>>
>>>
>>> --
>>>
>>>
>>> [image: Register today for Spark Summit East 2017!]
>>> 
>>>
>>> Herman van Hövell
>>>
>>> Software Engineer
>>>
>>> Databricks Inc.
>>>
>>> hvanhov...@databricks.com
>>>
>>> +31 6 420 590 27
>>>
>>> databricks.com
>>>
>>> [image: http://databricks.com] 
>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>


Re: Welcoming Yanbo Liang as a committer

2016-06-03 Thread Nan Zhu
Congratulations !

-- 
Nan Zhu
On June 3, 2016 at 10:50:33 PM, Ted Yu (yuzhih...@gmail.com) wrote:

Congratulations, Yanbo.

On Fri, Jun 3, 2016 at 7:48 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Hi all,

The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been a 
super active contributor in many areas of MLlib. Please join me in welcoming 
Yanbo!

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org




Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, 

We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy 
to announce the release of XGBoost4J 
(http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html),
 a Portable Distributed XGBoost in Spark, Flink and Dataflow 

XGBoost is an optimized distributed gradient boosting library designed to be 
highly efficient, flexible and portable. XGBoost provides a parallel tree 
boosting algorithm (also known as GBDT, GBM) that solves many data science problems in a 
fast and accurate way. It has been the winning solution for many machine 
learning scenarios, ranging from Machine Learning Challenges 
(https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions)
 to Industrial User Cases 
(https://github.com/dmlc/xgboost/tree/master/demo#usecases) 
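As a toy illustration of the tree-boosting idea mentioned above (this is a plain squared-loss GBM with depth-1 "stumps" on 1-D data, not XGBoost's implementation):

```python
# Toy gradient boosting: each round fits a depth-1 regression stump to the
# current residuals and adds it, scaled by a learning rate, to the model.
def fit_stump(xs, residuals):
    """Pick the split threshold minimizing squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]  # threshold, left value, right value

def boost(xs, ys, rounds=50, lr=0.3):
    """Fit an additive model of stumps; returns the stumps and final predictions."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        t, lv, rv = fit_stump(xs, resid)
        stumps.append((t, lv, rv))
        pred = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return stumps, pred
```

XGBoost adds regularization, deeper trees, second-order gradients, and distributed training on top of this basic additive scheme.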

XGBoost4J is a new package in XGBoost aiming to provide clean Scala/Java 
APIs and seamless integration with mainstream data processing platforms, 
like Apache Spark. With XGBoost4J, users can run XGBoost as a stage of a Spark 
job and build a unified pipeline from ETL to model training to data product 
service within Spark, instead of jumping across two different systems, i.e. 
XGBoost and Spark. (Example: 
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/DistTrainWithSpark.scala)

Today, we release the first version of XGBoost4J to bring more choices to 
Spark users who are seeking solutions for building a highly efficient data 
analytics platform, and to enrich the Spark ecosystem. We will keep moving forward 
to integrate with more features of Spark. Of course, you are more than welcome 
to join us and contribute to the project!

For more details of distributed XGBoost, you can refer to the recently 
published paper: http://arxiv.org/abs/1603.02754

Best, 

-- 
Nan Zhu
http://codingcat.me



tests blocked at "don't call ssc.stop in listener"

2015-11-26 Thread Nan Zhu
Hi, all

Has anyone noticed that some of the tests just block at the test case “don't call 
ssc.stop in listener” in StreamingListenerSuite?

Examples:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46766/console

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46776/console


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46774/console


I originally found it in my own PR and thought it was a bug introduced by me, 
but later I found that the tests for PRs on entirely different things also 
block at the same point…

Just filed a JIRA https://issues.apache.org/jira/browse/SPARK-12021


Best,  

--  
Nan Zhu
http://codingcat.me



Re: A proposal for Spark 2.0

2015-11-12 Thread Nan Zhu
Speaking specifically of the Parameter Server, I think the current agreement is 
that the PS shall exist as a third-party library instead of a component of the 
core code base, isn't it?

Best,  

--  
Nan Zhu
http://codingcat.me


On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:

> Who has ideas about machine learning? Spark is missing some features for 
> machine learning, for example the parameter server.
>  
>  
> > On Nov 12, 2015, at 05:32, Matei Zaharia <matei.zaha...@gmail.com 
> > (mailto:matei.zaha...@gmail.com)> wrote:
> >  
> > I like the idea of popping out Tachyon to an optional component too to 
> > reduce the number of dependencies. In the future, it might even be useful 
> > to do this for Hadoop, but it requires too many API changes to be worth 
> > doing now.
> >  
> > Regarding Scala 2.12, we should definitely support it eventually, but I 
> > don't think we need to block 2.0 on that because it can be added later too. 
> > Has anyone investigated what it would take to run on there? I imagine we 
> > don't need many code changes, just maybe some REPL stuff.
> >  
> > Needless to say, but I'm all for the idea of making "major" releases as 
> > undisruptive as possible in the model Reynold proposed. Keeping everyone 
> > working with the same set of releases is super important.
> >  
> > Matei
> >  
> > > On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com 
> > > (mailto:so...@cloudera.com)> wrote:
> > >  
> > > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com 
> > > (mailto:r...@databricks.com)> wrote:
> > > > to the Spark community. A major release should not be very different 
> > > > from a
> > > > minor release and should not be gated based on new features. The main
> > > > purpose of a major release is an opportunity to fix things that are 
> > > > broken
> > > > in the current API and remove certain deprecated APIs (examples follow).
> > > >  
> > >  
> > >  
> > > Agree with this stance. Generally, a major release might also be a
> > > time to replace some big old API or implementation with a new one, but
> > > I don't see obvious candidates.
> > >  
> > > I wouldn't mind turning attention to 2.x sooner than later, unless
> > > there's a fairly good reason to continue adding features in 1.x to a
> > > 1.7 release. The scope as of 1.6 is already pretty darned big.
> > >  
> > >  
> > > > 1. Scala 2.11 as the default build. We should still support Scala 2.10, 
> > > > but
> > > > it has been end-of-life.
> > > >  
> > >  
> > >  
> > > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> > > be quite stable, and 2.10 will have been EOL for a while. I'd propose
> > > dropping 2.10. Otherwise it's supported for 2 more years.
> > >  
> > >  
> > > > 2. Remove Hadoop 1 support.
> > >  
> > > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > > sort of 'alpha' and 'beta' releases) and even <2.6.
> > >  
> > > I'm sure we'll think of a number of other small things -- shading a
> > > bunch of stuff? reviewing and updating dependencies in light of
> > > simpler, more recent dependencies to support from Hadoop etc?
> > >  
> > > Farming out Tachyon to a module? (I felt like someone proposed this?)
> > > Pop out any Docker stuff to another repo?
> > > Continue that same effort for EC2?
> > > Farming out some of the "external" integrations to another repo (?
> > > controversial)
> > >  
> > > See also anything marked version "2+" in JIRA.
> > >  
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> > > (mailto:dev-unsubscr...@spark.apache.org)
> > > For additional commands, e-mail: dev-h...@spark.apache.org 
> > > (mailto:dev-h...@spark.apache.org)
> > >  
> >  
> >  
> >  
> >  
>  
>  
>  
>  
>  
>  
>  




Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Thank you, Jie! Very nice work!

--  
Nan Zhu
http://codingcat.me


On Friday, June 26, 2015 at 8:17 AM, Huang, Jie wrote:

 Correct. Your calculation is right!  
   
 We have been aware of that kmeans performance drop also. According to our 
 observation, it is caused by some unbalanced executions among different 
 tasks, even though we used the same test data between the different versions 
 (i.e., it is not caused by data skew).
   
 The corresponding run time information has been shared with Xiangrui, and he 
 is now helping to identify the root cause.  
   
 Thank you. Best Regards,
 Grace (Huang Jie)
   
 From: Nan Zhu [mailto:zhunanmcg...@gmail.com]  
 Sent: Friday, June 26, 2015 7:59 PM
 To: Huang, Jie
 Cc: u...@spark.apache.org (mailto:u...@spark.apache.org); 
 dev@spark.apache.org (mailto:dev@spark.apache.org)
 Subject: Re: [SparkScore]Performance portal for Apache Spark - WW26  
   
 Hi, Jie,  
   
 Thank you very much for this work! Very helpful!
   
 I just would like to confirm that I understand the numbers correctly: if we 
 take the running time of the 1.2 release as 100 s,
   
 9.1% - means the running time is 109.1 s?
   
 -4% - means it comes to 96 s?
   
 If that's the true meaning of the numbers, what happened to k-means in 
 HiBench?
   
 Best,
   
 --  
 Nan Zhu
 http://codingcat.me
   
 On Friday, June 26, 2015 at 7:24 AM, Huang, Jie wrote:
  Intel® Xeon® CPU E5-2697  




Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Hi, Jie,  

Thank you very much for this work! Very helpful!

I just would like to confirm that I understand the numbers correctly: if we 
take the running time of the 1.2 release as 100 s,

9.1% - means the running time is 109.1 s?

-4% - means it comes to 96 s?

If that’s the true meaning of the numbers, what happened to k-means in HiBench?
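The arithmetic being confirmed above can be written down in a couple of lines (a sketch; `ScoreDelta` and its method name are hypothetical, not part of the SparkScore tooling):

```java
// Reading the SparkScore deltas: with the 1.2-release runtime normalized to
// 100 s, a +9.1% delta corresponds to 109.1 s and a -4% delta to 96 s.
public class ScoreDelta {
    public static double runtimeSeconds(double baselineSeconds, double deltaPercent) {
        return baselineSeconds * (1.0 + deltaPercent / 100.0);
    }
}
```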

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, June 26, 2015 at 7:24 AM, Huang, Jie wrote:

 Intel® Xeon® CPU E5-2697  




Re: Welcoming three new committers

2015-02-03 Thread Nan Zhu
Congratulations!

--  
Nan Zhu
http://codingcat.me


On Tuesday, February 3, 2015 at 8:08 PM, Xuefeng Wu wrote:

 Congratulations! Well done.  
  
 Yours, Xuefeng Wu 吴雪峰 敬上
  
  On Feb 4, 2015, at 6:34 AM, Matei Zaharia matei.zaha...@gmail.com 
  (mailto:matei.zaha...@gmail.com) wrote:
   
  Hi all,
   
  The PMC recently voted to add three new committers: Cheng Lian, Joseph 
  Bradley and Sean Owen. All three have been major contributors to Spark in 
  the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many 
  pieces throughout Spark Core. Join me in welcoming them as committers!
   
  Matei
   
  
  
  
  




Re: missing document of several messages in actor-based receiver?

2015-01-09 Thread Nan Zhu
Hi,  

I have created the PR for these two issues

Best,  

--  
Nan Zhu
http://codingcat.me


On Friday, January 9, 2015 at 7:38 AM, Nan Zhu wrote:

 Thanks, TD,  
  
 I just created 2 JIRAs to track these,  
  
 https://issues.apache.org/jira/browse/SPARK-5174
  
 https://issues.apache.org/jira/browse/SPARK-5175
  
 Can you help assign these two JIRAs to me? I'd like to submit the 
 PRs.
  
 Best,  
  
 --  
 Nan Zhu
 http://codingcat.me
  
  
 On Friday, January 9, 2015 at 4:25 AM, Tathagata Das wrote:
  
   It was not really meant to be hidden, so it's essentially a case of the 
   documentation being insufficient. This code has not gotten much attention 
   for a while, so it could have bugs. If you find any and submit a fix for 
   them, I am happy to take a look!
   
  TD
   
  On Thu, Jan 8, 2015 at 6:33 PM, Nan Zhu zhunanmcg...@gmail.com 
  (mailto:zhunanmcg...@gmail.com) wrote:
   Hi, TD and other streaming developers,

   When I look at the implementation of actor-based receiver 
   (ActorReceiver.scala), I found that there are several messages which are 
   not mentioned in the document  

    case props: Props =>
      val worker = context.actorOf(props)
      logInfo("Started receiver worker at: " + worker.path)
      sender ! worker

    case (props: Props, name: String) =>
      val worker = context.actorOf(props, name)
      logInfo("Started receiver worker at: " + worker.path)
      sender ! worker

    case _: PossiblyHarmful => hiccups.incrementAndGet()

    case _: Statistics =>
      val workers = context.children
      sender ! Statistics(n.get, workers.size, hiccups.get, 
      workers.mkString("\n"))

    Is it hidden intentionally, is the documentation incomplete, or did I miss 
    something? And are the handlers of these messages buggy? E.g. when we start 
    a new worker, we don't increase n (the counter of children), and n and 
    hiccups are unnecessarily made AtomicIntegers?

   Best,

   --  
   Nan Zhu
   http://codingcat.me


   
   
   
  



missing document of several messages in actor-based receiver?

2015-01-08 Thread Nan Zhu
Hi, TD and other streaming developers,

When I look at the implementation of actor-based receiver 
(ActorReceiver.scala), I found that there are several messages which are not 
mentioned in the document  

case props: Props =>
  val worker = context.actorOf(props)
  logInfo("Started receiver worker at: " + worker.path)
  sender ! worker

case (props: Props, name: String) =>
  val worker = context.actorOf(props, name)
  logInfo("Started receiver worker at: " + worker.path)
  sender ! worker

case _: PossiblyHarmful => hiccups.incrementAndGet()

case _: Statistics =>
  val workers = context.children
  sender ! Statistics(n.get, workers.size, hiccups.get, workers.mkString("\n"))

Is it hidden intentionally, is the documentation incomplete, or did I miss 
something? And are the handlers of these messages buggy? E.g. when we start a 
new worker, we don't increase n (the counter of children), and n and hiccups 
are unnecessarily made AtomicIntegers?
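A minimal stand-in for the supervisor state makes the complaint concrete (no Akka here; the class and method names are illustrative, not from ActorReceiver.scala): starting a worker should also bump the child counter n, which the handlers quoted above never do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the corrected bookkeeping: track started children (n),
// hiccups, and the live worker list consistently.
public class SupervisorState {
    private final AtomicInteger n = new AtomicInteger(0);       // children started
    private final AtomicInteger hiccups = new AtomicInteger(0); // PossiblyHarmful messages seen
    private final List<String> workers = new ArrayList<>();

    public String startWorker(String name) {
        workers.add(name);
        n.incrementAndGet(); // the step missing from the quoted handlers
        return name;
    }

    public void hiccup() { hiccups.incrementAndGet(); }

    public int started() { return n.get(); }
    public int workerCount() { return workers.size(); }
    public int hiccupCount() { return hiccups.get(); }
}
```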

Best,

--  
Nan Zhu
http://codingcat.me



Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Nan Zhu
BTW, this PR https://github.com/apache/spark/pull/2524 is related to a 
blocker-level bug, 

and it is actually close to being merged (it has been reviewed for several rounds).

I would appreciate it if anyone could continue the process, 

@mateiz 

-- 
Nan Zhu
http://codingcat.me


On Thursday, November 20, 2014 at 10:17 AM, Corey Nolet wrote:

 I was actually about to post this myself- I have a complex join that could
 benefit from something like a GroupComparator vs having to do multiple
 groupBy operations. This is probably the wrong thread for a full discussion
 on this but I didn't see a JIRA ticket for this or anything similar- any
 reasons why this would not make sense given Spark's design?
 
 On Thu, Nov 20, 2014 at 9:39 AM, Madhu ma...@madhu.com 
 (mailto:ma...@madhu.com) wrote:
 
  Thanks Patrick.
  
  I've been testing some 1.2 features, looks good so far.
  I have some example code that I think will be helpful for certain MR-style
  use cases (secondary sort).
  Can I still add that to the 1.2 documentation, or is that frozen at this
  point?
  
  
  
  -
  --
  Madhu
  https://www.linkedin.com/in/msiddalingaiah
  --
  View this message in context:
  http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-2-0-Release-Preview-Posted-tp9400p9449.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com (http://Nabble.com).
  
  
 
 
 
 




Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Nan Zhu
+1, with a question

Will these maintainers clean up the pending PRs once we start to apply this 
model? There are some patches that have been sitting there without being 
merged; some of them are periodically maintained (rebased, pinged, etc.), 
while the others have just been phased out.  

Best,  

--  
Nan Zhu


On Wednesday, November 5, 2014 at 8:33 PM, Matei Zaharia wrote:

 BTW, my own vote is obviously +1 (binding).
  
 Matei
  
  On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com 
  (mailto:matei.zaha...@gmail.com) wrote:
   
  Hi all,
   
  I wanted to share a discussion we've been having on the PMC list, as well 
  as call for an official vote on it on a public list. Basically, as the 
  Spark project scales up, we need to define a model to make sure there is 
  still great oversight of key components (in particular internal 
  architecture and public APIs), and to this end I've proposed implementing a 
  maintainer model for some of these components, similar to other large 
  projects.
   
  As background on this, Spark has grown a lot since joining Apache. We've 
  had over 80 contributors/month for the past 3 months, which I believe makes 
  us the most active project in contributors/month at Apache, as well as over 
  500 patches/month. The codebase has also grown significantly, with new 
  libraries for SQL, ML, graphs and more.
   
  In this kind of large project, one common way to scale development is to 
  assign maintainers to oversee key components, where each patch to that 
  component needs to get sign-off from at least one of its maintainers. Most 
  existing large projects do this -- at Apache, some large ones with this 
  model are CloudStack (the second-most active project overall), Subversion, 
  and Kafka, and other examples include Linux and Python. This is also 
  by-and-large how Spark operates today -- most components have a de-facto 
  maintainer.
   
  IMO, adopting this model would have two benefits:
   
  1) Consistent oversight of design for that component, especially regarding 
  architecture and API. This process would ensure that the component's 
  maintainers see all proposed changes and consider them to fit together in a 
  good way.
   
  2) More structure for new contributors and committers -- in particular, it 
  would be easy to look up who’s responsible for each module and ask them for 
  reviews, etc, rather than having patches slip between the cracks.
   
  We'd like to start with in a light-weight manner, where the model only 
  applies to certain key components (e.g. scheduler, shuffle) and user-facing 
  APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand 
  it if we deem it useful. The specific mechanics would be as follows:
   
  - Some components in Spark will have maintainers assigned to them, where 
  one of the maintainers needs to sign off on each patch to the component.
  - Each component with maintainers will have at least 2 maintainers.
  - Maintainers will be assigned from the most active and knowledgeable 
  committers on that component by the PMC. The PMC can vote to add / remove 
  maintainers, and maintained components, through consensus.
  - Maintainers are expected to be active in responding to patches for their 
  components, though they do not need to be the main reviewers for them (e.g. 
  they might just sign off on architecture / API). To prevent inactive 
  maintainers from blocking the project, if a maintainer isn't responding in 
  a reasonable time period (say 2 weeks), other committers can merge the 
  patch, and the PMC will want to discuss adding another maintainer.
   
  If you'd like to see examples for this model, check out the following 
  projects:
   - CloudStack: 
   https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
   - Subversion: https://subversion.apache.org/docs/community-guide/roles.html 
   
  Finally, I wanted to list our current proposal for initial components and 
  maintainers. It would be good to get feedback on other components we might 
  add, but please note that personnel discussions (e.g. I don't think Matei 
  should maintain *that* component) should only happen on the private list. 
  The initial components were chosen to include all public APIs and the main 
  core components, and the maintainers were chosen from the most active 
  contributors to those modules.
   
  - Spark core public API: Matei, Patrick, Reynold
  - Job scheduler: Matei, Kay, Patrick
  - Shuffle and network: Reynold, Aaron, Matei
  - Block manager: Reynold, Aaron
  - YARN: Tom, Andrew Or
  - Python: Josh, Matei
  - MLlib: Xiangrui, Matei
  - SQL: Michael, Reynold
  - Streaming: TD, Matei
  - GraphX: Ankur, Joey, Reynold
   
  I'd like to formally call a [VOTE] on this model, to last 72 hours. The 
  [VOTE

Re: serialVersionUID incompatible error in class BlockManagerId

2014-10-24 Thread Nan Zhu
In my experience, there are more issues than just the BlockManager when you 
try to run a Spark application whose build version differs from your 
cluster's.  

I once tried to make a JDBC server built from branch-jdbc-1.0 run with a 
branch-1.0 cluster; no workaround exists, I just had to replace the cluster 
jar with the branch-jdbc-1.0 jar file.

Best,  

--  
Nan Zhu


On Friday, October 24, 2014 at 9:23 PM, Josh Rosen wrote:

 Are all processes (Master, Worker, Executors, Driver) running the same Spark 
 build?  This error implies that you’re seeing protocol / binary 
 incompatibilities between your Spark driver and cluster.
  
 Spark is API-compatibile across the 1.x series, but we don’t make binary 
 link-level compatibility guarantees: 
 https://cwiki.apache.org/confluence/display/SPARK/Spark+Versioning+Policy.  
 This means that your Spark driver’s runtime classpath should use the same 
 version of Spark that’s installed on your cluster.  You can compile against a 
 different API-compatible version of Spark, but the runtime versions must 
 match across all components.
  
 To fix this issue, I’d check that you’ve run the “package” and “assembly” 
 phases and that your Spark cluster is using this updated version.
  
 - Josh
  
 On October 24, 2014 at 6:17:26 PM, Qiuzhuang Lian (qiuzhuang.l...@gmail.com 
 (mailto:qiuzhuang.l...@gmail.com)) wrote:
  
 Hi,  
  
 I update git today and when connecting to spark cluster, I got  
 the serialVersionUID incompatible error in class BlockManagerId.  
  
 Here is the log,  
  
 Shouldn't we give BlockManagerId a constant serialVersionUID to avoid  
 this?  
  
 Thanks,  
 Qiuzhuang  
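A stdlib-only sketch of the suggestion above (the `NodeId` class and its fields are hypothetical stand-ins for BlockManagerId, not Spark code): pinning serialVersionUID keeps the stream class descriptor stable across recompiles, which is what the InvalidClassException below is complaining about.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerialVersionDemo {
    public static class NodeId implements Serializable {
        // Pinned explicitly instead of letting the compiler generate it,
        // so recompiled classes keep a matching stream descriptor.
        private static final long serialVersionUID = 1L;
        public final String host;
        public final int port;
        public NodeId(String host, int port) { this.host = host; this.port = port; }
    }

    // Serialize and deserialize through an in-memory buffer.
    public static NodeId roundTrip(NodeId id) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(id);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                return (NodeId) in.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```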
  
 scala> val rdd = sc.parallelize(1 to 1000)
 14/10/25 09:10:48 ERROR  
 Remoting: org.apache.spark.storage.BlockManagerId; local class  
 incompatible: stream classdesc serialVersionUID = 2439208141545036836,  
 local class serialVersionUID = 4657685702603429489  
 java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId;  
 local class incompatible: stream classdesc serialVersionUID =  
 2439208141545036836, local class serialVersionUID = 4657685702603429489  
 at  
 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)  
 at  
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)  
 at  
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)  
 at  
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)  
 at  
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)  
 at  
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)  
 at  
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)  
 at  
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)  
 at  
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)  
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)  
 at  
 akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)  
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)  
 at  
 akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)  
 at  
 akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
   
 at scala.util.Try$.apply(Try.scala:161)  
 at  
 akka.serialization.Serialization.deserialize(Serialization.scala:98)  
 at  
 akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)  
 at  
 akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)  
 at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)  
 at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)  
 at  
 akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937) 
  
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)  
 at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)  
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)  
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)  
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)  
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)  
 at  
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   
 at  
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)  
 at  
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   
 at  
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)  
 at  
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
   
 14/10/25 09:10:48 ERROR SparkDeploySchedulerBackend: Asked to remove non  
 existant executor 1  
 0014/10/25 09:11:21 ERROR Remoting:  
 org.apache.spark.storage.BlockManagerId; local class incompatible: stream  
 classdesc serialVersionUID = 2439208141545036836, local class  
 serialVersionUID = 4657685702603429489  
 java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId;  
 local class

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Nan Zhu
just curious…what is this “NewSparkPullRequestBuilder”?  

Best,  

--  
Nan Zhu


On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:

  
 Hm, seems that 7u71 comes back again. Observed similar Kinesis compilation 
 error just now: 
 https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
  
  
 Checked Jenkins slave nodes, saw /usr/java/latest points to jdk1.7.0_71. 
 However, /usr/bin/javac -version says:
  
   
  Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM Corp 2000, 
  2008. All rights reserved.
   
  
  
 Which JDK is actually used by Jenkins?
  
  
 Cheng
  
  
  On 10/21/14 8:28 AM, shane knapp wrote:
   
   ok, so earlier today i installed a 2nd JDK within jenkins (7u71), which 
   fixed the SparkR build but apparently made Spark itself quite unhappy. i 
   removed that JDK, triggered a build ( 
   https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
    and it compiled kinesis w/o dying a fiery death. apparently 7u71 is 
   stricter when compiling. sad times. sorry about that! shane
   
   On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com 
   (mailto:pwend...@gmail.com) wrote:
    The failure is in the Kinesis component, can you reproduce this if you 
    build with -Pkinesis-asl? - Patrick
    
    On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu 
    (mailto:skn...@berkeley.edu) wrote:
     hmm, strange. i'll take a look.
     
     On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com 
     (mailto:zhunanmcg...@gmail.com) wrote:
      yes, I can compile locally, too but it seems that Jenkins is not 
      happy now... 
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ 
      All failed to compile. Best, -- Nan Zhu
      
      On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
       I performed build on latest master branch but didn't get 
       compilation error. FYI
       
       On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com 
       (mailto:zhunanmcg...@gmail.com) wrote:
        Hi, I just submitted a patch 
        https://github.com/apache/spark/pull/2864/files 
        with one line change but the Jenkins told me it's failed to 
        compile on the unrelated files? 
        https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console 
        Best, Nan
  
  
  




Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Nan Zhu
Weird… two builds (one triggered by the new builder, one by the old) were 
executed on the same node, amp-jenkins-slave-01; one compiles, the other doesn't…

Best,  

--  
Nan Zhu


On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote:

 seems that all PRs built by NewSparkPRBuilder suffer from 7u71, while 
 SparkPRBuilder is working fine
  
 Best,  
  
 --  
 Nan Zhu
  
  
 On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:
  
  It's a new pull request builder written by Josh, integrated into our 
  state-of-the-art PR dashboard :)
   
  On 10/21/14 9:33 PM, Nan Zhu wrote:
   just curious…what is this “NewSparkPullRequestBuilder”?  

   Best,  

   --   
   Nan Zhu


   On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:

 
Hm, seems that 7u71 comes back again. Observed similar Kinesis 
compilation error just now: 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull
 
 
Checked Jenkins slave nodes, saw /usr/java/latest points to 
jdk1.7.0_71. However, /usr/bin/javac -version says:
 
  
 Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM Corp 
 2000, 2008. All rights reserved.
  
 
 
Which JDK is actually used by Jenkins?
 
 
Cheng
 
 
     On 10/21/14 8:28 AM, shane knapp wrote:
      
      ok, so earlier today i installed a 2nd JDK within jenkins (7u71), 
      which fixed the SparkR build but apparently made Spark itself quite 
      unhappy. i removed that JDK, triggered a build ( 
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
       and it compiled kinesis w/o dying a fiery death. apparently 7u71 is 
      stricter when compiling. sad times. sorry about that! shane
      
      On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com 
      (mailto:pwend...@gmail.com) wrote:
       The failure is in the Kinesis component, can you reproduce this if 
       you build with -Pkinesis-asl? - Patrick
       
       On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu 
       (mailto:skn...@berkeley.edu) wrote:
        hmm, strange. i'll take a look.
        
        On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com 
        (mailto:zhunanmcg...@gmail.com) wrote:
         yes, I can compile locally, too but it seems that Jenkins is 
         not happy now... 
         https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/ 
         All failed to compile. Best, -- Nan Zhu
         
         On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
          I performed build on latest master branch but didn't get 
          compilation error. FYI
          
          On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com 
          (mailto:zhunanmcg...@gmail.com) wrote:
           Hi, I just submitted a patch 
           https://github.com/apache/spark/pull/2864/files 
           with one line change but the Jenkins told me it's failed to 
           compile on the unrelated files? 
           https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console 
           Best, Nan


   
  



Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Nan Zhu
I agree with Sean

I just compiled spark core successfully with 7u71 in Mac OS X

On Tue, Oct 21, 2014 at 1:11 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Ah, that makes sense.  I had forgotten that there was a JIRA for this:

 https://issues.apache.org/jira/browse/SPARK-4021

 On October 21, 2014 at 10:08:58 AM, Patrick Wendell (pwend...@gmail.com)
 wrote:

 Josh - the errors that broke our build indicated that JDK5 was being
 used. Somehow the upgrade caused our build to use a much older Java
 version. See the JIRA for more details.

 On Tue, Oct 21, 2014 at 10:05 AM, Josh Rosen rosenvi...@gmail.com
 wrote:
  I find it concerning that there's a JDK version that breaks our build,
 since
  we're supposed to support Java 7. Is 7u71 an upgrade or downgrade from
 the
  JDK that we used before? Is there an easy way to fix our build so that
 it
  compiles with 7u71's stricter settings?
 
  I'm not sure why the New PRB is failing here. It was originally
 created
  as a clone of the main pull request builder job. I checked the
 configuration
  history and confirmed that there aren't any settings that we've
 forgotten to
  copy over (e.g. their configurations haven't diverged), so I'm not sure
  what's causing this.
 
  - Josh
 
  On October 21, 2014 at 6:35:39 AM, Nan Zhu (zhunanmcg...@gmail.com)
 wrote:
 
  weird... two builds (one triggered by New, one triggered by Old) were
  executed on the same node, amp-jenkins-slave-01; one compiles, one does
  not...
 
  Best,
 
  --
  Nan Zhu
 
 
  On Tuesday, October 21, 2014 at 9:39 AM, Nan Zhu wrote:
 
  seems that all PRs built by NewSparkPRBuilder suffer from 7u71, while
  SparkPRBuilder is working fine
 
  Best,
 
  --
  Nan Zhu
 
 
  On Tuesday, October 21, 2014 at 9:22 AM, Cheng Lian wrote:
 
   It's a new pull request builder written by Josh, integrated into our
   state-of-the-art PR dashboard :)
  
   On 10/21/14 9:33 PM, Nan Zhu wrote:
just curious...what is this NewSparkPullRequestBuilder?
   
Best,
   
--
Nan Zhu
   
   
On Tuesday, October 21, 2014 at 8:30 AM, Cheng Lian wrote:
   

 Hm, seems that 7u71 comes back again. Observed similar Kinesis
 compilation error just now:

 https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/410/consoleFull


 Checked Jenkins slave nodes, saw /usr/java/latest points to
 jdk1.7.0_71. However, /usr/bin/javac -version says:

 
  Eclipse Java Compiler 0.894_R34x, 3.4.2 release, Copyright IBM
  Corp 2000, 2008. All rights reserved.
 


 Which JDK is actually used by Jenkins?


 Cheng


  On 10/21/14 8:28 AM, shane knapp wrote:
 
   ok, so earlier today i installed a 2nd JDK within jenkins (7u71),
   which fixed the SparkR build but apparently made Spark itself quite
   unhappy. i removed that JDK, triggered a build (
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21943/console),
   and it compiled kinesis w/o dying a fiery death. apparently 7u71 is
   stricter when compiling. sad times. sorry about that! shane
 
   On Mon, Oct 20, 2014 at 5:16 PM, Patrick Wendell pwend...@gmail.com
   (mailto:pwend...@gmail.com) wrote:
    The failure is in the Kinesis component, can you reproduce this
    if you build with -Pkinesis-asl? - Patrick
 
    On Mon, Oct 20, 2014 at 5:08 PM, shane knapp skn...@berkeley.edu
    (mailto:skn...@berkeley.edu) wrote:
     hmm, strange. i'll take a look.
 
     On Mon, Oct 20, 2014 at 5:11 PM, Nan Zhu zhunanmcg...@gmail.com
     (mailto:zhunanmcg...@gmail.com) wrote:
      yes, I can compile locally, too but it seems that Jenkins is
      not happy now...
      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/
      All failed to compile. Best, -- Nan Zhu
 
      On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:
       I performed build on latest master branch but didn't get
       compilation error. FYI
 
       On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu
       zhunanmcg...@gmail.com (mailto:zhunanmcg...@gmail.com) wrote:
        Hi, I just submitted a patch
        https://github.com/apache/spark/pull/2864/files
        with one line change but the Jenkins told me it's failed
        to compile on the unrelated files?
        https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
        Best, Nan



something wrong with Jenkins or something untested merged?

2014-10-20 Thread Nan Zhu
Hi,

I just submitted a patch https://github.com/apache/spark/pull/2864/files
with one line change

but the Jenkins told me it's failed to compile on the unrelated files?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console


Best,

Nan


Re: something wrong with Jenkins or something untested merged?

2014-10-20 Thread Nan Zhu
yes, I can compile locally, too 

but it seems that Jenkins is not happy now...

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/

All failed to compile

Best, 

-- 
Nan Zhu


On Monday, October 20, 2014 at 7:56 PM, Ted Yu wrote:

 I performed build on latest master branch but didn't get compilation error.
 
 FYI
 
 On Mon, Oct 20, 2014 at 3:51 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi,
  
  I just submitted a patch https://github.com/apache/spark/pull/2864/files
  with one line change
  
  but the Jenkins told me it's failed to compile on the unrelated files?
  
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21935/console
  
  
  Best,
  
  Nan
 



Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Nan Zhu
Great! Congratulations! 

-- 
Nan Zhu


On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:

 Brilliant stuff ! Congrats all :-)
 This is indeed really heartening news !
 
 Regards,
 Mridul
 
 
 On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com 
 (mailto:matei.zaha...@gmail.com) wrote:
  Hi folks,
  
  I interrupt your regularly scheduled user / dev list to bring you some 
  pretty cool news for the project, which is that we've been able to use 
  Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
  faster on 10x fewer nodes. There's a detailed writeup at 
  http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
   Summary: while Hadoop MapReduce held last year's 100 TB world record by 
  sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
  206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
  
  I want to thank Reynold Xin for leading this effort over the past few 
  weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
  Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
  providing the machines to make this possible. Finally, this result would of 
  course not be possible without the many many other contributions, testing 
  and feature requests from throughout the community.
  
  For an engine to scale from these multi-hour petabyte batch jobs down to 
  100-millisecond streaming and interactive queries is quite uncommon, and 
  it's thanks to all of you folks that we are able to make this happen.
  
  Matei
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
  (mailto:dev-unsubscr...@spark.apache.org)
  For additional commands, e-mail: dev-h...@spark.apache.org 
  (mailto:dev-h...@spark.apache.org)
  
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: jenkins downtime/system upgrade wednesday morning, 730am PDT

2014-09-29 Thread Nan Zhu
Just noticed these lines in the jenkins log 

=========================================================================
Running Apache RAT checks
=========================================================================
Attempting to fetch rat
Launching rat from /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
Error: Invalid or corrupt jarfile /home/jenkins/workspace/SparkPullRequestBuilder/lib/apache-rat-0.10.jar
RAT checks passed.

Something wrong?

Best, 

-- 
Nan Zhu


On Monday, September 29, 2014 at 4:43 PM, shane knapp wrote:

 happy monday, everyone!
 
 remember a few weeks back when i upgraded jenkins, and unwittingly began
 DOSing our system due to massive log spam?
 
 well, that bug has been fixed w/the current release and i'd like to get our
 logging levels back to something more verbose than we have now.
 
 downtime will be from 730am-1000am PDT (i do expect this to be done well
 before 1000am)
 
 the update will be from 1.578 - 1.582
 
 changelog here: http://jenkins-ci.org/changelog
 
 please let me know if there are any questions or concerns. thanks!
 
 shane, your friendly devops engineer 



Re: executorAdded event to DAGScheduler

2014-09-26 Thread Nan Zhu
just a quick reply: we cannot start two executors on the same host for a single 
application in the standard deployment (one worker per machine)  

I’m not sure whether it will create an issue when you have multiple workers on 
the same host, as submitWaitingStages is called everywhere and I have never 
tried such a deployment mode

Best,  

--  
Nan Zhu


On Friday, September 26, 2014 at 8:02 AM, praveen seluka wrote:

 Can someone explain the motivation behind passing executorAdded event to 
 DAGScheduler ? DAGScheduler does submitWaitingStages when executorAdded 
 method is called by TaskSchedulerImpl. I see some issue in the below code,
  
 TaskSchedulerImpl.scala code:
 
   if (!executorsByHost.contains(o.host)) {
     executorsByHost(o.host) = new HashSet[String]()
     executorAdded(o.executorId, o.host)
     newExecAvail = true
   }
  
  
 Note that executorAdded is called only when there is a new host, and not for 
 every new executor. For instance, there can be two executors on the same host, 
 but DAGScheduler's executorAdded is notified only for a new host - so only 
 once in this case. If this is indeed an issue, I would like to submit a patch 
 for this quickly. [cc Andrew Or]
  
 - Praveen
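
The registration logic quoted above can be sketched in a small simulation. This is plain Python, not Spark's actual code; `FakeDAGScheduler` and `register_offers` are invented names for illustration. It shows why a second executor on an already-known host never produces an executorAdded notification:

```python
from collections import defaultdict

class FakeDAGScheduler:
    """Stand-in for DAGScheduler; records executorAdded notifications."""
    def __init__(self):
        self.notified = []

    def executor_added(self, exec_id, host):
        self.notified.append((exec_id, host))

def register_offers(offers, dag):
    """Mimics the quoted TaskSchedulerImpl snippet: notify only on new hosts."""
    executors_by_host = defaultdict(set)
    for exec_id, host in offers:
        if host not in executors_by_host:
            dag.executor_added(exec_id, host)   # fires once per *host*
        executors_by_host[host].add(exec_id)
    return executors_by_host

dag = FakeDAGScheduler()
# two executors land on the same host
register_offers([("exec-1", "host-a"), ("exec-2", "host-a")], dag)
print(dag.notified)   # [('exec-1', 'host-a')] -- exec-2 is never reported
```

Under this toy model the DAGScheduler hears about one executor, not two, which is exactly the concern raised in the email (whether that matters in practice depends on how submitWaitingStages is triggered elsewhere).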
  
  



Re: A couple questions about shared variables

2014-09-24 Thread Nan Zhu
I proposed a fix https://github.com/apache/spark/pull/2524  

Glad to receive feedbacks  

--  
Nan Zhu


On Tuesday, September 23, 2014 at 9:06 PM, Sandy Ryza wrote:

 Filed https://issues.apache.org/jira/browse/SPARK-3642 for documenting these 
 nuances.
  
 -Sandy
  
 On Mon, Sep 22, 2014 at 10:36 AM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  I see, thanks for pointing this out  
   
   
  --  
  Nan Zhu
   
   
  On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote:
   
   MapReduce counters do not count duplications.  In MapReduce, if a task 
   needs to be re-run, the value of the counter from the second task 
   overwrites the value from the first task.

   -Sandy

   On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu zhunanmcg...@gmail.com 
   (mailto:zhunanmcg...@gmail.com) wrote:
If you think it as necessary to fix, I would like to resubmit that PR 
(seems to have some conflicts with the current DAGScheduler)  
 
My suggestion is to make it as an option in accumulator, e.g. some 
algorithms utilizing accumulator for result calculation, it needs a 
deterministic accumulator, while others implementing something like 
Hadoop counters may need the current implementation (count everything 
happened, including the duplications)
 
Your thoughts?  
 
--  
Nan Zhu
 
 
On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
 
 Hmm, good point, this seems to have been broken by refactorings of 
 the scheduler, but it worked in the past. Basically the solution is 
 simple -- in a result stage, we should not apply the update for each 
 task ID more than once -- the same way we don't call 
 job.listener.taskSucceeded more than once. Your PR also tried to 
 avoid this for resubmitted shuffle stages, but I don't think we need 
 to do that necessarily (though we could).
  
 Matei  
  
 On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com)) wrote:
  
  Hi, Matei,  
   
  Can you give some hint on how the current implementation guarantee 
  the accumulator is only applied for once?  
   
  There is a pending PR trying to achieving this 
  (https://github.com/apache/spark/pull/228/files), but from the 
  current implementation, I didn’t see this has been done? (maybe I 
  missed something)  
   
  Best,  
   
  --   
  Nan Zhu
   
   
  On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
   
   Hey Sandy,

   On September 20, 2014 at 8:50:54 AM, Sandy Ryza 
   (sandy.r...@cloudera.com (mailto:sandy.r...@cloudera.com)) wrote: 


   Hey All,   

   A couple questions came up about shared variables recently, and I 
   wanted to   
   confirm my understanding and update the doc to be a little more 
   clear.  

   *Broadcast variables*   
   Now that tasks data is automatically broadcast, the only 
   occasions where it  
   makes sense to explicitly broadcast are:  
   * You want to use a variable from tasks in multiple stages.  
   * You want to have the variable stored on the executors in 
   deserialized  
   form.  
   * You want tasks to be able to modify the variable and have those 

   modifications take effect for other tasks running on the same 
   executor  
   (usually a very bad idea).  

   Is that right?   
   Yeah, pretty much. Reason 1 above is probably the biggest, but 2 
   also matters. (We might later factor tasks in a different way to 
   avoid 2, but it's hard due to things like Hadoop JobConf objects 
   in the tasks).


   *Accumulators*   
   Values are only counted for successful tasks. Is that right? 
   KMeans seems  
   to use it in this way. What happens if a node goes away and 
   successful  
   tasks need to be resubmitted? Or the stage runs again because a 
   different  
   job needed it.  
   Accumulators are guaranteed to give a deterministic result if you 
   only increment them in actions. For each result stage, the 
   accumulator's update from each task is only applied once, even if 
   that task runs multiple times. If you use accumulators in 
   transformations (i.e. in a stage that may be part of multiple 
   jobs), then you may see multiple updates, from each run. This is 
   kind of confusing but it was useful for people who wanted to use 
   these for debugging.

   Matei  





   thanks,   
   Sandy  
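
The "apply each result-stage task's update at most once" behaviour Matei describes above can be illustrated with a toy accumulator. This is a simplified model, not Spark's implementation; the class and method names are invented, and deduplication is keyed by partition id for brevity:

```python
class Accumulator:
    def __init__(self):
        self.value = 0
        self._applied = set()       # partition ids already counted

    def add_naive(self, partition_id, delta):
        # counts every run, so a resubmitted task inflates the total
        self.value += delta

    def add_once(self, partition_id, delta):
        # dedup: a re-run of the same partition is ignored
        if partition_id not in self._applied:
            self._applied.add(partition_id)
            self.value += delta

naive, once = Accumulator(), Accumulator()
# partition 0 succeeds, then is re-run after an executor is lost
for pid in [0, 1, 0]:
    naive.add_naive(pid, 10)
    once.add_once(pid, 10)

print(naive.value)  # 30 -- the duplicate run is counted twice
print(once.value)   # 20 -- deterministic despite the re-run
```

The "naive" path is roughly the Hadoop-counter-style semantics Nan mentions (count everything, duplicates included); the "once" path is the deterministic semantics for result stages.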



   
   
 

   
  



do MIMA checking before all test cases start?

2014-09-24 Thread Nan Zhu
Hi, all  

It seems that, currently, Jenkins runs the MIMA checks after all test cases have 
finished. IIRC, during the first months after we introduced MIMA, we ran the 
MIMA checks before running the test cases.

What’s the motivation for changing this behaviour?

In my opinion, if you have some binary compatibility issues, you just need to 
make some minor changes, but in the current setup, you can only find out whether 
your change works after all test cases have finished (1 hour later…)

Best,  

--  
Nan Zhu



Re: do MIMA checking before all test cases start?

2014-09-24 Thread Nan Zhu
yeah, I tried that, but there is always an issue when I run dev/mima:

it always gives me some binary compatibility errors on the Java API part…

so I have to wait for Jenkins’ result when fixing MIMA issues

--  
Nan Zhu


On Thursday, September 25, 2014 at 12:04 AM, Patrick Wendell wrote:

 Have you considered running the mima checks locally? We prefer people
 not use Jenkins for very frequent checks since it takes resources away
 from other people trying to run tests.
  
 On Wed, Sep 24, 2014 at 6:44 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi, all
   
  It seems that, currently, Jenkins makes MIMA checking after all test cases 
  have finished, IIRC, during the first months we introduced MIMA, we do the 
  MIMA checking before running test cases
   
  What's the motivation to adjust this behaviour?
   
  In my opinion, if you have some binary compatibility issues, you just need 
  to do some minor changes, but in the current environment, you can only get 
  if your change works after all test cases finished (1 hour later...)
   
  Best,
   
  --
  Nan Zhu
   
  
  
  




Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
If you think it is necessary to fix, I would like to resubmit that PR (it seems 
to have some conflicts with the current DAGScheduler).

My suggestion is to make it an option in the accumulator: e.g., some algorithms 
using an accumulator for result calculation need a deterministic accumulator, 
while others implementing something like Hadoop counters may need the current 
implementation (count everything that happened, including the duplications).

Your thoughts?  

--  
Nan Zhu


On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:

 Hmm, good point, this seems to have been broken by refactorings of the 
 scheduler, but it worked in the past. Basically the solution is simple -- in 
 a result stage, we should not apply the update for each task ID more than 
 once -- the same way we don't call job.listener.taskSucceeded more than once. 
 Your PR also tried to avoid this for resubmitted shuffle stages, but I don't 
 think we need to do that necessarily (though we could).
  
 Matei  
  
 On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com)) wrote:
  
  Hi, Matei,  
   
  Can you give some hint on how the current implementation guarantee the 
  accumulator is only applied for once?  
   
  There is a pending PR trying to achieving this 
  (https://github.com/apache/spark/pull/228/files), but from the current 
  implementation, I didn’t see this has been done? (maybe I missed something) 
   
   
  Best,  
   
  --   
  Nan Zhu
   
   
  On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
   
   Hey Sandy,

   On September 20, 2014 at 8:50:54 AM, Sandy Ryza (sandy.r...@cloudera.com 
   (mailto:sandy.r...@cloudera.com)) wrote:  

   Hey All,   

   A couple questions came up about shared variables recently, and I wanted 
   to   
   confirm my understanding and update the doc to be a little more clear.  

   *Broadcast variables*   
   Now that tasks data is automatically broadcast, the only occasions where 
   it  
   makes sense to explicitly broadcast are:  
   * You want to use a variable from tasks in multiple stages.  
   * You want to have the variable stored on the executors in deserialized  
   form.  
   * You want tasks to be able to modify the variable and have those  
   modifications take effect for other tasks running on the same executor  
   (usually a very bad idea).  

   Is that right?   
   Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
   matters. (We might later factor tasks in a different way to avoid 2, but 
   it's hard due to things like Hadoop JobConf objects in the tasks).


   *Accumulators*   
   Values are only counted for successful tasks. Is that right? KMeans seems 

   to use it in this way. What happens if a node goes away and successful  
   tasks need to be resubmitted? Or the stage runs again because a different 

   job needed it.  
   Accumulators are guaranteed to give a deterministic result if you only 
   increment them in actions. For each result stage, the accumulator's 
   update from each task is only applied once, even if that task runs 
   multiple times. If you use accumulators in transformations (i.e. in a 
   stage that may be part of multiple jobs), then you may see multiple 
   updates, from each run. This is kind of confusing but it was useful for 
   people who wanted to use these for debugging.

   Matei  





   thanks,   
   Sandy  
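
Reasons 1 and 2 in the broadcast-variable list above (reuse across stages, one deserialized copy per executor) can be sketched with a toy cluster model. This is not Spark code; `Executor`, `run_stage_naive`, and `run_stage_broadcast` are invented names, and "shipping" is modelled as a pickle round-trip:

```python
import pickle

big_lookup = {i: i * i for i in range(1000)}

def run_stage_naive(num_tasks, data):
    """Naive path: the variable is serialized and shipped with every task."""
    ships = 0
    for _ in range(num_tasks):
        pickle.loads(pickle.dumps(data))   # per-task transfer + deserialize
        ships += 1
    return ships

class Executor:
    def __init__(self):
        self.cache = None                  # deserialized broadcast value

def run_stage_broadcast(executors, tasks_per_executor, data):
    """Broadcast path: serialized once, fetched once per executor."""
    blob = pickle.dumps(data)              # serialized once by the driver
    ships = 0
    for ex in executors:
        for _ in range(tasks_per_executor):
            if ex.cache is None:           # fetched on first use only
                ex.cache = pickle.loads(blob)
                ships += 1
    return ships

executors = [Executor() for _ in range(4)]
print(run_stage_naive(100, big_lookup))                 # 100 transfers
print(run_stage_broadcast(executors, 25, big_lookup))   # 4 transfers
```

Same 100 tasks either way, but the broadcast path pays the transfer and deserialization cost once per executor rather than once per task, and the cached copy survives into later stages.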



   
   



Re: A couple questions about shared variables

2014-09-22 Thread Nan Zhu
I see, thanks for pointing this out  


--  
Nan Zhu


On Monday, September 22, 2014 at 12:08 PM, Sandy Ryza wrote:

 MapReduce counters do not count duplications.  In MapReduce, if a task needs 
 to be re-run, the value of the counter from the second task overwrites the 
 value from the first task.
  
 -Sandy
  
 On Mon, Sep 22, 2014 at 4:55 AM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  If you think it as necessary to fix, I would like to resubmit that PR 
  (seems to have some conflicts with the current DAGScheduler)  
   
  My suggestion is to make it as an option in accumulator, e.g. some 
  algorithms utilizing accumulator for result calculation, it needs a 
  deterministic accumulator, while others implementing something like Hadoop 
  counters may need the current implementation (count everything happened, 
  including the duplications)
   
  Your thoughts?  
   
  --  
  Nan Zhu
   
   
  On Sunday, September 21, 2014 at 6:35 PM, Matei Zaharia wrote:
   
   Hmm, good point, this seems to have been broken by refactorings of the 
   scheduler, but it worked in the past. Basically the solution is simple -- 
   in a result stage, we should not apply the update for each task ID more 
   than once -- the same way we don't call job.listener.taskSucceeded more 
   than once. Your PR also tried to avoid this for resubmitted shuffle 
   stages, but I don't think we need to do that necessarily (though we 
   could).

   Matei  

   On September 21, 2014 at 1:11:13 PM, Nan Zhu (zhunanmcg...@gmail.com 
   (mailto:zhunanmcg...@gmail.com)) wrote:

Hi, Matei,  
 
Can you give some hint on how the current implementation guarantee the 
accumulator is only applied for once?  
 
There is a pending PR trying to achieving this 
(https://github.com/apache/spark/pull/228/files), but from the current 
implementation, I didn’t see this has been done? (maybe I missed 
something)  
 
Best,  
 
--   
Nan Zhu
 
 
On Sunday, September 21, 2014 at 1:10 AM, Matei Zaharia wrote:
 
 Hey Sandy,
  
 On September 20, 2014 at 8:50:54 AM, Sandy Ryza 
 (sandy.r...@cloudera.com (mailto:sandy.r...@cloudera.com)) wrote:  
  
 Hey All,   
  
 A couple questions came up about shared variables recently, and I 
 wanted to   
 confirm my understanding and update the doc to be a little more 
 clear.  
  
 *Broadcast variables*   
 Now that tasks data is automatically broadcast, the only occasions 
 where it  
 makes sense to explicitly broadcast are:  
 * You want to use a variable from tasks in multiple stages.  
 * You want to have the variable stored on the executors in 
 deserialized  
 form.  
 * You want tasks to be able to modify the variable and have those  
 modifications take effect for other tasks running on the same 
 executor  
 (usually a very bad idea).  
  
 Is that right?   
 Yeah, pretty much. Reason 1 above is probably the biggest, but 2 also 
 matters. (We might later factor tasks in a different way to avoid 2, 
 but it's hard due to things like Hadoop JobConf objects in the tasks).
  
  
 *Accumulators*   
 Values are only counted for successful tasks. Is that right? KMeans 
 seems  
 to use it in this way. What happens if a node goes away and 
 successful  
 tasks need to be resubmitted? Or the stage runs again because a 
 different  
 job needed it.  
 Accumulators are guaranteed to give a deterministic result if you 
 only increment them in actions. For each result stage, the 
 accumulator's update from each task is only applied once, even if 
 that task runs multiple times. If you use accumulators in 
 transformations (i.e. in a stage that may be part of multiple jobs), 
 then you may see multiple updates, from each run. This is kind of 
 confusing but it was useful for people who wanted to use these for 
 debugging.
  
 Matei  
  
  
  
  
  
 thanks,   
 Sandy  
  
  
  
 
 
   
  



Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
Hi,   

Can you attach more logs to see if there is some entry from ContextCleaner?

I met a very similar issue before… but haven’t gotten it resolved  

Best,  

--  
Nan Zhu


On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:

 Dear All,  
  
 Not sure if this is a false alarm, but I wanted to raise this to understand 
 what is happening.  
  
 I am testing the Kafka Receiver which I have written 
 (https://github.com/dibbhatt/kafka-spark-consumer), which is basically a 
 low-level Kafka Consumer implementing custom Receivers for every Kafka topic 
 partition and pulling data in parallel. Individual streams from all topic 
 partitions are then merged to create a Union stream which is used for further 
 processing.
  
 The custom Receiver works fine under normal load with no issues. But when I 
 tested it with a huge backlog of messages from Kafka (50 million+ messages), 
 I saw a couple of major issues in Spark Streaming, and wanted to get some 
 opinions on them.
  
 I am using latest Spark 1.1 taken from the source and built it. Running in 
 Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode.
  
 Below are the two main questions I have:
  
 1. When I run Spark Streaming with my Kafka Consumer against a huge backlog 
 in Kafka (around 50 million messages), Spark is completely busy performing the 
 receiving task and hardly schedules any processing task. Can you let me know 
 if this is expected? If there is a large backlog, Spark will take a long time 
 pulling the messages, but why is Spark not doing any processing? Is it because 
 of a resource limitation (say, all cores are busy pulling) or is it by design? 
 I am setting the executor-memory to 10G and driver-memory to 4G.
  
 2. This issue seems to be more serious. I have attached the driver trace with 
 this email. I can see, very frequently, that blocks are selected to be 
 removed... this kind of entry is all over the place. But when a block is 
 removed, the problem below happens. Maybe this issue causes issue 1, that 
 no jobs are getting processed...
  
  
 INFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for dropping
 INFO : org.apache.spark.storage.BlockManager - Dropping block 
 input-0-1410443074600 from memory
 INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600 of 
 size 12651900 dropped from memory (free 21220667)
 INFO : org.apache.spark.storage.BlockManagerInfo - Removed 
 input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 
 (http://ip-10-252-5-113.asskickery.us:53752) in memory (size: 12.1 MB, free: 
 100.6 MB)
  
 ...
  
 INFO : org.apache.spark.storage.BlockManagerInfo - Removed 
 input-0-1410443074600 on ip-10-252-5-62.asskickery.us:37033 
 (http://ip-10-252-5-62.asskickery.us:37033) in memory (size: 12.1 MB, free: 
 154.6 MB)
 ..
  
  
 WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 7.0 
 (TID 118, ip-10-252-5-62.asskickery.us 
 (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not 
 compute split, block input-0-1410443074600 not found
  
 ...
  
 INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 0.1 in stage 7.0 
 (TID 126) on executor ip-10-252-5-62.asskickery.us 
 (http://ip-10-252-5-62.asskickery.us): java.lang.Exception (Could not compute 
 split, block input-0-1410443074600 not found) [duplicate 1]
  
  
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 
 (TID 139, ip-10-252-5-62.asskickery.us 
 (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not 
 compute split, block input-0-1410443074600 not found
 org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:744)
  
  
 Regards,  
 Dibyendu
  
  
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
  
  
  
  
 Attachments:  
 - driver-trace.txt
  




Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
) at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:169)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)



--  
Nan Zhu


On Thursday, September 11, 2014 at 10:42 AM, Nan Zhu wrote:

 Hi,   
  
 Can you attach more logs to see if there is some entry from ContextCleaner?
  
 I met very similar issue before…but haven’t get resolved  
  
 Best,  
  
 --  
 Nan Zhu
  
  
 On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote:
  
  Dear All,  
   
  Not sure if this is a false alarm. But wanted to raise to this to 
  understand what is happening.  
   
  I am testing the Kafka Receiver which I have written 
  (https://github.com/dibbhatt/kafka-spark-consumer) which basically a low 
  level Kafka Consumer implemented custom Receivers for every Kafka topic 
  partitions and pulling data in parallel. Individual streams from all topic 
  partitions are then merged to create Union stream which used for further 
  processing.
   
  The custom Receiver working fine in normal load with no issues. But when I 
  tested this with huge amount of backlog messages from Kafka ( 50 million + 
  messages), I see couple of major issue in Spark Streaming. Wanted to get 
  some opinion on this
   
  I am using latest Spark 1.1 taken from the source and built it. Running in 
  Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode.
   
  Below are two main question I have..
   
  1. What I am seeing when I run the Spark Streaming with my Kafka Consumer 
  with a huge backlog in Kafka ( around 50 Million), Spark is completely busy 
  performing the Receiving task and hardly schedule any processing task. Can 
  you let me if this is expected ? If there is large backlog, Spark will take 
  long time pulling them . Why Spark not doing any processing ? Is it because 
  of resource limitation ( say all cores are busy puling ) or it is by design 
  ? I am setting the executor-memory to 10G and driver-memory to 4G .
   
  2

jenkins failed all tests?

2014-09-07 Thread Nan Zhu
Hi, all 

I just modified some documentation, 

but it still failed to pass the tests?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19950/consoleFull

Anyone can look at the problem?

Best, 

-- 
Nan Zhu



Re: jenkins failed all tests?

2014-09-07 Thread Nan Zhu
Hi, Sean, 

Thanks for the reply

Here are the updated files:

https://github.com/apache/spark/pull/2312/files 

just two md files...

Best, 

-- 
Nan Zhu


On Sunday, September 7, 2014 at 4:30 PM, Sean Owen wrote:

 It would help to point to your change. Are you sure it was only docs
 and are you sure you're rebased, submitting against the right branch?
 Jenkins is saying you are changing public APIs; it's not reporting
 test failures. But it could well be a test/Jenkins problem.
 
 On Sun, Sep 7, 2014 at 8:39 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  Hi, all
  
  I just modified some document,
  
  but still failed to pass tests?
  
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19950/consoleFull
  
  Anyone can look at the problem?
  
  Best,
  
  --
  Nan Zhu
  
 
 
 




Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Nan Zhu
+1 tested thrift server with our in-house application, everything works fine 

-- 
Nan Zhu


On Wednesday, September 3, 2014 at 4:43 PM, Matei Zaharia wrote:

 +1
 
 Matei
 
 On September 3, 2014 at 12:24:32 PM, Cheng Lian (lian.cs@gmail.com 
 (mailto:lian.cs@gmail.com)) wrote:
 
 +1. 
 
 Tested locally on OSX 10.9, built with Hadoop 2.4.1 
 
 - Checked Datanucleus jar files 
 - Tested Spark SQL Thrift server and CLI under local mode and standalone 
 cluster against MySQL backed metastore 
 
 
 
 On Wed, Sep 3, 2014 at 11:25 AM, Josh Rosen rosenvi...@gmail.com 
 (mailto:rosenvi...@gmail.com) wrote: 
 
  +1. Tested on Windows and EC2. Confirmed that the EC2 pvm-hvm switch 
  fixed the SPARK-3358 regression. 
  
  
  On September 3, 2014 at 10:33:45 AM, Marcelo Vanzin (van...@cloudera.com 
  (mailto:van...@cloudera.com)) 
  wrote: 
  
  +1 (non-binding) 
  
  - checked checksums of a few packages 
  - ran few jobs against yarn client/cluster using hadoop2.3 package 
  - played with spark-shell in yarn-client mode 
  
  On Wed, Sep 3, 2014 at 12:24 AM, Patrick Wendell pwend...@gmail.com 
  (mailto:pwend...@gmail.com) 
  wrote: 
   Please vote on releasing the following candidate as Apache Spark version 
  
  1.1.0! 
   
   The tag to be voted on is v1.1.0-rc4 (commit 2f9b2bd): 
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=2f9b2bd7844ee8393dc9c319f4fefedf95f5e460
   
   
   The release files, including signatures, digests, etc. can be found at: 
   http://people.apache.org/~pwendell/spark-1.1.0-rc4/ 
   
   Release artifacts are signed with the following key: 
   https://people.apache.org/keys/committer/pwendell.asc 
   
   The staging repository for this release can be found at: 
   https://repository.apache.org/content/repositories/orgapachespark-1031/ 
   
   The documentation corresponding to this release can be found at: 
   http://people.apache.org/~pwendell/spark-1.1.0-rc4-docs/ 
   
   Please vote on releasing this package as Apache Spark 1.1.0! 
   
   The vote is open until Saturday, September 06, at 08:30 UTC and passes if 
   a majority of at least 3 +1 PMC votes are cast. 
   
   [ ] +1 Release this package as Apache Spark 1.1.0 
   [ ] -1 Do not release this package because ... 
   
   To learn more about Apache Spark, please see 
   http://spark.apache.org/ 
   
   == Regressions fixed since RC3 == 
   SPARK-3332 - Issue with tagging in EC2 scripts 
   SPARK-3358 - Issue with regression for m3.XX instances 
   
   == What justifies a -1 vote for this release? == 
   This vote is happening very late into the QA period compared with 
   previous votes, so -1 votes should only occur for significant 
   regressions from 1.0.2. Bugs already present in 1.0.X will not block 
   this release. 
   
   == What default changes should I be aware of? == 
   1. The default value of spark.io.compression.codec is now snappy 
   -- Old behavior can be restored by switching to lzf 
   
   2. PySpark now performs external spilling during aggregations. 
   -- Old behavior can be restored by setting spark.shuffle.spill to 
   
  
  false. 
   
   3. PySpark uses a new heuristic for determining the parallelism of 
   shuffle operations. 
   -- Old behavior can be restored by setting 
   spark.default.parallelism to the number of cores in the cluster. 
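
The three default changes above can be reverted with explicit settings; a sketch, with property names and old values taken from the notes above (the parallelism value is a placeholder for the cluster's total core count):

```scala
import org.apache.spark.SparkConf

// Sketch: restore pre-1.1.0 behavior for the three defaults listed above.
val conf = new SparkConf()
  .set("spark.io.compression.codec", "lzf")  // 1.1.0 default is snappy
  .set("spark.shuffle.spill", "false")       // disable external spilling in PySpark
  .set("spark.default.parallelism", "16")    // placeholder: number of cores in the cluster
```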
   
   - 
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
   (mailto:dev-unsubscr...@spark.apache.org) 
   For additional commands, e-mail: dev-h...@spark.apache.org 
   (mailto:dev-h...@spark.apache.org) 
   
  
  
  
  
  -- 
  Marcelo 
  
  - 
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
  (mailto:dev-unsubscr...@spark.apache.org) 
  For additional commands, e-mail: dev-h...@spark.apache.org 
  (mailto:dev-h...@spark.apache.org) 
  
 
 
 




Re: branch-1.1 will be cut on Friday

2014-07-27 Thread Nan Zhu
Good news: we will see an official release containing the JDBC server very soon! 

Also, I have several pending PRs; could anyone continue the review process 
this week?

Avoid overwriting already-set SPARK_HOME in spark-submit: 
https://github.com/apache/spark/pull/1331

fix locality inversion bug in TaskSetManager: 
https://github.com/apache/spark/pull/1313 (Matei and Mridulm are working on it)

Allow multiple executor per worker in Standalone mode: 
https://github.com/apache/spark/pull/731 

Ensure actor is self-contained  in DAGScheduler: 
https://github.com/apache/spark/pull/637

Best, 

-- 
Nan Zhu


On Sunday, July 27, 2014 at 2:31 PM, Patrick Wendell wrote:

 Hey All,
 
 Just a heads up, we'll cut branch-1.1 on this Friday, August 1st. Once
 the release branch is cut we'll start community QA and go into the
 normal triage process for merging patches into that branch.
 
 For Spark core, we'll be conservative in merging things past the
 freeze date (e.g. high priority fixes) to ensure a healthy amount of
 time for testing. A key focus of this release in core is improving
 overall stability and resilience of Spark core.
 
 As always, I'll encourage committers/contributors to help review
 patches this week so we can get as many things in as possible.
 People have been quite active recently, which is great!
 
 Good luck!
 - Patrick
 
 




new JDBC server test cases seems failed ?

2014-07-27 Thread Nan Zhu
[info] Suites: completed 2, aborted 0 [info] Tests: succeeded 0, 
failed 2, canceled 0, ignored 0, pending 0 [info] *** 2 TESTS FAILED ***

Best, 

-- 
Nan Zhu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)



spark.executor.memory is not applicable when running unit test in Jenkins?

2014-07-21 Thread Nan Zhu
Hi, all  

I’m running some unit tests for my Spark applications in Jenkins

it seems that even though I set spark.executor.memory to 5g, the value I get 
from Runtime.getRuntime.maxMemory is still around 1G.

Is this because Jenkins limits the process to no more than 1G (by 
default)? How can I change that?

Thanks,


--  
Nan Zhu
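
A likely explanation (an assumption, not confirmed in this thread): with a local master the executor runs inside the driver JVM, so spark.executor.memory is ignored and Runtime.getRuntime.maxMemory simply reports the heap Jenkins gave the JVM via -Xmx. A minimal check that needs no Spark at all:

```scala
object HeapCheck {
  def main(args: Array[String]): Unit = {
    // Reports the JVM's own max heap: under sbt on Jenkins this is whatever
    // -Xmx the build sets (often around 1g), regardless of any
    // spark.executor.memory passed to a local-mode SparkContext.
    val maxMb = Runtime.getRuntime.maxMemory / (1024L * 1024L)
    println(s"JVM max heap: $maxMb MB")
  }
}
```

If this prints roughly 1024, the fix is to raise the forked test JVM's -Xmx (e.g., in sbt, fork := true with javaOptions in Test += "-Xmx5g"), not the Spark setting.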



Re: Pull requests will be automatically linked to JIRA when submitted

2014-07-20 Thread Nan Zhu
Awesome!

On Saturday, July 19, 2014, Patrick Wendell pwend...@gmail.com wrote:

 Just a small note, today I committed a tool that will automatically
 mirror pull requests to JIRA issues, so contributors will no longer
 have to manually post a pull request on the JIRA when they make one.

 It will create a link on the JIRA and also make a comment to trigger
 an e-mail to people watching.

 This should make some things easier, such as avoiding accidental
 duplicate effort on the same JIRA.

 - Patrick



Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
Ah, sorry, sorry

It's ExecutorState under the deploy package

On Monday, July 14, 2014, Patrick Wendell pwend...@gmail.com wrote:

  1. The first error I met is the different SerializationVersionUID in
 ExecuterStatus
 
  I resolved by explicitly declare SerializationVersionUID in
 ExecuterStatus.scala and recompile branch-0.1-jdbc
 

 I don't think there is a class in Spark named ExecuterStatus (sic) ...
 or ExecutorStatus. Is this a class you made?



Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
I resolved the issue by setting up an internal Maven repository containing the 
Spark 1.0.1 jar compiled from branch-0.1-jdbc, and replacing the dependency on 
the central repository with our own repository.

I believe there should be a more lightweight way

Best, 

-- 
Nan Zhu


On Monday, July 14, 2014 at 6:36 AM, Nan Zhu wrote:

 Ah, sorry, sorry
 
 It's executorState under deploy package
 
 On Monday, July 14, 2014, Patrick Wendell pwend...@gmail.com 
 (mailto:pwend...@gmail.com) wrote:
   1. The first error I met is the different SerializationVersionUID in 
   ExecuterStatus
  
   I resolved by explicitly declare SerializationVersionUID in 
   ExecuterStatus.scala and recompile branch-0.1-jdbc
  
  
  I don't think there is a class in Spark named ExecuterStatus (sic) ...
  or ExecutorStatus. Is this a class you made?
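
The internal-repository workaround described at the top of this message can be sketched in an sbt build (the repository URL and version string are placeholders, not the actual values used):

```scala
// build.sbt sketch: resolve the locally built Spark 1.0.1 (branch-0.1-jdbc)
// from an internal repository instead of Maven Central.
resolvers += "internal-repo" at "https://repo.example.com/maven-releases"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1-jdbc"
```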



assign SPARK-2126 to me?

2014-06-19 Thread Nan Zhu
Hi, all

Any admin can assign this issue 
https://issues.apache.org/jira/browse/SPARK-2126 to me?

I have started working on this

Thanks,

-- 
Nan Zhu



anyone can mark this issue as resolved?

2014-06-17 Thread Nan Zhu
Hi, 

Just found it occasionally 

https://issues.apache.org/jira/browse/SPARK-1471 

Best, 

-- 
Nan Zhu



Re: Add my JIRA username (hsaputra) to Spark's contributor's list

2014-06-03 Thread Nan Zhu
I think I lost that permission too?  

Patrick once helped me recover it, but I seem to have lost it again.

My username is CodingCat, or Nan Zhu (I’m not sure which one you used when 
doing this).

Best,  

--  
Nan Zhu


On Tuesday, June 3, 2014 at 2:39 PM, Henry Saputra wrote:

 Thanks Matei!
  
 - Henry
  
 On Tue, Jun 3, 2014 at 11:36 AM, Matei Zaharia matei.zaha...@gmail.com 
 (mailto:matei.zaha...@gmail.com) wrote:
  Done. Looks like this was lost in the JIRA import.
   
  Matei
   
  On Jun 3, 2014, at 11:33 AM, Henry Saputra henry.sapu...@gmail.com 
  (mailto:henry.sapu...@gmail.com) wrote:
   
   Hi,

   Could someone with right karma kindly add my username (hsaputra) to
   Spark's contributor list?

   I was added before but somehow now I can no longer assign ticket to
   myself nor update tickets I am working on.


   Thanks,

   - Henry  



Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Nan Zhu
Hi, Sean   

I ran into the same problem,

but when I changed MASTER=“local” to MASTER=“local[2]”

everything went back to normal.

Hadn’t gotten a chance to ask here

Best,  

--  
Nan Zhu


On Friday, May 30, 2014 at 9:09 AM, Sean Owen wrote:

 Guys I'm struggling to debug some strange behavior in a simple
 Streaming + Java + Kafka example -- in fact, a simplified version of
 JavaKafkaWordcount, that is just calling print() on a sequence of
 messages.
  
 Data is flowing, but it only appears to work for a few periods --
 sometimes 0 -- before ceasing to call any actions. Sorry for lots of
 log posting but it may illustrate to someone who knows this better
 what is happening:
  
  
  
 Key action in the logs seems to be as follows -- it works a few times:
  
 ...
 2014-05-30 13:53:50 INFO ReceiverTracker:58 - Stream 0 received 0 blocks
 2014-05-30 13:53:50 INFO JobScheduler:58 - Added jobs for time 140145443 
 ms
 ---
 Time: 140145443 ms
 ---
  
 2014-05-30 13:53:50 INFO JobScheduler:58 - Starting job streaming job
 140145443 ms.0 from job set of time 140145443 ms
 2014-05-30 13:53:50 INFO JobScheduler:58 - Finished job streaming job
 140145443 ms.0 from job set of time 140145443 ms
 2014-05-30 13:53:50 INFO JobScheduler:58 - Total delay: 0.004 s for
 time 140145443 ms (execution: 0.000 s)
 2014-05-30 13:53:50 INFO MappedRDD:58 - Removing RDD 2 from persistence list
 2014-05-30 13:53:50 INFO BlockManager:58 - Removing RDD 2
 2014-05-30 13:53:50 INFO BlockRDD:58 - Removing RDD 1 from persistence list
 2014-05-30 13:53:50 INFO BlockManager:58 - Removing RDD 1
 2014-05-30 13:53:50 INFO KafkaInputDStream:58 - Removing blocks of
 RDD BlockRDD[1] at BlockRDD at ReceiverInputDStream.scala:69 of time
 140145443 ms
 2014-05-30 13:54:00 INFO ReceiverTracker:58 - Stream 0 received 0 blocks
 2014-05-30 13:54:00 INFO JobScheduler:58 - Added jobs for time 140145444 
 ms
 ...
  
  
 Then works with some additional, different output in the logs -- here
 you see output is flowing too:
  
 ...
 2014-05-30 13:54:20 INFO ReceiverTracker:58 - Stream 0 received 2 blocks
 2014-05-30 13:54:20 INFO JobScheduler:58 - Added jobs for time 140145446 
 ms
 2014-05-30 13:54:20 INFO JobScheduler:58 - Starting job streaming job
 140145446 ms.0 from job set of time 140145446 ms
 2014-05-30 13:54:20 INFO SparkContext:58 - Starting job: take at
 DStream.scala:593
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Got job 1 (take at
 DStream.scala:593) with 1 output partitions (allowLocal=true)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Final stage: Stage 1(take
 at DStream.scala:593)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Parents of final stage: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Missing parents: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Computing the requested
 partition locally
 2014-05-30 13:54:20 INFO BlockManager:58 - Found block
 input-0-1401454458400 locally
 2014-05-30 13:54:20 INFO SparkContext:58 - Job finished: take at
 DStream.scala:593, took 0.007007 s
 2014-05-30 13:54:20 INFO SparkContext:58 - Starting job: take at
 DStream.scala:593
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Got job 2 (take at
 DStream.scala:593) with 1 output partitions (allowLocal=true)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Final stage: Stage 2(take
 at DStream.scala:593)
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Parents of final stage: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Missing parents: List()
 2014-05-30 13:54:20 INFO DAGScheduler:58 - Computing the requested
 partition locally
 2014-05-30 13:54:20 INFO BlockManager:58 - Found block
 input-0-1401454459400 locally
 2014-05-30 13:54:20 INFO SparkContext:58 - Job finished: take at
 DStream.scala:593, took 0.002217 s
 ---
 Time: 140145446 ms
 ---
 99,true,-0.11342268416043325
 17,false,1.6732879882133793
 ...
  
  
 Then keeps repeating the following with no more evidence that the
 print() action is being called:
  
 ...
 2014-05-30 13:54:20 INFO JobScheduler:58 - Finished job streaming job
 140145446 ms.0 from job set of time 140145446 ms
 2014-05-30 13:54:20 INFO MappedRDD:58 - Removing RDD 8 from persistence list
 2014-05-30 13:54:20 INFO JobScheduler:58 - Total delay: 0.019 s for
 time 140145446 ms (execution: 0.015 s)
 2014-05-30 13:54:20 INFO BlockManager:58 - Removing RDD 8
 2014-05-30 13:54:20 INFO BlockRDD:58 - Removing RDD 7 from persistence list
 2014-05-30 13:54:20 INFO BlockManager:58 - Removing RDD 7
 2014-05-30 13:54:20 INFO KafkaInputDStream:58 - Removing blocks of
 RDD BlockRDD[7] at BlockRDD at ReceiverInputDStream.scala:69 of time
 140145446 ms
 2014-05-30 13:54:20 INFO MemoryStore:58 - ensureFreeSpace(100) called
 with curMem=201, maxMem=2290719129
 2014-05-30 13:54:20 INFO MemoryStore:58 - Block input-0-1401454460400

Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Nan Zhu
If local[2] is required, then the streaming doc is actually misleading, 

as the given example is:

import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
import org.apache.spark.streaming.api._
// Create a StreamingContext with a local master
val ssc = new StreamingContext("local", "NetworkWordCount", Seconds(1))

http://spark.apache.org/docs/latest/streaming-programming-guide.html
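
For comparison, the example with the fix applied (a sketch; a receiver occupies one thread, so the master needs at least two):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]" gives the receiver its own thread and leaves at least one
// thread for processing; with plain "local" the single thread is taken
// by the receiver and no batch output is ever produced.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
```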

I created a JIRA and a PR 

https://github.com/apache/spark/pull/924 

-- 
Nan Zhu


On Friday, May 30, 2014 at 1:53 PM, Patrick Wendell wrote:

 Yeah - Spark streaming needs at least two threads to run. I actually
 thought we warned the user if they only use one (@tdas?) but the
 warning might not be working correctly - or I'm misremembering.
 
 On Fri, May 30, 2014 at 6:38 AM, Sean Owen so...@cloudera.com 
 (mailto:so...@cloudera.com) wrote:
  Thanks Nan, that does appear to fix it. I was using local. Can
  anyone say whether that's to be expected or whether it could be a bug
  somewhere?
  
  On Fri, May 30, 2014 at 2:42 PM, Nan Zhu zhunanmcg...@gmail.com 
  (mailto:zhunanmcg...@gmail.com) wrote:
   Hi, Sean
   
    I ran into the same problem,
    
    but when I changed MASTER=local to MASTER=local[2]
    
    everything went back to normal.
    
    Hadn't gotten a chance to ask here
   
   Best,
   
   --
   Nan Zhu
   
  
  
 
 
 




Re: spark 1.0 standalone application

2014-05-19 Thread Nan Zhu
Yes, you have to put spark-assembly-*.jar into the lib directory of your 
application 

Best, 

-- 
Nan Zhu
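
With sbt's default layout this usually needs nothing more: lib/ is the unmanaged-jar directory and its contents go on the classpath automatically. A sketch (the assembly path is an assumption; it varies with the Scala and Hadoop versions you built against):

```scala
// build.sbt sketch: lib/ is the default unmanagedBase, so after
//   cp $SPARK_HOME/assembly/target/scala-2.10/spark-assembly-*.jar lib/
// no setting is needed at all; override only for a non-standard directory.
unmanagedBase := baseDirectory.value / "lib"
```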


On Monday, May 19, 2014 at 9:48 PM, nit wrote:

 I am not much comfortable with sbt. I want to build a standalone application
 using spark 1.0 RC9. I can build sbt assembly for my application with Spark
 0.9.1, and I think in that case Spark is pulled from the Akka repository?
 
 Now if I want to use 1.0 RC9 for my application; what is the process ?
 (FYI, I was able to build spark-1.0 via sbt/assembly and I can see
 sbt-assembly jar; and I think I will have to copy my jar somewhere? and
 update build.sbt?)
 
 PS: I am not sure if this is the right place for this question; but since
 1.0 is still RC, I felt that this may be appropriate forum.
 
 thank! 
 
 
 
 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/spark-1-0-standalone-application-tp6698.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-19 Thread Nan Zhu
just reran my tests from rc5 

everything works

built applications with sbt against the spark-*.jar compiled with Hadoop 
2.3

+1 

-- 
Nan Zhu


On Sunday, May 18, 2014 at 11:07 PM, witgo wrote:

 How to reproduce this bug?
 
 
 -- Original --
 From: Patrick Wendell;pwend...@gmail.com (mailto:pwend...@gmail.com);
 Date: Mon, May 19, 2014 10:08 AM
 To: dev@spark.apache.org (mailto:dev@spark.apache.org)dev@spark.apache.org 
 (mailto:dev@spark.apache.org); 
 Cc: Tom Gravestgraves...@yahoo.com (mailto:tgraves...@yahoo.com); 
 Subject: Re: [VOTE] Release Apache Spark 1.0.0 (rc9)
 
 
 
 Hey Matei - the issue you found is not related to security. This patch
 a few days ago broke builds for Hadoop 1 with YARN support enabled.
 The patch directly altered the way we deal with commons-lang
 dependency, which is what is at the base of this stack trace.
 
 https://github.com/apache/spark/pull/754
 
 - Patrick
 
 On Sun, May 18, 2014 at 5:28 PM, Matei Zaharia matei.zaha...@gmail.com 
 (mailto:matei.zaha...@gmail.com) wrote:
  Alright, I've opened https://github.com/apache/spark/pull/819 with the 
  Windows fixes. I also found one other likely bug, 
  https://issues.apache.org/jira/browse/SPARK-1875, in the binary packages 
  for Hadoop1 built in this RC. I think this is due to Hadoop 1's security 
  code depending on a different version of org.apache.commons than Hadoop 2, 
  but it needs investigation. Tom, any thoughts on this?
  
  Matei
  
  On May 18, 2014, at 12:33 PM, Matei Zaharia matei.zaha...@gmail.com 
  (mailto:matei.zaha...@gmail.com) wrote:
  
   I took the always fun task of testing it on Windows, and unfortunately, I 
   found some small problems with the prebuilt packages due to recent 
   changes to the launch scripts: bin/spark-class2.cmd looks in ./jars 
   instead of ./lib for the assembly JAR, and bin/run-example2.cmd doesn't 
   quite match the master-setting behavior of the Unix based one. I'll send 
   a pull request to fix them soon.
   
   Matei
   
   
   On May 17, 2014, at 11:32 AM, Sandy Ryza sandy.r...@cloudera.com 
   (mailto:sandy.r...@cloudera.com) wrote:
   
+1

Reran my tests from rc5:

* Built the release from source.
* Compiled Java and Scala apps that interact with HDFS against it.
* Ran them in local mode.
* Ran them against a pseudo-distributed YARN cluster in both yarn-client
mode and yarn-cluster mode.


On Sat, May 17, 2014 at 10:08 AM, Andrew Or and...@databricks.com 
(mailto:and...@databricks.com) wrote:

 +1
 
 
 2014-05-17 8:53 GMT-07:00 Mark Hamstra m...@clearstorydata.com 
 (mailto:m...@clearstorydata.com):
 
  +1
  
  
  On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
  pwend...@gmail.com (mailto:pwend...@gmail.com)
   wrote:
  
  
   I'll start the voting with a +1.
   
   On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell 
   pwend...@gmail.com (mailto:pwend...@gmail.com)
   wrote:
 Please vote on releasing the following candidate as Apache Spark version
 1.0.0!
This has one bug fix and one minor feature on top of rc8:
SPARK-1864: https://github.com/apache/spark/pull/808
SPARK-1808: https://github.com/apache/spark/pull/799

The tag to be voted on is v1.0.0-rc9 (commit 920f947):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=920f947eb5a22a679c0c3186cf69ee75f6041c75

The release files, including signatures, digests, etc. can be 
found
 at:
http://people.apache.org/~pwendell/spark-1.0.0-rc9/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1017/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Tuesday, May 20, at 08:56 UTC and passes 
if
amajority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. 
There are
a few API changes in this release. Here are links to the 
associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
 http://people.apache.org/~pwendell/spark-1.0.0-rc8-docs/mllib

Re: Spark 1.0.0 rc3

2014-05-03 Thread Nan Zhu
I built with SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly 

and copied the generated jar to the lib/ directory of my application, 

but it seems that sbt cannot find the dependencies in the jar?

Everything works with the pre-built jar files downloaded from the link 
provided by Patrick, though.

Best, 

-- 
Nan Zhu


On Thursday, May 1, 2014 at 11:16 PM, Madhu wrote:

 I'm guessing EC2 support is not there yet?
 
 I was able to build using the binary download on both Windows 7 and RHEL 6
 without issues.
 I tried to create an EC2 cluster, but saw this:
 
 ~/spark-ec2
 Initializing spark
 ~ ~/spark-ec2
 ERROR: Unknown Spark version
 Initializing shark
 ~ ~/spark-ec2 ~/spark-ec2
 ERROR: Unknown Shark version
 
 The spark dir on the EC2 master has only a conf dir, so it didn't deploy
 properly.
 
 
 
 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-1-0-0-rc3-tp6427p6456.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: Any plans for new clustering algorithms?

2014-04-21 Thread Nan Zhu
I thought those were files on spark.apache.org? 

-- 
Nan Zhu


On Monday, April 21, 2014 at 9:09 PM, Xiangrui Meng wrote:

 The markdown files are under spark/docs. You can submit a PR for
 changes. -Xiangrui
 
 On Mon, Apr 21, 2014 at 6:01 PM, Sandy Ryza sandy.r...@cloudera.com 
 (mailto:sandy.r...@cloudera.com) wrote:
  How do I get permissions to edit the wiki?
  
  
  On Mon, Apr 21, 2014 at 3:19 PM, Xiangrui Meng men...@gmail.com 
  (mailto:men...@gmail.com) wrote:
  
   Cannot agree more with your words. Could you add one section about
   how and what to contribute to MLlib's guide? -Xiangrui
   
   On Mon, Apr 21, 2014 at 1:41 PM, Nick Pentreath
   nick.pentre...@gmail.com (mailto:nick.pentre...@gmail.com) wrote:
 I'd say a section in the how-to-contribute page would be a good place to put
 this.

In general I'd say that the criteria for inclusion of an algorithm is it
   should be high quality, widely known, used and accepted (citations and
   concrete use cases as examples of this), scalable and parallelizable, well
   documented and with reasonable expectation of dev support

Sent from my iPhone

 On 21 Apr 2014, at 19:59, Sandy Ryza sandy.r...@cloudera.com 
 (mailto:sandy.r...@cloudera.com) wrote:
 
  If it's not done already, would it make sense to codify this philosophy
  somewhere? I imagine this won't be the first time this discussion comes
  up, and it would be nice to have a doc to point to. I'd be happy to take a
  stab at this.
 
 
  On Mon, Apr 21, 2014 at 10:54 AM, Xiangrui Meng men...@gmail.com 
  (mailto:men...@gmail.com)
   wrote:
  
  +1 on Sean's comment. MLlib covers the basic algorithms but we
  definitely need to spend more time on how to make the design 
  scalable.
  For example, think about current ProblemWithAlgorithm naming 
  scheme.
  That being said, new algorithms are welcomed. I wish they are
  well-established and well-understood by users. They shouldn't be
  research algorithms tuned to work well with a particular dataset but
  not tested widely. You see the change log from Mahout:
  
  ===
  The following algorithms that were marked deprecated in 0.8 have 
  been
  removed in 0.9:
  
  From Clustering:
  Switched LDA implementation from using Gibbs Sampling to Collapsed
  Variational Bayes (CVB)
  Meanshift
  MinHash - removed due to poor performance, lack of support and lack 
  of
  usage
  
  From Classification (both are sequential implementations)
  Winnow - lack of actual usage and support
  Perceptron - lack of actual usage and support
  
  Collaborative Filtering
  SlopeOne implementations in
  org.apache.mahout.cf.taste.hadoop.slopeone and
  org.apache.mahout.cf.taste.impl.recommender.slopeone
  Distributed pseudo recommender in
  org.apache.mahout.cf.taste.hadoop.pseudo
  TreeClusteringRecommender in
  org.apache.mahout.cf.taste.impl.recommender
  
  Mahout Math
  Hadoop entropy stuff in org.apache.mahout.math.stats.entropy
  ===
  
  In MLlib, we should include the algorithms users know how to use and
  we can provide support rather than letting algorithms come and go.
  
  My $0.02,
  Xiangrui
  
   On Mon, Apr 21, 2014 at 10:23 AM, Sean Owen so...@cloudera.com 
   (mailto:so...@cloudera.com)
   wrote:
 On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us
 (mailto:p...@mult.ifario.us) wrote:
 - MLlib as Mahout.next would be unfortunate. There are some gems in
 Mahout, but there are also lots of rocks. Setting a minimal bar of
 working, correctly implemented, and documented requires a surprising
 amount of work.
   
   
   As someone with first-hand knowledge, this is correct. To Sang's
   question, I can't see value in 'porting' Mahout since it is based 
   on a
   quite different paradigm. About the only part that translates is 
   the
   algorithm concept itself.
   
   This is also the cautionary tale. The contents of the project have
   ended up being a number of drive-by contributions of 
   implementations
   that, while individually perhaps brilliant (perhaps), didn't
   necessarily match any other implementation in structure, 
   input/output,
   libraries used. The implementations were often a touch academic. 
   The
   result was hard to document, maintain, evolve or use.
   
   Far more of the structure of the MLlib implementations are 
   consistent
   by virtue of being built around Spark core already. That's great.
   
   One can't wait to completely build the foundation

Re: Flaky streaming tests

2014-04-07 Thread Nan Zhu
I've hit this issue when Jenkins seemed to be very busy

On Monday, April 7, 2014, Kay Ousterhout k...@eecs.berkeley.edu wrote:

 Hi all,

 The InputStreamsSuite seems to have some serious flakiness issues -- I've
 seen the file input stream fail many times and now I'm seeing some actor
 input stream test failures (

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
 )
 on what I think is an unrelated change.  Does anyone know anything about
 these?  Should we just remove some of these tests since they seem to be
 constantly failing?

 -Kay



a weird test case in Streaming

2014-03-29 Thread Nan Zhu
Hi, all  

The “recovery with file input stream” test in Streaming.CheckpointSuite 
sometimes fails even when you are working on a totally irrelevant part; I have 
hit this problem 3+ times.

I assume this test case is likely to fail when the testing servers are very 
busy?

Two cases from others:

Sean: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13561/

Mark: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13531/


Best,  

--  
Nan Zhu




Re: Travis CI

2014-03-29 Thread Nan Zhu
Hi,   

Is the migration from Jenkins to Travis finished?

I think Travis is actually not stable, based on my observations over the past 
few days (and Jenkins has become unstable too……  :-(  ). I’m actively working 
on two PRs related to DAGScheduler, and I saw:

Problems on Travis:  

1. the test “large number of iterations” in BagelSuite sometimes fails because 
it doesn’t output anything within 10 seconds

2. hive/test is usually aborted because it doesn’t output anything within 10 
minutes

3. a test case in Streaming.CheckpointSuite fails  

4. hive/test doesn’t finish in 50 minutes and is aborted

Problems on Jenkins:

1. builds don’t finish in 90 minutes and the process is aborted

2. the same as problem 3 on Travis

Some of these problems appeared on Jenkins months ago, but not as often

I’m not complaining; I know the admins are working hard to keep the community 
running well in every respect.  

I’m just reporting what I saw in the hope that it helps you identify the 
problem.

Thank you  

--  
Nan Zhu


On Tuesday, March 25, 2014 at 10:11 PM, Patrick Wendell wrote:

 Ya It's been a little bit slow lately because of a high error rate in
 interactions with the git-hub API. Unfortunately we are pretty slammed
 for the release and haven't had a ton of time to do further debugging.
  
 - Patrick
  
 On Tue, Mar 25, 2014 at 7:13 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  I just found that the Jenkins is not working from this afternoon
   
  for one PR, the first time build failed after 90 minutes, the second time it
  has run for more than 2 hours, no result is returned
   
  Best,
   
  --
  Nan Zhu
   
   
  On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote:
   
  That's not correct - like Michael said the Jenkins build remains the
  reference build for now.
   
  On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com 
  (mailto:zhunanmcg...@gmail.com) wrote:
   
  I assume the Jenkins is not working now?
   
  Best,
   
  --
  Nan Zhu
   
   
  On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:
   
  Just a quick note to everyone that Patrick and I are playing around with
  Travis CI on the Spark github repository. For now, travis does not run all
  of the test cases, so will only be turned on experimentally. Long term it
  looks like Travis might give better integration with github, so we are
  going to see if it is feasible to get all of our tests running on it.
   
  *Jenkins remains the reference CI and should be consulted before merging
  pull requests, independent of what Travis says.*
   
  If you have any questions or want to help out with the investigation, let
  me know!
   
  Michael  



Re: Migration to the new Spark JIRA

2014-03-29 Thread Nan Zhu
That’s great!  

Andy, thank you for all your contributions to the community !

Best,  

--  
Nan Zhu


On Saturday, March 29, 2014 at 11:40 PM, Patrick Wendell wrote:

 Hey All,
  
 We've successfully migrated the Spark JIRA to the Apache infrastructure.
 This turned out to be a huge effort, lead by Andy Konwinski, who deserves
 all of our deepest appreciation for managing this complex migration
  
 Since Apache runs the same JIRA version as Spark's existing JIRA, there is
 no new software to learn. A few things to note though:
  
 - The issue tracker for Spark is now at:
 https://issues.apache.org/jira/browse/SPARK
  
 - You can sign up to receive an e-mail feed of JIRA updates by e-mailing:
 issues-subscr...@spark.apache.org (mailto:issues-subscr...@spark.apache.org)
  
 - DO NOT create issues on the old JIRA. I'll try to disable this so that it
 is read-only.
  
 - You'll need to create an account at the new site if you don't have one
 already.
  
 - We've imported all the old JIRA's. In some cases the import tool can't
 correctly guess the assignee for the JIRA, so we may have to do some manual
 assignment.
  
 - If you feel like you don't have sufficient permissions on the new JIRA,
 please send me an e-mail. I tried to add all of the committers as
 administrators but I may have missed some.
  
 Thanks,
 Patrick
  
  




Re: Mailbomb from amplabs jenkins ?

2014-03-27 Thread Nan Zhu
Yes, it sends one for every PR you were involved in. 

I think Patrick is doing something on Jenkins; he just stopped some test 
jobs manually. 

Best, 

-- 
Nan Zhu


On Thursday, March 27, 2014 at 11:07 PM, Mridul Muralidharan wrote:

 Got some 100-odd mails from Jenkins (?) saying "Can one of the admins
 verify this patch?"
 Part of the upgrade, or some other issue?
 It significantly reduced the SNR of my inbox!
 
 Regards,
 Mridul
 
 




Re: Travis CI

2014-03-25 Thread Nan Zhu
I assume Jenkins is not working now?

Best, 

-- 
Nan Zhu



On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:

 Just a quick note to everyone that Patrick and I are playing around with
 Travis CI on the Spark github repository. For now, travis does not run all
 of the test cases, so will only be turned on experimentally. Long term it
 looks like Travis might give better integration with github, so we are
 going to see if it is feasible to get all of our tests running on it.
 
 *Jenkins remains the reference CI and should be consulted before merging
 pull requests, independent of what Travis says.*
 
 If you have any questions or want to help out with the investigation, let
 me know!
 
 Michael 



Re: Travis CI

2014-03-25 Thread Nan Zhu
I just found that Jenkins has not been working since this afternoon.

For one PR, the first build failed after 90 minutes; the second has been
running for more than 2 hours with no result returned.

Best, 

-- 
Nan Zhu



On Tuesday, March 25, 2014 at 10:06 PM, Patrick Wendell wrote:

 That's not correct - like Michael said the Jenkins build remains the
 reference build for now.
 
 On Tue, Mar 25, 2014 at 7:03 PM, Nan Zhu zhunanmcg...@gmail.com 
 (mailto:zhunanmcg...@gmail.com) wrote:
  I assume Jenkins is not working now?
  
  Best,
  
  --
  Nan Zhu
  
  
  On Tuesday, March 25, 2014 at 6:42 PM, Michael Armbrust wrote:
  
  Just a quick note to everyone that Patrick and I are playing around with
  Travis CI on the Spark github repository. For now, travis does not run all
  of the test cases, so will only be turned on experimentally. Long term it
  looks like Travis might give better integration with github, so we are
  going to see if it is feasible to get all of our tests running on it.
  
  *Jenkins remains the reference CI and should be consulted before merging
  pull requests, independent of what Travis says.*
  
  If you have any questions or want to help out with the investigation, let
  me know!
  
  Michael 



How the scala style checker works?

2014-03-19 Thread Nan Zhu
Hi, all  

I’m just curious about how the Scala style checker works.

While working on a PR, I found that the following line contains 101 characters, 
violating the 100-character limit:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L515

but the current Scala style checker passes this line?
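
For reference, Scalastyle's line-length rule is configured in
scalastyle-config.xml roughly like the sketch below (the class and parameter
names are Scalastyle's; the exact values in the Spark repo's config may
differ):

```xml
<!-- Sketch of a Scalastyle line-length rule; values are illustrative -->
<check level="error" class="org.scalastyle.file.FileLineLengthChecker"
       enabled="true">
  <parameters>
    <!-- Lines longer than this many characters trigger a violation -->
    <parameter name="maxLineLength"><![CDATA[100]]></parameter>
  </parameters>
</check>
```

If a 101-character line passes, it is worth checking which config file the
build actually loads and whether this check is enabled at error level.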

Best,  

--  
Nan Zhu




ping of PR #12

2014-03-10 Thread Nan Zhu
Hi, all

I understand that you are very busy, but this PR has been open for a long 
while, and there has already been some discussion on its incubator-spark 
version: 
https://github.com/apache/incubator-spark/pull/636

The current URL:

https://github.com/apache/spark/pull/12

Thank you very much! 

-- 
Nan Zhu




Undocumented configuration parameters

2014-03-05 Thread Nan Zhu
Hi, all  

Just out of curiosity, I grepped the source code of the core component 
yesterday and found that about 30 configuration parameters are used but 
undocumented.
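
A sketch of the kind of grep that surfaces such keys (the demo file below is
hypothetical; against the real tree you would point it at core/src/main/scala):

```shell
# Hypothetical demo input standing in for the Spark core sources
mkdir -p /tmp/conf-grep-demo
cat > /tmp/conf-grep-demo/Example.scala <<'EOF'
val frameSize = conf.get("spark.akka.frameSize", "10")
val timeout = conf.get("spark.worker.timeout", "60")
EOF

# -r recurse, -h drop filenames, -o print only the match, -E extended regex
grep -rhoE '"spark\.[A-Za-z0-9.]+"' /tmp/conf-grep-demo | sort -u
# prints:
# "spark.akka.frameSize"
# "spark.worker.timeout"
```

The sorted unique list can then be diffed against the documented parameters in
docs/configuration.md to find the undocumented ones.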

I have been reading the source code and writing documentation for them; I am 
nearly finished…

But before I make the PR, I would like to ask: what is the reason for the 
missing documentation? Did the contributors forget to update the docs, or are 
these parameters intentionally hidden, perhaps because some are not expected 
to be changed by the user?

Best,  

--  
Nan Zhu



Re: Spark JIRA

2014-02-28 Thread Nan Zhu
I think they are working on it? https://issues.apache.org/jira/browse/SPARK 

Best, 

-- 
Nan Zhu


On Friday, February 28, 2014 at 2:29 PM, Evan Chan wrote:

 Hey guys,
 
 There is no plan to move the Spark JIRA from the current
 https://spark-project.atlassian.net/
 
 right?
 
 -- 
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com (mailto:e...@ooyala.com) |
 
 




Discussion on SPARK-1139

2014-02-26 Thread Nan Zhu
Hi, all  

I just created a JIRA: https://spark-project.atlassian.net/browse/SPARK-1139. 
The issue discusses the following:

the new-Hadoop-API-based Spark APIs are actually a mixture of the old and new 
Hadoop APIs.

Spark APIs are still using JobConf (or Configuration) as one of the parameters, 
but Configuration has actually been replaced by mapreduce.Job in the new Hadoop 
API.

For example: 
http://codesfusion.blogspot.ca/2013/10/hadoop-wordcount-with-new-map-reduce-api.html
http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api (p10)
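
To illustrate the mismatch, a minimal sketch of job setup under the new Hadoop
API (requires hadoop-client on the classpath; Job.getInstance is the Hadoop 2
spelling — Hadoop 1 used new Job(conf)):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NewApiSketch {
    public static Job newApiJob() throws Exception {
        // Old API (org.apache.hadoop.mapred): setup goes through JobConf, e.g.
        //   JobConf conf = new JobConf();
        // New API (org.apache.hadoop.mapreduce): the Job wraps the
        // Configuration, which the JIRA notes is the new API's replacement
        // for passing a bare Configuration around.
        Configuration conf = new Configuration();
        return Job.getInstance(conf);
    }
}
```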

Personally, I think it is better to fix this design, but it will introduce 
some compatibility issues.

Just bringing it up here for your advice.

Best,  

--  
Nan Zhu