Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-05 Thread Prashant Sharma
Congratulations 

On Wed, 4 Oct, 2023, 8:52 pm huaxin gao,  wrote:

> Congratulations!
>
> On Wed, Oct 4, 2023 at 7:39 AM Chao Sun  wrote:
>
>> Congratulations!
>>
>> On Wed, Oct 4, 2023 at 5:11 AM Jungtaek Lim 
>> wrote:
>>
>>> Congrats!
>>>
>>> On Wed, Oct 4, 2023 at 5:04 PM, yangjie01 wrote:
>>>
 Congratulations!



 Jie Yang



 *From:* Dongjoon Hyun 
 *Date:* Wednesday, October 4, 2023, 13:04
 *To:* Hyukjin Kwon 
 *Cc:* Hussein Awala , Rui Wang <
 amaliu...@apache.org>, Gengliang Wang , Xiao Li <
 gatorsm...@gmail.com>, "dev@spark.apache.org" 
 *Subject:* Re: Welcome to Our New Apache Spark Committer and PMCs



 Congratulations!



 Dongjoon.



 On Tue, Oct 3, 2023 at 5:25 PM Hyukjin Kwon 
 wrote:

 Woohoo!



 On Tue, 3 Oct 2023 at 22:47, Hussein Awala  wrote:

 Congrats to all of you!



 On Tue 3 Oct 2023 at 08:15, Rui Wang  wrote:

 Congratulations! Well deserved!



 -Rui





 On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang 
 wrote:

 Congratulations to all! Well deserved!



 On Mon, Oct 2, 2023 at 10:16 PM Xiao Li  wrote:

 Hi all,

 The Spark PMC is delighted to announce that we have voted to add one
 new committer and two new PMC members. These individuals have consistently
 contributed to the project and have clearly demonstrated their expertise.

 New Committer:
 - Jiaan Geng (focusing on Spark Connect and Spark SQL)

 New PMCs:
 - Yuanjian Li
 - Yikun Jiang

 Please join us in extending a warm welcome to them in their new roles!

 Sincerely,
 The Spark PMC




Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Prashant Sharma
+1

On Thu, May 20, 2021 at 7:08 PM Wenchen Fan  wrote:

> +1
>
> On Thu, May 20, 2021 at 11:59 AM Dongjoon Hyun 
> wrote:
>
>> +1.
>>
>> Thank you, Takeshi.
>>
>> On Wed, May 19, 2021 at 7:49 PM Hyukjin Kwon  wrote:
>>
>>> Yeah, I wanted to discuss this. I agree since 2.4.x became EOL
>>>
>>> On Thu, May 20, 2021 at 10:54 AM, Sean Owen wrote:
>>>
 I agree. Such old JIRAs are 99% obsolete. If anyone objects to a
 particular issue being closed, they can comment and we can reopen. It's a
 very reversible thing. There is value in keeping JIRA up to date with
 reality.

 On Wed, May 19, 2021 at 8:47 PM Takeshi Yamamuro 
 wrote:

> Hi, dev,
>
> As you know, we have too many open JIRAs now:
> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
> Progress", Reopened)'
>
> We've recently released v2.4.8 (EOL), so I'd like to bulk-close very old
> JIRAs
> to keep the backlog manageable.
>
> As Hyukjin did the same action two years ago (for details, see:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
> I'm planning to use a similar JQL below to close them:
>
> project = SPARK AND status in (Open, "In Progress", Reopened) AND
> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
> AND updated <= -52w
>
> The total number of matched JIRAs is 741.
> Or, we might be able to close them more aggressively by removing the
> version condition:
>
> project = SPARK AND status in (Open, "In Progress", Reopened) AND
> updated <= -52w
>
> The matched number is 1484 (almost half of the current open JIRAs).
>
> If there is no objection, I'd like to do it next week or later.
> Any thoughts?
>
> Bests,
> Takeshi
> --
> ---
> Takeshi Yamamuro
>



Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Prashant Sharma
Congratulations  all!!

On Sat, Mar 27, 2021, 5:10 AM huaxin gao  wrote:

> Congratulations to you all!!
>
> On Fri, Mar 26, 2021 at 4:22 PM Yuming Wang  wrote:
>
>> Congrats!
>>
>> On Sat, Mar 27, 2021 at 7:13 AM Takeshi Yamamuro 
>> wrote:
>>
>>> Congrats, all~
>>>
>>> On Sat, Mar 27, 2021 at 7:46 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Congrats all!

 On Sat, Mar 27, 2021 at 6:56 AM, Liang-Chi Hsieh wrote:

> Congrats! Welcome!
>
>
> Matei Zaharia wrote
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers. Please
> join me
> > in welcoming them to their new role! Our new committers are:
> >
> > - Maciej Szymkiewicz (contributor to PySpark)
> > - Max Gekk (contributor to Spark SQL)
> > - Kent Yao (contributor to Spark SQL)
> > - Attila Zsolt Piros (contributor to decommissioning and Spark on
> > Kubernetes)
> > - Yi Wu (contributor to Spark Core and SQL)
> > - Gabor Somogyi (contributor to Streaming and security)
> >
> > All six of them contributed to Spark 3.1 and we’re very excited to
> have
> > them join as committers.
> >
> > Matei and the Spark PMC
> > -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-16 Thread Prashant Sharma
+1

On Tue, Feb 16, 2021 at 1:22 PM Dongjoon Hyun 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.2.
>
> The vote is open until February 19th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.0.2-rc1 (commit
> 648457905c4ea7d00e3d88048c63f360045f0714):
> https://github.com/apache/spark/tree/v3.0.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1366/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>
> The list of bug fixes going into 3.0.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env and install
> the current RC and see if anything important breaks. For Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.2?
> ===
>
> The current list of open tickets targeted at 3.0.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-19 Thread Prashant Sharma
+1

On Tue, Jan 19, 2021 at 4:38 PM Yang,Jie(INF)  wrote:

> +1
>
>
>
> *From:* Gengliang Wang 
> *Date:* Tuesday, January 19, 2021, 3:04 PM
> *To:* Jungtaek Lim 
> *Cc:* Yuming Wang , Hyukjin Kwon ,
> dev 
> *Subject:* Re: [VOTE] Release Spark 3.1.1 (RC1)
>
>
>
> +1 (non-binding)
>
>
>
>
>
> On Tue, Jan 19, 2021 at 2:05 PM Jungtaek Lim 
> wrote:
>
> +1 (non-binding)
>
>
>
> * verified signature and sha for all files (there's a glitch which I'll
> describe below)
>
> * built source (DISCLAIMER: didn't run tests) and made custom
> distribution, and built a docker image based on the distribution
>
>   - used profiles: kubernetes, hadoop-3.2, hadoop-cloud
>
> * ran some SS PySpark queries (Rate to Kafka, Kafka to Kafka) with Spark
> on k8s (used MinIO - s3 compatible - as checkpoint location)
>
>   - for Kafka reader, tested both approaches: newer (offset via admin
> client) and older (offset via consumer)
>
> * ran simple batch query with magic committer against MinIO storage &
> dynamic volume provisioning (with NFS)
>
> * verified DataStreamReader.table & DataStreamWriter.toTable works in
> PySpark (which also verifies on Scala API as well)
>
> * ran test stateful SS queries and checked the new additions of SS UI
> (state store & watermark information)
>
>
>
> A glitch while verifying the sha: the sha512 file format is different
> between the source targz and the other artifacts. My tool succeeded with the others and failed
> with the source targz, though I confirmed the sha itself is the same. Not a blocker,
> but it would be ideal if we can make it consistent.
>
>
>
> Thanks for driving the release process!
>
>
>
> On Tue, Jan 19, 2021 at 2:25 PM Yuming Wang  wrote:
>
> +1.
>
>
>
> On Tue, Jan 19, 2021 at 7:54 AM Hyukjin Kwon  wrote:
>
> I forgot to say :). I'll start with my +1.
>
>
>
> On Mon, 18 Jan 2021, 21:06 Hyukjin Kwon,  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.1.1.
>
>
>
> The vote is open until January 22nd 4PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.1.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
>
>
>
> The tag to be voted on is v3.1.1-rc1 (commit
> 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
>
> https://github.com/apache/spark/tree/v3.1.1-rc1
> 
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/
> 
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1364
> 
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/
> 
>
>
>
> The list of bug fixes going into 3.1.1 can be found at the following URL:
>
> https://s.apache.org/41kf2
> 
>
>
>
> This release is using the release script of the tag v3.1.1-rc1.
>
>
>
> FAQ
>
>
>
> ===
> What happened to 3.1.0?
>
> ===
>
> There was a technical issue during Apache Spark 3.1.0 preparation, and it
> was discussed and decided to skip 3.1.0.
> Please see
> https://spark.apache.org/news/next-official-release-spark-3.1.1.html
> 
> for more details.
>
> =
>
> How can I help test this release?
>
> =
>
>
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
>
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC via "pip install
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz
> 

Automatically mount the user-specific configurations on K8s cluster, using a config map.

2020-09-21 Thread Prashant Sharma
Hi All,

This is regarding an improvement issue, SPARK-30985
(https://github.com/apache/spark/pull/27735). Has this caught anyone's
attention yet?

Basically, SPARK_CONF_DIR hosts all the user-specific configuration files,
e.g.


   1. spark-defaults.conf - containing all the spark properties.
   2. log4j.properties - Logger configuration.
   3. core-site.xml - Hadoop related configuration.
   4. fairscheduler.xml - Spark's fair scheduling policy at the job level.
   5. metrics.properties - Spark metrics.
   6. Any other user-specific, library-specific, or framework-specific configuration file.

I wanted to know: how do users of Spark on K8s propagate these
configurations to the Spark driver in cluster mode? And what if some
configuration needs to be picked up by the executors (e.g. HDFS core-site.xml
etc...)?
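
As of today, the manual workaround I am aware of is to build the ConfigMap
yourself and mount it into the driver pod through a pod template. A minimal
sketch, assuming Spark 3.0+; the ConfigMap name, template file name, and mount
path below are illustrative assumptions:

    # package the local conf directory into a ConfigMap
    kubectl create configmap spark-user-conf --from-file=$SPARK_HOME/conf/
    # then point spark-submit at a driver pod template whose spec mounts
    # the "spark-user-conf" ConfigMap at the image's conf directory
    --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml

The PR above aims to automate this kind of plumbing.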

Please take a look at: https://github.com/apache/spark/pull/27735

Thanks,

Prashant Sharma


Re: [VOTE][RESULT] Release Spark 2.4.7 (RC3)

2020-09-12 Thread Prashant Sharma
You are welcome, Dongjoon - just did my part. Thanks to everyone who
contributed in various ways.

Thanks,

On Sat, Sep 12, 2020 at 9:29 AM Dongjoon Hyun 
wrote:

> Thank you, Prashant!
>
> Bests,
> Dongjoon.
>
> On Fri, Sep 11, 2020 at 7:02 PM Prashant Sharma 
> wrote:
>
>> The vote passes. Thanks to all who helped with the release!
>>
>>  (* = binding)
>> +1:
>> - Sean Owen *
>> - Wenchen Fan *
>> - Dongjoon Hyun *
>> - Mridul *
>>
>> +0: None
>>
>> -1: None
>>
>>
>>
>>


[ANNOUNCE] Announcing Apache Spark 2.4.7

2020-09-12 Thread Prashant Sharma
Hello Folks,

We are happy to announce the availability of Spark 2.4.7!
Spark 2.4.7 is a maintenance release containing stability fixes. This
release is based on the branch-2.4 maintenance branch of Spark. We strongly
recommend that all 2.4 users upgrade to this stable release.

To download Spark 2.4.7, head over to the download page:
http://spark.apache.org/downloads.html

Note that you might need to clear your browser cache or use
`Private`/`Incognito` mode, depending on your browser.

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-7.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Enjoy the release!

Thanks,
Prashant Sharma


[VOTE][RESULT] Release Spark 2.4.7 (RC3)

2020-09-11 Thread Prashant Sharma
The vote passes. Thanks to all who helped with the release!

 (* = binding)
+1:
- Sean Owen *
- Wenchen Fan *
- Dongjoon Hyun *
- Mridul *

+0: None

-1: None


Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-10 Thread Prashant Sharma
Thanks again, looks like it works now. Please take a look.

On Thu, Sep 10, 2020 at 11:42 AM Prashant Sharma 
wrote:

> Hi Wenchen and Sean,
>
> Thanks for looking into this and all the details.
>
> I have now updated the key in those keyservers. Now, how do I refresh
> nexus?
>
> Thanks,
>
> On Thu, Sep 10, 2020 at 9:13 AM Sean Owen  wrote:
>
>> Yes I can do that and I am sure it's fine, but why has it been visible in
>> the past and not now? Minor thing to fix.
>>
>> On Wed, Sep 9, 2020, 9:09 PM Wenchen Fan  wrote:
>>
>>> Sean, you need to login https://repository.apache.org/ and pick the
>>> staging repo 1361, then check its status, you will see this
>>> [image: image.png]
>>>
>>> On Thu, Sep 10, 2020 at 9:26 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>>
>>>> I imported our KEYS file locally [1] to validate ... did not use
>>>> external keyserver.
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> [1] wget https://dist.apache.org/repos/dist/dev/spark/KEYS -O - | gpg
>>>> --import
>>>>
>>>> On Wed, Sep 9, 2020 at 8:03 PM Wenchen Fan  wrote:
>>>>
>>>>> I checked
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1361/ ,
>>>>> it says the Signature Validation failed.
>>>>>
>>>>> Prashant, can you double-check your gpg key and make sure it's
>>>>> uploaded to public key servers like the following?
>>>>> http://pool.sks-keyservers.net:11371
>>>>> http://keyserver.ubuntu.com:11371
>>>>>
>>>>>
>>>>> On Wed, Sep 9, 2020 at 6:12 AM Mridul Muralidharan 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> Signatures, digests, etc check out fine.
>>>>>> Checked out tag and built/tested with -Pyarn -Phadoop-2.7 -Phive
>>>>>> -Phive-thriftserver -Pmesos -Pkubernetes
>>>>>>
>>>>>> Thanks,
>>>>>> Mridul
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 8, 2020 at 8:55 AM Prashant Sharma 
>>>>>> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 2.4.7.
>>>>>>>
>>>>>>> The vote is open until Sep 11th at 9AM PST and passes if a majority
>>>>>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 2.4.7
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> http://spark.apache.org/
>>>>>>>
>>>>>>> There are currently no issues targeting 2.4.7 (try project = SPARK
>>>>>>> AND "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In
>>>>>>> Progress"))
>>>>>>>
>>>>>>> The tag to be voted on is v2.4.7-rc3 (commit
>>>>>>> 14211a19f53bd0f413396582c8970e3e0a74281d):
>>>>>>> https://github.com/apache/spark/tree/v2.4.7-rc3
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/
>>>>>>>
>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>>
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1361/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-docs/
>>>>>>>
>>>>>>> The list of bug fixes going into 2.4.7 can be found at the following
>>>>>>> URL:
>>>>>>> https://s.apache.org/spark-v2.4.7-rc3
>>>>>>>
>>>>>>> This release is using the release scr

Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-10 Thread Prashant Sharma
Hi Wenchen and Sean,

Thanks for looking into this and all the details.

I have now updated the key in those keyservers. Now, how do I refresh
nexus?
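
For reference, updating the key on those keyservers boils down to something
like the following (the key ID is a placeholder):

    # push the public signing key to the keyservers Wenchen listed above
    gpg --keyserver hkp://keyserver.ubuntu.com --send-keys <KEY_ID>
    gpg --keyserver hkp://pool.sks-keyservers.net --send-keys <KEY_ID>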

Thanks,

On Thu, Sep 10, 2020 at 9:13 AM Sean Owen  wrote:

> Yes I can do that and I am sure it's fine, but why has it been visible in
> the past and not now? Minor thing to fix.
>
> On Wed, Sep 9, 2020, 9:09 PM Wenchen Fan  wrote:
>
>> Sean, you need to login https://repository.apache.org/ and pick the
>> staging repo 1361, then check its status, you will see this
>> [image: image.png]
>>
>> On Thu, Sep 10, 2020 at 9:26 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> I imported our KEYS file locally [1] to validate ... did not use
>>> external keyserver.
>>>
>>> Regards,
>>> Mridul
>>>
>>> [1] wget https://dist.apache.org/repos/dist/dev/spark/KEYS -O - | gpg
>>> --import
>>>
>>> On Wed, Sep 9, 2020 at 8:03 PM Wenchen Fan  wrote:
>>>
>>>> I checked
>>>> https://repository.apache.org/content/repositories/orgapachespark-1361/ ,
>>>> it says the Signature Validation failed.
>>>>
>>>> Prashant, can you double-check your gpg key and make sure it's uploaded
>>>> to public key servers like the following?
>>>> http://pool.sks-keyservers.net:11371
>>>> http://keyserver.ubuntu.com:11371
>>>>
>>>>
>>>> On Wed, Sep 9, 2020 at 6:12 AM Mridul Muralidharan 
>>>> wrote:
>>>>
>>>>>
>>>>> +1
>>>>>
>>>>> Signatures, digests, etc check out fine.
>>>>> Checked out tag and built/tested with -Pyarn -Phadoop-2.7 -Phive
>>>>> -Phive-thriftserver -Pmesos -Pkubernetes
>>>>>
>>>>> Thanks,
>>>>> Mridul
>>>>>
>>>>>
>>>>> On Tue, Sep 8, 2020 at 8:55 AM Prashant Sharma 
>>>>> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 2.4.7.
>>>>>>
>>>>>> The vote is open until Sep 11th at 9AM PST and passes if a majority
>>>>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 2.4.7
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> There are currently no issues targeting 2.4.7 (try project = SPARK
>>>>>> AND "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In
>>>>>> Progress"))
>>>>>>
>>>>>> The tag to be voted on is v2.4.7-rc3 (commit
>>>>>> 14211a19f53bd0f413396582c8970e3e0a74281d):
>>>>>> https://github.com/apache/spark/tree/v2.4.7-rc3
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1361/
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-docs/
>>>>>>
>>>>>> The list of bug fixes going into 2.4.7 can be found at the following
>>>>>> URL:
>>>>>> https://s.apache.org/spark-v2.4.7-rc3
>>>>>>
>>>>>> This release is using the release script of the tag v2.4.7-rc3.
>>>>>>
>>>>>> FAQ
>>>>>>
>>>>>>
>>>>>> =
>>>>>> How can I help test this release?
>>>>>> =
>>>>>>
>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>> an existing Spark workload and running on this release candidate, then
>>>>>> reporting any regressions.
>>>>>>
>>>>>> If you're working in PySpark, you can set up a virtual env and install
>>>>>> the current RC and see if anything important breaks. For Java/Scala,
>>>>>> you can add the staging repository to your project's resolvers and test
>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>>
>>>>>> ===
>>>>>> What should happen to JIRA tickets still targeting 2.4.7?
>>>>>> ===
>>>>>>
>>>>>> The current list of open tickets targeted at 2.4.7 can be found at:
>>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>>> Version/s" = 2.4.7
>>>>>>
>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>> be worked on immediately. Everything else please retarget to an
>>>>>> appropriate release.
>>>>>>
>>>>>> ==
>>>>>> But my bug isn't fixed?
>>>>>> ==
>>>>>>
>>>>>> In order to make timely releases, we will typically not hold the
>>>>>> release unless the bug in question is a regression from the previous
>>>>>> release. That being said, if there is something which is a regression
>>>>>> that has not been correctly targeted please ping me or a committer to
>>>>>> help target the issue.
>>>>>>
>>>>>


[VOTE] Release Spark 2.4.7 (RC3)

2020-09-08 Thread Prashant Sharma
Please vote on releasing the following candidate as Apache Spark
version 2.4.7.

The vote is open until Sep 11th at 9AM PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.7
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.7 (try project = SPARK AND
"Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.7-rc3 (commit
14211a19f53bd0f413396582c8970e3e0a74281d):
https://github.com/apache/spark/tree/v2.4.7-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1361/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-docs/

The list of bug fixes going into 2.4.7 can be found at the following URL:
https://s.apache.org/spark-v2.4.7-rc3

This release is using the release script of the tag v2.4.7-rc3.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install
the current RC and see if anything important breaks. For Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
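
For the PySpark route, the steps look roughly like this (the exact tarball name
under the -bin/ directory above is an assumption based on the usual RC layout):

    # create an isolated environment and install the RC build of PySpark
    python -m venv spark-2.4.7-rc3-test
    source spark-2.4.7-rc3-test/bin/activate
    pip install https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/pyspark-2.4.7.tar.gz
    # quick smoke check
    python -c "import pyspark; print(pyspark.__version__)"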

===
What should happen to JIRA tickets still targeting 2.4.7?
===

The current list of open tickets targeted at 2.4.7 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.7

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: Running K8s integration tests for changes in core?

2020-08-20 Thread Prashant Sharma
Another option is to have something like a "presubmit" PR build. In
other words, running the entire 4-hour build plus the K8s integration tests on
each pushed commit is too much, and at the same time there is a chance that one
change can inadvertently affect other components (as you just said).

A presubmit build (which includes the K8s integration tests) would be run once
the PR receives an LGTM from "Approved reviewers". This is one criterion that
comes to my mind; others may have better suggestions.

On Thu, Aug 20, 2020 at 12:25 AM shane knapp ☠  wrote:

> we'll be gated by the number of ubuntu workers w/minikube and docker, but
> it shouldn't be too bad as the full integration test takes ~45m, vs 4+ hrs
> for the regular PRB.
>
> i can enable this in about 1m of time if the consensus is for us to want
> this.
>
> On Wed, Aug 19, 2020 at 11:37 AM Holden Karau 
> wrote:
>
>> Sounds good. In the meantime would folks committing things in core run
>> the K8s PRB or run it locally? A second change this morning was committed
>> that broke the K8s PR tests.
>>
>> On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma 
>> wrote:
>>
>>> +1, we should enable.
>>>
>>> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau 
>>> wrote:
>>>
>>>> Hi Dev Folks,
>>>>
>>>> I was wondering how people feel about enabling the K8s PRB
>>>> automatically for all core changes? Sometimes I forget that a change might
>>>> impact one of the K8s integration tests since a bunch of them look at log
>>>> messages. Would folks be OK with turning on the K8s integration PRB for all
>>>> core changes as well as K8s changes?
>>>>
>>>> Cheers,
>>>>
>>>> Holden :)
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Running K8s integration tests for changes in core?

2020-08-18 Thread Prashant Sharma
+1, we should enable.

On Wed, Aug 19, 2020 at 9:18 AM Holden Karau  wrote:

> Hi Dev Folks,
>
> I was wondering how people feel about enabling the K8s PRB automatically
> for all core changes? Sometimes I forget that a change might impact one of
> the K8s integration tests since a bunch of them look at log messages. Would
> folks be OK with turning on the K8s integration PRB for all core changes as
> well as K8s changes?
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-09 Thread Prashant Sharma
Thanks for letting us know. So this vote is cancelled in favor of RC2.



On Sun, Aug 9, 2020 at 8:31 AM Takeshi Yamamuro 
wrote:

> Thanks for letting us know about the two issues above, Dongjoon.
>
> 
> I've checked the release materials (signatures, tag, ...) and it looks
> fine, too.
> Also, I run the tests on my local Mac (java 1.8.0) with the options
> `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
> -Psparkr`
> and they passed.
>
> Bests,
> Takeshi
>
>
>
> On Sun, Aug 9, 2020 at 11:06 AM Dongjoon Hyun 
> wrote:
>
>> Another instance is SPARK-31703, which was filed on May 13th and whose PR
>> arrived two days ago.
>>
>> [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on
>> big endian platforms
>> https://github.com/apache/spark/pull/29383
>>
>> It seems that the patch is already ready in this case.
>> I raised the priority of SPARK-31703 to `Blocker` for both Apache Spark
>> 2.4.7 and 3.0.1.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Sat, Aug 8, 2020 at 6:10 AM Holden Karau  wrote:
>>
>>> I'm going to go ahead and vote -0 then based on that then.
>>>
>>> On Fri, Aug 7, 2020 at 11:36 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Unfortunately, there is an on-going discussion about the new decimal
>>>> correctness.
>>>>
>>>> Although we fixed one correctness issue at master and backported it
>>>> partially to 3.0/2.4, it turns out that it needs more patches to be
>>>> complete.
>>>>
>>>> Please see https://github.com/apache/spark/pull/29125 for on-going
>>>> discussion for both 3.0/2.4.
>>>>
>>>> [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with
>>>> overflowed value
>>>>
>>>> I also confirmed that 2.4.7 RC1 is affected.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Aug 6, 2020 at 2:48 PM Sean Owen  wrote:
>>>>
>>>>> +1 from me. The same as usual. Licenses and sigs look OK, builds and
>>>>> passes tests on a standard selection of profiles.
>>>>>
>>>>> On Thu, Aug 6, 2020 at 7:07 AM Prashant Sharma 
>>>>> wrote:
>>>>> >
>>>>> > Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.4.7.
>>>>> >
>>>>> > The vote is open until Aug 9th at 9AM PST and passes if a majority
>>>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>> >
>>>>> > [ ] +1 Release this package as Apache Spark 2.4.7
>>>>> > [ ] -1 Do not release this package because ...
>>>>> >
>>>>> > To learn more about Apache Spark, please see
>>>>> http://spark.apache.org/
>>>>> >
>>>>> > There are currently no issues targeting 2.4.7 (try project = SPARK
>>>>> AND "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In
>>>>> Progress"))
>>>>> >
>>>>> > The tag to be voted on is v2.4.7-rc1 (commit
>>>>> dc04bf53fe821b7a07f817966c6c173f3b3788c6):
>>>>> > https://github.com/apache/spark/tree/v2.4.7-rc1
>>>>> >
>>>>> > The release files, including signatures, digests, etc. can be found
>>>>> at:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/
>>>>> >
>>>>> > Signatures used for Spark RCs can be found in this file:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>> >
>>>>> > The staging repository for this release can be found at:
>>>>> >
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1352/
>>>>> >
>>>>> > The documentation corresponding to this release can be found at:
>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-docs/
>>>>> >
>>>>> > The list of bug fixes going into 2.4.7 can be found at the following
>>>>> URL:
>>>>> > https://s.apache.org/spark-v2.4.7-rc1
>>>>> >
>>>>> > This release is using the release script of the tag v2.4.7-rc1.
>>>>> >
>>>>> > FAQ

[VOTE] Release Spark 2.4.7 (RC1)

2020-08-06 Thread Prashant Sharma
Please vote on releasing the following candidate as Apache Spark
version 2.4.7.

The vote is open until Aug 9th at 9AM PST and passes if a majority +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.4.7
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

There are currently no issues targeting 2.4.7 (try project = SPARK AND
"Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In Progress"))

The tag to be voted on is v2.4.7-rc1 (commit
dc04bf53fe821b7a07f817966c6c173f3b3788c6):
https://github.com/apache/spark/tree/v2.4.7-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS
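
Verifying a downloaded artifact against these signatures looks roughly like
this (the artifact name below is an assumption from the usual layout of the
-bin/ directory; substitute whichever file you download):

    # import the Spark release KEYS and check the detached signature
    wget https://dist.apache.org/repos/dist/dev/spark/KEYS -O - | gpg --import
    wget https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/pyspark-2.4.7.tar.gz
    wget https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-bin/pyspark-2.4.7.tar.gz.asc
    gpg --verify pyspark-2.4.7.tar.gz.asc pyspark-2.4.7.tar.gz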

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1352/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc1-docs/

The list of bug fixes going into 2.4.7 can be found at the following URL:
https://s.apache.org/spark-v2.4.7-rc1

This release is using the release script of the tag v2.4.7-rc1.

FAQ


=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install
the current RC and see if anything important breaks. For Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).

===
What should happen to JIRA tickets still targeting 2.4.7?
===

The current list of open tickets targeted at 2.4.7 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 2.4.7

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: spark-on-k8s is still experimental?

2020-08-05 Thread Prashant Sharma
My thoughts are that an external shuffle service is not a blocker for Spark on
K8s to be production ready.

Others may think otherwise, but there are other ways to enable auto scaling.
An external shuffle service would be useful for all deployment modes, be it
YARN, standalone, or K8s, and not just K8s.

As for GA, I have not yet seen a very large deployment working in production.
Others can share how they are using Spark on K8s; that would give us more
confidence about moving towards GA.

Thanks,

On Thu, Aug 6, 2020 at 9:18 AM Holden Karau  wrote:

> Sounds good. I think we can make a slightly stronger statement than that
> one (left a comment, but it's my own thoughts so others should chime in if
> they have a different opinion).
>
> On Wed, Aug 5, 2020 at 7:32 PM Takeshi Yamamuro 
> wrote:
>
>> Thanks for the info, all. okay, I understood that we need more time to
>> announce GA officially.
>> But, I'm still worried that users hesitate a bit to use this feature by
>> referring to the statement in the doc,
>> so how about updating it according to the current situation? Please check
>> my suggestion in https://github.com/apache/spark/pull/29368.
>>
>> Anyway, many thanks!
>>
>>
>> On Tue, Aug 4, 2020 at 12:26 AM Holden Karau 
>> wrote:
>>
>>> There was discussion around removing the statement and declaring it GA
>>> but I believe it was decided to leave it in until an external shuffle
>>> service is supported on K8s.
>>>
>>> On Mon, Aug 3, 2020 at 2:45 AM JackyLee  wrote:
>>>
 +1. It has worked well in our company and we have used it to support
 online services since March this year.



 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-08-02 Thread Prashant Sharma
+1

On Fri, Jul 31, 2020 at 10:18 PM Xiao Li  wrote:

> +1
>
> Xiao
>
> On Fri, Jul 31, 2020 at 9:32 AM Mridul Muralidharan 
> wrote:
>
>>
>> +1
>>
>> Thanks,
>> Mridul
>>
>> On Thu, Jul 30, 2020 at 4:49 PM Holden Karau 
>> wrote:
>>
>>> Hi Spark Developers,
>>>
>>> After the discussion of the proposal to amend Spark committer
>>> guidelines, it appears folks are generally in agreement on policy
>>> clarifications. (See
>>> https://lists.apache.org/thread.html/r6706e977fda2c474a7f24775c933c2f46ea19afbfafb03c90f6972ba%40%3Cdev.spark.apache.org%3E,
>>> as well as some on the private@ list for PMC.) Therefore, I am calling
>>> for a majority VOTE, which will last at least 72 hours. See the ASF voting
>>> rules for procedural changes at
>>> https://www.apache.org/foundation/voting.html.
>>>
>>> The proposal is to add a new section entitled “When to Commit” to the
>>> Spark committer guidelines, currently at
>>> https://spark.apache.org/committers.html.
>>>
>>> ** START OF CHANGE **
>>>
>>> PRs shall not be merged during active, on-topic discussion unless they
>>> address issues such as critical security fixes of a public vulnerability.
>>> Under extenuating circumstances, PRs may be merged during active, off-topic
>>> discussion and the discussion directed to a more appropriate venue. Time
>>> should be given prior to merging for those involved with the conversation
>>> to explain if they believe they are on-topic.
>>>
>>> Lazy consensus requires giving time for discussion to settle while
>>> understanding that people may not be working on Spark as their full-time
>>> job and may take holidays. It is believed that by doing this, we can limit
>>> how often people feel the need to exercise their veto.
>>>
>>> All -1s with justification merit discussion.  A -1 from a non-committer
>>> can be overridden only with input from multiple committers, and suitable
>>> time must be offered for any committer to raise concerns. A -1 from a
>>> committer who cannot be reached requires a consensus vote of the PMC under
>>> ASF voting rules to determine the next steps within the ASF guidelines for
>>> code vetoes ( https://www.apache.org/foundation/voting.html ).
>>>
>>> These policies serve to reiterate the core principle that code must not
>>> be merged with a pending veto or before a consensus has been reached (lazy
>>> or otherwise).
>>>
>>> It is the PMC’s hope that vetoes continue to be infrequent, and when
>>> they occur, that all parties will take the time to build consensus prior to
>>> additional feature work.
>>>
>>> Being a committer means exercising your judgement while working in a
>>> community of people with diverse views. There is nothing wrong in getting a
>>> second (or third or fourth) opinion when you are uncertain. Thank you for
>>> your dedication to the Spark project; it is appreciated by the developers
>>> and users of Spark.
>>>
>>> It is hoped that these guidelines do not slow down development; rather,
>>> by removing some of the uncertainty, the goal is to make it easier for us
>>> to reach consensus. If you have ideas on how to improve these guidelines or
>>> other Spark project operating procedures, you should reach out on the dev@
>>> list to start the discussion.
>>>
>>> ** END OF CHANGE TEXT **
>>>
>>> I want to thank everyone who has been involved with the discussion
>>> leading to this proposal and those of you who take the time to vote on
>>> this. I look forward to our continued collaboration in building Apache
>>> Spark.
>>>
>>> I believe we share the goal of creating a welcoming community around the
>>> project. On a personal note, it is my belief that consistently applying
>>> this policy around commits can help to make a more accessible and welcoming
>>> community.
>>>
>>> Kind Regards,
>>>
>>> Holden
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> 
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-14 Thread Prashant Sharma
Hi Folks,

So, I am back, and I searched the JIRAs with target version "2.4.7" and
status Resolved, and found only 2. So, are we good to go with just a couple of
JIRAs fixed? Shall I proceed with making an RC?

Thanks,
Prashant

On Thu, Jul 2, 2020 at 5:23 PM Prashant Sharma  wrote:

> Thank you, Holden.
>
> Folks, my health has gone down a bit, so I will start working on this in
> a few days. If this needs to be published sooner, then maybe someone else
> will have to help out.
>
>
>
>
>
> On Thu, Jul 2, 2020 at 10:11 AM Holden Karau  wrote:
>
>> I’m happy to have Prashant do 2.4.7 :)
>>
>> On Wed, Jul 1, 2020 at 9:40 PM Xiao Li  wrote:
>>
>>> +1 on releasing both 3.0.1 and 2.4.7
>>>
>>> Great! Three committers volunteer to be a release manager. Ruifeng,
>>> Prashant and Holden. Holden just helped release Spark 2.4.6. This time,
>>> maybe, Ruifeng and Prashant can be the release manager of 3.0.1 and 2.4.7
>>> respectively.
>>>
>>> Xiao
>>>
>>> On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> https://issues.apache.org/jira/browse/SPARK-32148 was reported
>>>> yesterday, and if the report is valid it looks to be a blocker. I'll try to
>>>> take a look sooner.
>>>>
>>>> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
>>>> shiva...@eecs.berkeley.edu> wrote:
>>>>
>>>>> Thanks Holden -- it would be great to also get 2.4.7 started
>>>>>
>>>>> Thanks
>>>>> Shivaram
>>>>>
>>>>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau 
>>>>> wrote:
>>>>> >
>>>>> > I can take care of 2.4.7 unless someone else wants to do it.
>>>>> >
>>>>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <
>>>>> jason.mo...@quantium.com.au> wrote:
>>>>> >>
>>>>> >> Hi all,
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Could I get some input on the severity of this one that I found
>>>>> yesterday?  If that’s a correctness issue, should it block this patch?  
>>>>> Let
>>>>> me know under the ticket if there’s more info that I can provide to help.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> https://issues.apache.org/jira/browse/SPARK-32136
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> Thanks,
>>>>> >>
>>>>> >> Jason.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> From: Jungtaek Lim 
>>>>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>>>>> >> To: Shivaram Venkataraman 
>>>>> >> Cc: Prashant Sharma , 郑瑞峰 <
>>>>> ruife...@foxmail.com>, Gengliang Wang ,
>>>>> gurwls223 , Dongjoon Hyun <
>>>>> dongjoon.h...@gmail.com>, Jules Damji , Holden
>>>>> Karau , Reynold Xin ,
>>>>> Yuanjian Li , "dev@spark.apache.org" <
>>>>> dev@spark.apache.org>, Takeshi Yamamuro 
>>>>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> SPARK-32130 [1] looks to be a performance regression introduced in
>>>>> Spark 3.0.0, which is ideal to look into before releasing another bugfix
>>>>> version.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
>>>>> shiva...@eecs.berkeley.edu> wrote:
>>>>> >>
>>>>> >> Hi all
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> I just wanted to ping this thread to see if all the outstanding
>>>>> blockers for 3.0.1 have been fixed. If so, it would be great if we can get
>>>>> the release going. The CRAN team sent us a note that the version SparkR
>>>>> available on CRAN for the current R ver

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Prashant Sharma
Congratulations, all! It's great to have such committed folks as
committers. :)

On Wed, Jul 15, 2020 at 9:24 AM Yi Wu  wrote:

> Congrats!!
>
> On Wed, Jul 15, 2020 at 8:02 AM Hyukjin Kwon  wrote:
>
>> Congrats!
>>
>> On Wed, Jul 15, 2020 at 7:56 AM, Takeshi Yamamuro wrote:
>>
>>> Congrats, all!
>>>
>>> On Wed, Jul 15, 2020 at 5:15 AM Takuya UESHIN 
>>> wrote:
>>>
 Congrats and welcome!

 On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler  wrote:

> Congratulations and welcome!
>
> On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang 
> wrote:
>
>> Welcome, Huaxin, Jungtaek, and Dilip!
>>
>> Congratulations!
>>
>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia <
>> matei.zaha...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> The Spark PMC recently voted to add several new committers. Please
>>> join me in welcoming them to their new roles! The new committers are:
>>>
>>> - Huaxin Gao
>>> - Jungtaek Lim
>>> - Dilip Biswal
>>>
>>> All three of them contributed to Spark 3.0 and we’re excited to have
>>> them join the project.
>>>
>>> Matei and the Spark PMC
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

 --
 Takuya UESHIN


>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-02 Thread Prashant Sharma
Thank you, Holden.

Folks, my health has gone down a bit, so I will start working on this in a
few days. If this needs to be published sooner, then maybe someone else will
have to help out.





On Thu, Jul 2, 2020 at 10:11 AM Holden Karau  wrote:

> I’m happy to have Prashant do 2.4.7 :)
>
> On Wed, Jul 1, 2020 at 9:40 PM Xiao Li  wrote:
>
>> +1 on releasing both 3.0.1 and 2.4.7
>>
>> Great! Three committers volunteer to be a release manager. Ruifeng,
>> Prashant and Holden. Holden just helped release Spark 2.4.6. This time,
>> maybe, Ruifeng and Prashant can be the release manager of 3.0.1 and 2.4.7
>> respectively.
>>
>> Xiao
>>
>> On Wed, Jul 1, 2020 at 2:24 PM Jungtaek Lim 
>> wrote:
>>
>>> https://issues.apache.org/jira/browse/SPARK-32148 was reported
>>> yesterday, and if the report is valid it looks to be a blocker. I'll try to
>>> take a look sooner.
>>>
>>> On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
>>>> Thanks Holden -- it would be great to also get 2.4.7 started
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau 
>>>> wrote:
>>>> >
>>>> > I can take care of 2.4.7 unless someone else wants to do it.
>>>> >
>>>> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore <
>>>> jason.mo...@quantium.com.au> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >>
>>>> >>
>>>> >> Could I get some input on the severity of this one that I found
>>>> yesterday?  If that’s a correctness issue, should it block this patch?  Let
>>>> me know under the ticket if there’s more info that I can provide to help.
>>>> >>
>>>> >>
>>>> >>
>>>> >> https://issues.apache.org/jira/browse/SPARK-32136
>>>> >>
>>>> >>
>>>> >>
>>>> >> Thanks,
>>>> >>
>>>> >> Jason.
>>>> >>
>>>> >>
>>>> >>
>>>> >> From: Jungtaek Lim 
>>>> >> Date: Wednesday, 1 July 2020 at 10:20 am
>>>> >> To: Shivaram Venkataraman 
>>>> >> Cc: Prashant Sharma , 郑瑞峰 <
>>>> ruife...@foxmail.com>, Gengliang Wang ,
>>>> gurwls223 , Dongjoon Hyun ,
>>>> Jules Damji , Holden Karau ,
>>>> Reynold Xin , Yuanjian Li ,
>>>> "dev@spark.apache.org" , Takeshi Yamamuro <
>>>> linguin@gmail.com>
>>>> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
>>>> >>
>>>> >>
>>>> >>
>>>> >> SPARK-32130 [1] looks to be a performance regression introduced in
>>>> Spark 3.0.0, which is ideal to look into before releasing another bugfix
>>>> version.
>>>> >>
>>>> >>
>>>> >>
>>>> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
>>>> shiva...@eecs.berkeley.edu> wrote:
>>>> >>
>>>> >> Hi all
>>>> >>
>>>> >>
>>>> >>
>>>> >> I just wanted to ping this thread to see if all the outstanding
>>>> blockers for 3.0.1 have been fixed. If so, it would be great if we can get
>>>> the release going. The CRAN team sent us a note that the version SparkR
>>>> available on CRAN for the current R version (4.0.2) is broken and hence we
>>>> need to update the package soon --  it will be great to do it with 3.0.1.
>>>> >>
>>>> >>
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Shivaram
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma <
>>>> scrapco...@gmail.com> wrote:
>>>> >>
>>>> >> +1 for 3.0.1 release.
>>>> >>
>>>> >> I too can help out as release manager.
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  w

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-24 Thread Prashant Sharma
+1 for 3.0.1 release.
I too can help out as release manager.

On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:

> I volunteer to be a release manager of 3.0.1, if nobody is working on this.
>
>
> -- Original Message --
> *From:* "Gengliang Wang";
> *Sent:* Wednesday, June 24, 2020, 4:15 PM
> *To:* "Hyukjin Kwon";
> *Cc:* "Dongjoon Hyun";"Jungtaek Lim"<
> kabhwan.opensou...@gmail.com>;"Jules Damji";"Holden
> Karau";"Reynold Xin";"Shivaram
> Venkataraman";"Yuanjian Li"<
> xyliyuanj...@gmail.com>;"Spark dev list";"Takeshi
> Yamamuro";
> *Subject:* Re: [DISCUSS] Apache Spark 3.0.1 Release
>
> +1, the issues mentioned are really serious.
>
> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon  wrote:
>
>> +1.
>>
>> Just as a note,
>> - SPARK-31918 is fixed now, and there's no blocker.
>> - When we build SparkR, we should use the latest R version, at least 4.0.0+.
>>
>> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 +1 on a 3.0.1 soon.

 Probably it would be nice if some Scala experts can take a look at
 https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
 into 3.0.1 if possible.
 Looks like APIs designed to work with Scala 2.11 & Java bring
 ambiguity in Scala 2.12 & Java.

 On Wed, Jun 24, 2020 at 4:52 AM Jules Damji 
 wrote:

> +1 (non-binding)
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jun 23, 2020, at 11:36 AM, Holden Karau 
> wrote:
>
> 
> +1 on a patch release soon
>
> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
> wrote:
>
>> +1 on doing a new patch release soon. I saw some of these issues when
>> preparing the 3.0 release, and some of them are very serious.
>>
>>
>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
>>> soon.
>>>
>>> Shivaram
>>>
>>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro <
>>> linguin@gmail.com> wrote:
>>>
>>> Thanks for the heads-up, Yuanjian!
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>>
>>> wow, the updates are so quick. Anyway, +1 for the release.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
>>> wrote:
>>>
>>> Hi dev-list,
>>>
>>> I’m writing this to raise the discussion about Spark 3.0.1
>>> feasibility since 4 blocker issues were found after Spark 3.0.0:
>>>
>>> [SPARK-31990] The state store compatibility broken will cause a
>>> correctness issue when Streaming query with `dropDuplicate` uses the
>>> checkpoint written by the old Spark version.
>>>
>>> [SPARK-32038] The regression bug in handling NaN values in
>>> COUNT(DISTINCT)
>>>
>>> [SPARK-31918][WIP] CRAN requires to make it working with the latest
>>> R 4.0. It makes the 3.0 release unavailable on CRAN, and only supports R
>>> [3.5, 4.0)
>>>
>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time
>>> regression
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>> I think it would be great if we have Spark 3.0.1 to deliver the critical
>>> fixes.
>>>
>>> Any comments are appreciated.
>>>
>>> Best,
>>>
>>> Yuanjian
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>


Re: [vote] Apache Spark 3.0 RC3

2020-06-06 Thread Prashant Sharma
+1

On Sun, Jun 7, 2020 at 1:50 AM Reynold Xin  wrote:

> Apologies for the mistake. The vote is open till 11:59pm Pacific time on
> Mon June 9th.
>
> On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.0.
>>
>> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes
>> are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.0.0-rc3 (commit
>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>>
>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>
>> This release is using the release script of the tag v3.0.0-rc3.
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env and install
>> the current RC and see if anything important breaks. For Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.0.0?
>> ===
>>
>> The current list of open tickets targeted at 3.0.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>


Re: [VOTE] Release Spark 2.4.6 (RC8)

2020-06-01 Thread Prashant Sharma
+1

Thanks!

On Mon, Jun 1, 2020 at 10:50 AM Holden Karau  wrote:

> Yes thats correct, the release script needs a bit of work and it's
> diverged a bit from 3.0 as well. I'll follow up with some more PRs in
> addition to the current one I have.
>
> On Sun, May 31, 2020 at 10:08 PM Sean Owen  wrote:
>
>> I suspect there were some problems with the release script to fix.
>>
>> +1 from me, same as last time. This still appears to be OK in licenses
>> and sigs, and source compiles and passes tests.
>>
>> On Sun, May 31, 2020 at 11:23 PM Wenchen Fan  wrote:
>>
>>> +1 (binding), although I don't know why we jump from RC 3 to RC 8...
>>>
>>> On Mon, Jun 1, 2020 at 7:47 AM Holden Karau 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.4.6.

 The vote is open until June 5th at 9AM PST and passes if a majority +1
 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 2.4.6
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 There are currently no issues targeting 2.4.6 (try project = SPARK AND
 "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))

 The tag to be voted on is v2.4.6-rc8 (commit
 807e0a484d1de767d1f02bd8a622da6450bdf940):
 https://github.com/apache/spark/tree/v2.4.6-rc8

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1349/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc8-docs/

 The list of bug fixes going into 2.4.6 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12346781

 This release is using the release script of the tag v2.4.6-rc8.

 FAQ

 =
 What happened to the other RCs?
 =

 The parallel maven build caused some flakiness so I wasn't comfortable
 releasing them. I backported the fix from the 3.0 branch for this release.
 I've got a proposed change to the build script so that, in the future, we
 only push tags once the build is a success, but it does not block this
 release.

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala
 you can add the staging repository to your projects resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.4.6?
 ===

 The current list of open tickets targeted at 2.4.6 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 2.4.6

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted please ping me or a committer to
 help target the issue.


 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-17 Thread Prashant Sharma
+1,

Looks good. Thank you for putting this together.

On Sat, May 16, 2020 at 10:38 AM Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.6.
>
> The vote is open until May 22nd at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.6
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.6 (try project = SPARK AND
> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In
> Progress"))
>
> The tag to be voted on is v2.4.6-rc3 (commit
> 570848da7c48ba0cb827ada997e51677ff672a39):
> https://github.com/apache/spark/tree/v2.4.6-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1344/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc3-docs/
>
> The list of bug fixes going into 2.4.6 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12346781
>
> This release is using the release script of the tag v2.4.6-rc3.
>
> FAQ
>
> =
> What happened to RC2?
> =
>
> My computer crashed part of the way through RC2, so I rolled RC3.
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.6?
> ===
>
> The current list of open tickets targeted at 2.4.6 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.6
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-12 Thread Prashant Sharma
Hi Holden,

I am +1 on this release; the fix for SPARK-31663 can make it into the next
release as well.

Thanks,

On Tue, May 12, 2020 at 8:09 PM Holden Karau  wrote:

> Thanks. The 2.4.6 RC1 vote fails because we don’t have enough binding +1s,
> I’ll start a new RC once 31663 is merged or next week whichever is first.
>
> On Tue, May 12, 2020 at 7:28 AM Yuanjian Li 
> wrote:
>
>> Thanks Holden and Dongjoon for the help!
>> The bugfix for SPARK-31663 is ready for review, hope it can be picked up
>> in 2.4.7 if possible.
>> https://github.com/apache/spark/pull/28501
>>
>> Best,
>> Yuanjian
>>
>> Takeshi Yamamuro  于2020年5月11日周一 上午9:03写道:
>>
>>> I checked on my MacOS env; all the tests
>>> with `-Pyarn -Phadoop-2.7 -Pdocker-integration-tests -Phive
>>> -Phive-thriftserver -Pmesos -Pkubernetes -Psparkr`
>>> passed and I couldn't find any issue;
>>>
>>> maropu@~:$java -version
>>> java version "1.8.0_181"
>>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>>
>>> Bests,
>>> Takeshi
>>>
>>>
>>> On Sun, May 10, 2020 at 2:50 AM Holden Karau 
>>> wrote:
>>>
 Thanks Dongjoon :)
 So it’s not a regression, but if it won’t be a large delay I think
 holding for the correctness fix would be good (and we can pick up the two
 issues fixed in 2.4.7). What does everyone think?

 On Fri, May 8, 2020 at 11:40 AM Dongjoon Hyun 
 wrote:

> I confirmed and update the JIRA. SPARK-31663 is a correctness issue
> since Apache Spark 2.4.0.
>
> Bests,
> Dongjoon.
>
> On Fri, May 8, 2020 at 10:26 AM Holden Karau 
> wrote:
>
>> Can you provide a bit more context (is it a regression?)
>>
>> On Fri, May 8, 2020 at 9:33 AM Yuanjian Li 
>> wrote:
>>
>>> Hi Holden,
>>>
>>> I'm working on the bugfix of SPARK-31663
>>> , let me post it
>>> here since it's a correctness bug and also affects 2.4.6.
>>>
>>> Best,
>>> Yuanjian
>>>
>>> Sean Owen  于2020年5月8日周五 下午11:42写道:
>>>
 +1 from me. The usual: sigs OK, license looks as intended, tests
 pass
 from a source build for me.

 On Thu, May 7, 2020 at 1:29 PM Holden Karau 
 wrote:
 >
 > Please vote on releasing the following candidate as Apache Spark
 version 2.4.6.
 >
 > The vote is open until February 5th 11PM PST and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 2.4.6
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see
 http://spark.apache.org/
 >
 > There are currently no issues targeting 2.4.6 (try project =
 SPARK AND "Target Version/s" = "2.4.6" AND status in (Open, Reopened, 
 "In
 Progress"))
 >
 > We _may_ want to hold the 2.4.6 release for something targetted
 to 2.4.7 ( project = SPARK AND "Target Version/s" = "2.4.7") , 
 currently,
 SPARK-24266 & SPARK-26908 and I believe there is some discussion on if 
 we
 should include SPARK-31399 in this release.
 >
 > The tag to be voted on is v2.4.5-rc2 (commit
 a3cffc997035d11e1f6c092c1186e943f2f63544):
 > https://github.com/apache/spark/tree/v2.4.6-rc1
 >
 > The release files, including signatures, digests, etc. can be
 found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1340/
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
 >
 > The list of bug fixes going into 2.4.6 can be found at the
 following URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12346781
 >
 > This release is using the release script of the tag v2.4.6-rc1.
 >
 > FAQ
 >
 > =
 > How can I help test this release?
 > =
 >
 > If you are a Spark user, you can help us test this release by
 taking
 > an existing Spark workload and running on this release candidate,
 then
 > reporting any regressions.
 >
 > If you're working in PySpark you can set up a virtual env and
 install
 > the current RC and see if anything important breaks, in the
 

[DISCUSS][K8s] Copy files securely to the pods or containers.

2020-03-27 Thread Prashant Sharma
Hello All,
The issue SPARK-23153 lets us copy any file to the pod/container by first
copying it to a Hadoop-supported filesystem, e.g. HDFS, S3, COS, etc. This is
especially useful if the files have to be copied to a large number of
pods/nodes. However, in most cases we only need the file copied to the driver,
and it may not always be convenient (especially for clusters with a small
number of nodes or limited resources) to set up an additional intermediate
store just for this; the current approach cannot work without an intermediate
distributed storage of some sort.
While going through the code of the kubectl cp command, it appears that we can
use the same technique, i.e.
tar cf - /tmp/foo | kubectl exec -i -n <namespace> <pod> -- tar xf - -C /tmp/bar
to copy files in a more secure way, because the stream goes through the
Kubernetes API, which has its own security in place. This also lets us
compress the file while sending.

If there is any interest in this sort of feature, I am ready to open an issue
and work on it. So let us discuss whether this has already been explored and
whether there are known issues with this approach.

Thank you,
Prashant.
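
For illustration, a minimal sketch of the copy path described above, shelling
out to kubectl from Scala via scala.sys.process; the namespace, pod name, and
paths are placeholders and not part of SPARK-23153:

import scala.sys.process._

// Copy a local path into a pod by piping a tar stream through `kubectl exec`.
// Assumes kubectl is on the PATH and already authenticated against the cluster.
def copyToPod(localPath: String, namespace: String, pod: String, remoteDir: String): Int = {
  // Equivalent of: tar cf - <localPath> | kubectl exec -i -n <ns> <pod> -- tar xf - -C <remoteDir>
  val pack   = Seq("tar", "cf", "-", localPath)
  val unpack = Seq("kubectl", "exec", "-i", "-n", namespace, pod, "--",
                   "tar", "xf", "-", "-C", remoteDir)
  (pack #| unpack).!  // run the pipeline and return its exit code
}

// e.g. copyToPod("/tmp/foo", "spark-ns", "spark-driver-pod", "/tmp/bar")

Whether an eventual implementation shells out to kubectl or streams through the
fabric8 Kubernetes client already used by the K8s backend is one of the design
questions such an issue would need to settle.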


Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-05 Thread Prashant Sharma
It was by design: one could run multiple workers on a laptop to try out or
test Spark in distributed mode, launching several workers to see how resource
offers and requirements work. That said, I have not commonly seen starting
multiple workers on the same node as a practice so far.

Why do we treat two workers on the same node as a special scheduling case,
compared with two workers on different nodes? Possibly to optimize network I/O
and disk I/O?

On Tue, Mar 3, 2020 at 12:45 AM Xingbo Jiang  wrote:

> Thanks Sean for your input. I really think it could simplify the Spark
> Standalone backend a lot to allow only a single worker on the same host, and
> I can confirm this deploy model can satisfy all the workloads deployed on
> the Standalone backend AFAIK.
>
> Regarding the case of multiple distinct Spark clusters each running a worker
> on one machine, I'm not sure whether that's something we have claimed to
> support. Could someone with more context on this scenario share their use
> case?
>
> Cheers,
>
> Xingbo
>
> On Fri, Feb 28, 2020 at 11:29 AM Sean Owen  wrote:
>
>> I'll admit, I didn't know you could deploy multiple workers per
>> machine. I agree, I don't see the use case for it? multiple executors,
>> yes of course. And I guess you could imagine multiple distinct Spark
>> clusters running a worker on one machine. I don't have an informed
>> opinion therefore, but agree that it seems like a best practice enough
>> to enforce 1 worker per machine, if it makes things simpler rather
>> than harder.
>>
>> On Fri, Feb 28, 2020 at 1:21 PM Xingbo Jiang 
>> wrote:
>> >
>> > Hi all,
>> >
>> > Based on my experience, there is no scenario that necessarily requires
>> deploying multiple Workers on the same node with Standalone backend. A
>> worker should book all the resources reserved to Spark on the host it is
>> launched, then it can allocate those resources to one or more executors
>> launched by this worker. Since each executor runs in a separated JVM, we
>> can limit the memory of each executor to avoid long GC pause.
>> >
>> > The remaining concern is the local-cluster mode is implemented by
>> launching multiple workers on the local host, we might need to re-implement
>> LocalSparkCluster to launch only one Worker and multiple executors. It
>> should be fine because local-cluster mode is only used in running Spark
>> unit test cases, thus end users should not be affected by this change.
>> >
>> > Removing multiple workers on the same host support could simplify the
>> deploy model of Standalone backend, and also reduce the burden to support
>> legacy deploy pattern in the future feature developments. (There is an
>> example in https://issues.apache.org/jira/browse/SPARK-27371 , where we
>> designed a complex approach to coordinate resource requirements from
>> different workers launched on the same host).
>> >
>> > The proposal is to update the document to deprecate the support of
>> system environment `SPARK_WORKER_INSTANCES` in Spark 3.0, and remove the
>> support in the next major version (Spark 3.1).
>> >
>> > Please kindly let me know if you have use cases relying on this feature.
>> >
>> > Thanks!
>> >
>> > Xingbo
>>
>
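
For reference, the local-cluster mode mentioned above is selected purely
through the master URL used by Spark's own test suites, so a re-implemented
LocalSparkCluster would not have to change callers that look roughly like the
following sketch (the app name and cluster sizing here are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB] starts an
// in-process standalone cluster; it is intended only for Spark's unit tests.
val conf = new SparkConf()
  .setMaster("local-cluster[2,1,1024]")
  .setAppName("local-cluster-sketch")

val sc = new SparkContext(conf)
try {
  // a trivial job that exercises both workers
  println(sc.parallelize(1 to 1000, 4).map(_ % 10).countByValue())
} finally {
  sc.stop()
}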


Re: [DISCUSS] Shall we mark spark streaming component as deprecated.

2020-03-02 Thread Prashant Sharma
I may have speculated, or believed unauthorised sources; nevertheless, I am
happy to be corrected.

On Mon, Mar 2, 2020 at 8:05 PM Sean Owen  wrote:

> Er, who says it's deprecated? I have never heard anything like that.
> Why would it be?
>
> On Mon, Mar 2, 2020 at 4:52 AM Prashant Sharma 
> wrote:
> >
> > Hi All,
> >
> > It is noticed that some of the users of Spark streaming do not
> immediately realise that it is a deprecated component and it would be
> scary, if they end up with it in production. Now that we are in a position
> to release about Spark 3.0.0, may be we should discuss - should the spark
> streaming carry an explicit notice? That it is deprecated and not under
> active development.
> >
> > I have opened an issue already, but I think a mailing list discussion
> would be more appropriate.
> https://issues.apache.org/jira/browse/SPARK-31006
> >
> > Thanks,
> > Prashant.
> >
>


[DISCUSS] Shall we mark spark streaming component as deprecated.

2020-03-02 Thread Prashant Sharma
Hi All,

It has been noticed that some users of Spark Streaming do not immediately
realise that it is a deprecated component, and it would be scary if they end
up with it in production. Now that we are about to release Spark 3.0.0, maybe
we should discuss: should Spark Streaming carry an explicit notice that it is
deprecated and not under active development?

I have opened an issue already, but I think a mailing list discussion would
be more appropriate. https://issues.apache.org/jira/browse/SPARK-31006

Thanks,
Prashant.


Re: Kafka Spark structured streaming latency benchmark.

2017-01-02 Thread Prashant Sharma
This issue was fixed in https://issues.apache.org/jira/browse/SPARK-18991.

--Prashant


On Tue, Dec 20, 2016 at 6:16 PM, Prashant Sharma <scrapco...@gmail.com>
wrote:

> Hi Shixiong,
>
> Thanks for taking a look, I am trying to run and see if making
> ContextCleaner run more frequently and/or making it non blocking will help.
>
> --Prashant
>
>
> On Tue, Dec 20, 2016 at 4:05 AM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Hey Prashant. Thanks for your codes. I did some investigation and it
>> turned out that ContextCleaner is too slow and its "referenceQueue" keeps
>> growing. My hunch is cleaning broadcast is very slow since it's a blocking
>> call.
>>
>> On Mon, Dec 19, 2016 at 12:50 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> Hey, Prashant. Could you track the GC root of byte arrays in the heap?
>>>
>>> On Sat, Dec 17, 2016 at 10:04 PM, Prashant Sharma <scrapco...@gmail.com>
>>> wrote:
>>>
>>>> Furthermore, I ran the same thing with 26 GB as the memory, which would
>>>> mean 1.3GB per thread of memory. My jmap
>>>> <https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jmap-histo>
>>>> results and jstat
>>>> <https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jstat>
>>>> results collected after running the job for more than 11h, again show a
>>>> memory constraint. The same gradual slowdown, but a bit more gradual as
>>>> memory is considerably more than the previous runs.
>>>>
>>>>
>>>>
>>>>
>>>> This situation sounds like a memory leak ? As the byte array objects
>>>> are more than 13GB, and are not garbage collected.
>>>>
>>>> --Prashant
>>>>
>>>>
>>>> On Sun, Dec 18, 2016 at 8:49 AM, Prashant Sharma <scrapco...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Goal of my benchmark is to arrive at end to end latency lower than
>>>>> 100ms and sustain them over time, by consuming from a kafka topic and
>>>>> writing back to another kafka topic using Spark. Since the job does not do
>>>>> aggregation and does a constant time processing on each message, it
>>>>> appeared to me as an achievable target. But, then there are some 
>>>>> surprising
>>>>> and interesting pattern to observe.
>>>>>
>>>>>  Basically, it has four components namely,
>>>>> 1) kafka
>>>>> 2) Long running kafka producer, rate limited to 1000 msgs/sec, with
>>>>> each message of about 1KB.
>>>>> 3) Spark  job subscribed to `test` topic and writes out to another
>>>>> topic `output`.
>>>>> 4) A Kafka consumer, reading from the `output` topic.
>>>>>
>>>>> How the latency was measured ?
>>>>>
>>>>> While sending messages from kafka producer, each message is embedded
>>>>> the timestamp at which it is pushed to the kafka `test` topic. Spark
>>>>> receives each message and writes them out to `output` topic as is. When
>>>>> these messages arrive at Kafka consumer, their embedded time is subtracted
>>>>> from the time of arrival at the consumer and a scatter plot of the same is
>>>>> attached.
>>>>>
>>>>> The scatter plots sample only 10 minutes of data received during
>>>>> initial one hour and then again 10 minutes of data received after 2 hours
>>>>> of run.
>>>>>
>>>>>
>>>>>
>>>>> These plots indicate a significant slowdown in latency, in the later
>>>>> scatter plot indicate almost all the messages were received with a delay
>>>>> larger than 2 seconds. However, first plot show that most messages arrived
>>>>> in less than 100ms latency. The two samples were taken with time 
>>>>> difference
>>>>> of 2 hours approx.
>>>>>
>>>>> After running the test for 24 hours, the jstat
>>>>> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jstat_output.txt>
>>>>> and jmap
>>>>> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jmap_output.txt>
>>>>>  output
>>>>> for the jobs indicate possibility  of memory constrains. To be more clear,
>>>>> job was run with local[20] and memory of 5GB(spark.driver.memory). The job
>>>>> is straight forward and located here: https://github.com/ScrapCodes/
>>>>> KafkaProducer/blob/master/src/main/scala/com/github/scrapcod
>>>>> es/kafka/SparkSQLKafkaConsumer.scala .
>>>>>
>>>>>
>>>>> What is causing the gradual slowdown? I need help in diagnosing the
>>>>> problem.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --Prashant
>>>>>
>>>>>
>>>>
>>>
>>
>
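
For anyone reproducing this benchmark, a minimal sketch of the measurement
described above: the producer embeds a send timestamp in each message and the
final consumer subtracts it from the arrival time. The broker address, topic
name, and message layout below are assumptions, not the exact code linked in
the thread.

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
props.put("group.id", "latency-measure")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("output"))  // topic written by the Spark job

while (true) {
  // assumes each message value starts with the epoch-millis send timestamp
  for (record <- consumer.poll(1000L).asScala) {
    val sentAt = record.value.takeWhile(_.isDigit).toLong
    val latencyMs = System.currentTimeMillis() - sentAt
    println(s"end-to-end latency: $latencyMs ms")
  }
}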


Re: Task not Serializable Exception

2017-01-02 Thread Prashant Sharma
Can you share a minimal code snippet that reproduces this
`NotSerializableException`?

Thanks,


-
Prashant Sharma
Spark Technology Center
http://www.spark.tc/
--


On Sun, Jan 1, 2017 at 9:36 AM, khyati <khyati.s...@guavus.com> wrote:

> Getting error for the following code snippet:
>
> object SparkTaskTry extends Logging {
>   63   /**
>   64* Extends the normal Try constructor to allow TaskKilledExceptions
> to propagate
>   65*/
>   66   def apply[T](r: => T): Try[T] =
>   67 try scala.util.Success(r) catch {
>   68   case e: TaskKilledException => throw e
>   69   case NonFatal(e) =>
>   70 logInfo("Caught and Ignored Exception: " + e.toString)
>   71 e.printStackTrace()
>   72 Failure(e)
>   73 }
>   74 }
>
> override def buildScan(
>  349   requiredColumns: Array[String],
>  350   filters: Array[Filter],
>  351   inputFiles: Array[FileStatus],
>  352   broadcastedConf: Broadcast[SerializableConfiguration]):
> RDD[Row]
> = {
>  353 val useMetadataCache =
> sqlContext.getConf(SQLConf.PARQUET_CACHE_METADATA)
>  354 val parquetFilterPushDown = sqlContext.conf.parquetFilterPushDown
>  355 val assumeBinaryIsString = sqlContext.conf.
> isParquetBinaryAsString
>  356 val assumeInt96IsTimestamp =
> sqlContext.conf.isParquetINT96AsTimestamp
>  357 val followParquetFormatSpec =
> sqlContext.conf.followParquetFormatSpec
>  358
>  359 // When merging schemas is enabled and the column of the given
> filter does not exist,
>  360 // Parquet emits an exception which is an issue of Parquet
> (PARQUET-389).
>  361 val safeParquetFilterPushDown = !shouldMergeSchemas &&
> parquetFilterPushDown
>  362
>  363 // Parquet row group size. We will use this value as the value for
>  364 // mapreduce.input.fileinputformat.split.minsize and
> mapred.min.split.size if the value
>  365 // of these flags are smaller than the parquet row group size.
>  366 val parquetBlockSize =
> ParquetOutputFormat.getLongBlockSize(broadcastedConf.value.value)
>  367
>  368 // Create the function to set variable Parquet confs at both
> driver
> and executor side.
>  369 val initLocalJobFuncOpt =
>  370   ParquetRelation.initializeLocalJobFunc(
>  371 requiredColumns,
>  372 filters,
>  373 dataSchema,
>  374 parquetBlockSize,
>  375 useMetadataCache,
>  376 safeParquetFilterPushDown,
>  377 assumeBinaryIsString,
>  378 assumeInt96IsTimestamp,
>  379 followParquetFormatSpec) _
>  380
>  381 // Create the function to set input paths at the driver side.
>  382 val setInputPaths =
>  383   ParquetRelation.initializeDriverSideJobFunc(inputFiles,
> parquetBlockSize) _
>  384
>  385 Utils.withDummyCallSite(sqlContext.sparkContext) {
>  386   new RDD[Try[InternalRow]](sqlContext.sparkContext, Nil) with
> Logging {
>  387
>  388 override def getPartitions: Array[SparkPartition] =
> internalRDD.getPartitions
>  389
>  390 override def getPreferredLocations(split: SparkPartition):
> Seq[String] =
>  391   internalRDD.getPreferredLocations(split)
>  392
>  393 override def checkpoint() {
>  394   // Do nothing. Hadoop RDD should not be checkpointed.
>  395 }
>  396
>  397 override def persist(storageLevel: StorageLevel): this.type =
> {
>  398   super.persist(storageLevel)
>  399 }
>  400
>  401 val internalRDD: SqlNewHadoopRDD[InternalRow] = new
> SqlNewHadoopRDD(
>  402 sc = sqlContext.sparkContext,
>  403 broadcastedConf = broadcastedConf,
>  404 initDriverSideJobFuncOpt = Some(setInputPaths),
>  405 initLocalJobFuncOpt = Some(initLocalJobFuncOpt),
>  406 inputFormatClass = if (isSplittable) {
>  407   classOf[ParquetInputFormat[InternalRow]]
>  408 } else {
>  409   classOf[ParquetRowInputFormatIndivisible]
>  410 },
>  411 valueClass = classOf[InternalRow]) {
>  412
>  413 val cacheMetadata = useMetadataCache
>  414
>  415 @transient val cachedStatuses = inputFiles.map { f =>
>  416   // In order to encode the authority of a Path containing
> special characters such as '/'
>  417   // (which does happen in some S3N credentials), we need to
> use the string returned by the
>  418   // URI of the path to create a new Path.
>  419   val pathWithEscapedAuthority = escapePathUserInfo(f.getPath)
>  420   new FileStatus(
>  421   
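
As an illustration of the kind of minimized snippet being asked for, the
classic way to hit this exception is a closure that captures a
non-serializable object from the enclosing scope (a hypothetical repro,
unrelated to the Parquet code above):

import org.apache.spark.{SparkConf, SparkContext}

class Helper {                    // note: does not extend Serializable
  def multiplier: Int = 10
}

object Repro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("repro"))
    val helper = new Helper
    // The map closure captures `helper`, which cannot be serialized, so the
    // job fails with "Task not serializable", caused by a
    // java.io.NotSerializableException for Helper.
    val result = sc.parallelize(1 to 5).map(_ * helper.multiplier).collect()
    println(result.mkString(","))
    sc.stop()
  }
}

Making Helper extend Serializable, or constructing it inside the closure,
makes the failure go away; a repro of that size is far easier to reason about
than the full buildScan above.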

Re: Kafka Spark structured streaming latency benchmark.

2016-12-20 Thread Prashant Sharma
Hi Shixiong,

Thanks for taking a look. I am trying to run and see if making ContextCleaner
run more frequently and/or making it non-blocking will help.

--Prashant


On Tue, Dec 20, 2016 at 4:05 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com
> wrote:

> Hey Prashant. Thanks for your codes. I did some investigation and it
> turned out that ContextCleaner is too slow and its "referenceQueue" keeps
> growing. My hunch is cleaning broadcast is very slow since it's a blocking
> call.
>
> On Mon, Dec 19, 2016 at 12:50 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Hey, Prashant. Could you track the GC root of byte arrays in the heap?
>>
>> On Sat, Dec 17, 2016 at 10:04 PM, Prashant Sharma <scrapco...@gmail.com>
>> wrote:
>>
>>> Furthermore, I ran the same thing with 26 GB as the memory, which would
>>> mean 1.3GB per thread of memory. My jmap
>>> <https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jmap-histo>
>>> results and jstat
>>> <https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jstat>
>>> results collected after running the job for more than 11h, again show a
>>> memory constraint. The same gradual slowdown, but a bit more gradual as
>>> memory is considerably more than the previous runs.
>>>
>>>
>>>
>>>
>>> This situation sounds like a memory leak ? As the byte array objects are
>>> more than 13GB, and are not garbage collected.
>>>
>>> --Prashant
>>>
>>>
>>> On Sun, Dec 18, 2016 at 8:49 AM, Prashant Sharma <scrapco...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Goal of my benchmark is to arrive at end to end latency lower than
>>>> 100ms and sustain them over time, by consuming from a kafka topic and
>>>> writing back to another kafka topic using Spark. Since the job does not do
>>>> aggregation and does a constant time processing on each message, it
>>>> appeared to me as an achievable target. But, then there are some surprising
>>>> and interesting pattern to observe.
>>>>
>>>>  Basically, it has four components namely,
>>>> 1) kafka
>>>> 2) Long running kafka producer, rate limited to 1000 msgs/sec, with
>>>> each message of about 1KB.
>>>> 3) Spark  job subscribed to `test` topic and writes out to another
>>>> topic `output`.
>>>> 4) A Kafka consumer, reading from the `output` topic.
>>>>
>>>> How the latency was measured ?
>>>>
>>>> While sending messages from kafka producer, each message is embedded
>>>> the timestamp at which it is pushed to the kafka `test` topic. Spark
>>>> receives each message and writes them out to `output` topic as is. When
>>>> these messages arrive at Kafka consumer, their embedded time is subtracted
>>>> from the time of arrival at the consumer and a scatter plot of the same is
>>>> attached.
>>>>
>>>> The scatter plots sample only 10 minutes of data received during
>>>> initial one hour and then again 10 minutes of data received after 2 hours
>>>> of run.
>>>>
>>>>
>>>>
>>>> These plots indicate a significant slowdown in latency, in the later
>>>> scatter plot indicate almost all the messages were received with a delay
>>>> larger than 2 seconds. However, first plot show that most messages arrived
>>>> in less than 100ms latency. The two samples were taken with time difference
>>>> of 2 hours approx.
>>>>
>>>> After running the test for 24 hours, the jstat
>>>> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jstat_output.txt>
>>>> and jmap
>>>> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jmap_output.txt>
>>>>  output
>>>> for the jobs indicate possibility  of memory constrains. To be more clear,
>>>> job was run with local[20] and memory of 5GB(spark.driver.memory). The job
>>>> is straight forward and located here: https://github.com/ScrapCodes/
>>>> KafkaProducer/blob/master/src/main/scala/com/github/scrapcod
>>>> es/kafka/SparkSQLKafkaConsumer.scala .
>>>>
>>>>
>>>> What is causing the gradual slowdown? I need help in diagnosing the
>>>> problem.
>>>>
>>>> Thanks,
>>>>
>>>> --Prashant
>>>>
>>>>
>>>
>>
>


Re: Kafka Spark structured streaming latency benchmark.

2016-12-17 Thread Prashant Sharma
Furthermore, I ran the same thing with 26 GB of memory, which works out to
about 1.3 GB per thread. My jmap
<https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jmap-histo>
results and jstat
<https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jstat>
results, collected after running the job for more than 11 hours, again show a
memory constraint: the same gradual slowdown, just a bit more gradual since
the memory is considerably larger than in the previous runs.

This situation sounds like a memory leak, as the byte array objects amount to
more than 13 GB and are not garbage collected.

--Prashant


On Sun, Dec 18, 2016 at 8:49 AM, Prashant Sharma <scrapco...@gmail.com>
wrote:

> Hi,
>
> Goal of my benchmark is to arrive at end to end latency lower than 100ms
> and sustain them over time, by consuming from a kafka topic and writing
> back to another kafka topic using Spark. Since the job does not do
> aggregation and does a constant time processing on each message, it
> appeared to me as an achievable target. But, then there are some surprising
> and interesting pattern to observe.
>
>  Basically, it has four components namely,
> 1) kafka
> 2) Long running kafka producer, rate limited to 1000 msgs/sec, with each
> message of about 1KB.
> 3) Spark  job subscribed to `test` topic and writes out to another topic
> `output`.
> 4) A Kafka consumer, reading from the `output` topic.
>
> How the latency was measured ?
>
> While sending messages from kafka producer, each message is embedded the
> timestamp at which it is pushed to the kafka `test` topic. Spark receives
> each message and writes them out to `output` topic as is. When these
> messages arrive at Kafka consumer, their embedded time is subtracted from
> the time of arrival at the consumer and a scatter plot of the same is
> attached.
>
> The scatter plots sample only 10 minutes of data received during initial
> one hour and then again 10 minutes of data received after 2 hours of run.
>
>
>
> These plots indicate a significant slowdown in latency, in the later
> scatter plot indicate almost all the messages were received with a delay
> larger than 2 seconds. However, first plot show that most messages arrived
> in less than 100ms latency. The two samples were taken with time difference
> of 2 hours approx.
>
> After running the test for 24 hours, the jstat
> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jstat_output.txt>
> and jmap
> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jmap_output.txt>
>  output
> for the jobs indicate possibility  of memory constrains. To be more clear,
> job was run with local[20] and memory of 5GB(spark.driver.memory). The job
> is straight forward and located here: https://github.com/ScrapCodes/
> KafkaProducer/blob/master/src/main/scala/com/github/scrapcod
> es/kafka/SparkSQLKafkaConsumer.scala .
>
>
> What is causing the gradual slowdown? I need help in diagnosing the
> problem.
>
> Thanks,
>
> --Prashant
>
>


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-08 Thread Prashant Sharma
I am getting a 404 for the link
https://repository.apache.org/content/repositories/orgapachespark-1217.

--Prashant


On Fri, Dec 9, 2016 at 10:43 AM, Michael Allman 
wrote:

> I believe https://github.com/apache/spark/pull/16122 needs to be included
> in Spark 2.1. It's a simple bug fix to some functionality that is
> introduced in 2.1. Unfortunately, it's been manually verified only. There's
> no unit test that covers it, and building one is far from trivial.
>
> Michael
>
>
>
>
> On Dec 8, 2016, at 12:39 AM, Reynold Xin  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc2 (080717497365b83bc202ab16812ced
> 93eb1ea7bd)
>
> List of JIRA tickets resolved are:  https://issues.apache.
> org/jira/issues/?jql=project%20%3D%20SPARK%20AND%
> 20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1217
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>
>
> (Note that the docs and staging repo are still being uploaded and will be
> available soon)
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
>
>


Re: Scala 2.11 default build

2016-02-04 Thread Prashant Sharma
Yes, that should be changed to 2.11.7. Mind sending a patch?

Prashant Sharma



On Thu, Feb 4, 2016 at 2:11 PM, zzc <441586...@qq.com> wrote:

> hi, rxin, in pom.xml file, 'scala.version' still is 2.10.5, does  it need
> to
> be modified to 2.11.7?
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-2-11-default-build-tp16157p16207.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: A proposal for Spark 2.0

2015-11-15 Thread Prashant Sharma
Hey Matei,


> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.


Our REPL-specific changes were merged into scala/scala and are available as
part of 2.11.7, and will hopefully be part of 2.12 too. If I am not wrong, the
REPL stuff is taken care of; we don't need to keep upgrading the REPL code for
every Scala release now. http://www.scala-lang.org/news/2.11.7

I am +1 on the proposal for Spark 2.0.

Thanks,


Prashant Sharma



On Thu, Nov 12, 2015 at 3:02 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
> Matei
>
> > On Nov 11, 2015, at 4:58 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> > On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin <r...@databricks.com>
> wrote:
> >> to the Spark community. A major release should not be very different
> from a
> >> minor release and should not be gated based on new features. The main
> >> purpose of a major release is an opportunity to fix things that are
> broken
> >> in the current API and remove certain deprecated APIs (examples follow).
> >
> > Agree with this stance. Generally, a major release might also be a
> > time to replace some big old API or implementation with a new one, but
> > I don't see obvious candidates.
> >
> > I wouldn't mind turning attention to 2.x sooner than later, unless
> > there's a fairly good reason to continue adding features in 1.x to a
> > 1.7 release. The scope as of 1.6 is already pretty darned big.
> >
> >
> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but
> >> it has been end-of-life.
> >
> > By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
> > be quite stable, and 2.10 will have been EOL for a while. I'd propose
> > dropping 2.10. Otherwise it's supported for 2 more years.
> >
> >
> >> 2. Remove Hadoop 1 support.
> >
> > I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
> > sort of 'alpha' and 'beta' releases) and even <2.6.
> >
> > I'm sure we'll think of a number of other small things -- shading a
> > bunch of stuff? reviewing and updating dependencies in light of
> > simpler, more recent dependencies to support from Hadoop etc?
> >
> > Farming out Tachyon to a module? (I felt like someone proposed this?)
> > Pop out any Docker stuff to another repo?
> > Continue that same effort for EC2?
> > Farming out some of the "external" integrations to another repo (?
> > controversial)
> >
> > See also anything marked version "2+" in JIRA.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-09 Thread Prashant Sharma
That is correct! I have thought about this a lot. The only solution is to
implement a "real" cross-build for both versions. I am going to think more on
this. :)

Prashant Sharma



On Sat, Oct 10, 2015 at 2:04 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> I would push back slightly. The reason we have the PR builds taking so
> long is death by a million small things that we add. Doing a full 2.11
> compile is order minutes... it's a nontrivial increase to the build times.
>
> It doesn't seem that bad to me to go back post-hoc once in a while and fix
> 2.11 bugs when they come up. It's on the order of once or twice per release
> and the typesafe guys keep a close eye on it (thanks!). Compare that to
> literally thousands of PR runs and a few minutes every time, IMO it's not
> worth it.
>
> On Fri, Oct 9, 2015 at 3:31 PM, Hari Shreedharan <
> hshreedha...@cloudera.com> wrote:
>
>> +1, much better than having a new PR each time to fix something for
>> scala-2.11 every time a patch breaks it.
>>
>> Thanks,
>> Hari Shreedharan
>>
>>
>>
>>
>> On Oct 9, 2015, at 11:47 AM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>> How about just fixing the warning? I get it; it doesn't stop this from
>>> happening again, but still seems less drastic than tossing out the
>>> whole mechanism.
>>>
>>
>> +1
>>
>> It also does not seem that expensive to test only compilation for Scala
>> 2.11 on PR builds.
>>
>>
>>
>


An alternate UI for Spark.

2015-09-14 Thread Prashant Sharma
Hi all,

TLDR;
Some of my colleagues at Imaginea are interested in building an alternate UI
for Spark, and more generally in allowing people or groups to build alternate
UIs for Spark.

More details:
Looking at feasibility, it definitely feels possible to do, but we need
consensus on a public (possibly experimental at first) interface that would
give access to the UI in core. Once that is done, their job will be easy.

In fact, it opens up a lot of possibilities for alternate UIs for Apache
Spark. We are also considering a pluggable UI, where an alternate UI can just
be a plugin. Of course, implementing the latter can be a long-term goal;
Elasticsearch is a good example of that approach.

My knowledge on this is certainly limited. Comments and criticism are
appreciated.

Thanks,
Prashant


Re: Make off-heap store pluggable

2015-07-20 Thread Prashant Sharma
+1. Looks like a nice idea (I do not see any harm). Would you like to work on
the patch to support it?

Prashant Sharma



On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk 
alexey.goncha...@gmail.com wrote:

 Hello Spark community,

 I was looking through the code in order to understand better how RDD is
 persisted to Tachyon off-heap filesystem. It looks like that the Tachyon
 filesystem is hard-coded and there is no way to switch to another in-memory
 filesystem. I think it would be great if the implementation of the
 BlockManager and BlockStore would be able to plug in another filesystem.

 For example, Apache Ignite also has an implementation of in-memory
 filesystem which can store data in on-heap and off-heap formats. It would
 be great if it could integrate with Spark.

 I have filed a ticket in Jira:
 https://issues.apache.org/jira/browse/SPARK-9203

 If it makes sense, I will be happy to contribute to it.

 Thoughts?

 -Alexey (Apache Ignite PMC)
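
To make the proposal a bit more concrete, a plug-in point for the off-heap
store might look roughly like the following trait. This is purely a
hypothetical sketch for discussion, not an existing Spark API:

import java.nio.ByteBuffer

// A minimal SPI an external in-memory filesystem (Tachyon, Apache Ignite,
// ...) could implement; the BlockManager would load an implementation chosen
// by configuration instead of hard-coding Tachyon.
trait OffHeapBlockStore {
  def init(executorId: String): Unit
  def put(blockId: String, data: ByteBuffer): Unit
  def get(blockId: String): Option[ByteBuffer]
  def remove(blockId: String): Boolean
  def contains(blockId: String): Boolean
  def shutdown(): Unit
}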



Re: Speeding up Spark build during development

2015-05-01 Thread Prashant Sharma
Hi Pramod,

If you are using sbt as your build, then you need to do sbt assembly once and
then use sbt ~compile. Also export SPARK_PREPEND_CLASSES=1 in your shell on
all nodes.
Maybe you can try this out?

Thanks,

Prashant Sharma



On Fri, May 1, 2015 at 2:16 PM, Pramod Biligiri pramodbilig...@gmail.com
wrote:

 Hi,
 I'm making some small changes to the Spark codebase and trying it out on a
 cluster. I was wondering if there's a faster way to build than running the
 package target each time.
 Currently I'm using: mvn -DskipTests  package

 All the nodes have the same filesystem mounted at the same mount point.

 Pramod



Re: Semantics of LGTM

2015-01-19 Thread Prashant Sharma
Patrick's original proposal LGTM :). However, until now I have been under the
impression that LGTM puts special emphasis on the TM part. That said, I will
be okay/happy with (and responsible for) the patch, if it goes in.

Prashant Sharma



On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin r...@databricks.com wrote:

 Maybe just to avoid LGTM as a single token when it is not actually
 according to Patrick's definition, but anybody can still leave comments
 like:

 The direction of the PR looks good to me. or +1 on the direction

 The build part looks good to me

 ...


 On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
 I've
  heard the semantics of LGTM expressed as I've looked at this
 thoroughly
  and take as much ownership as if I wrote the patch myself.  My
  understanding is that this is the level of review we expect for all
 patches
  that ultimately go into Spark, so it's important to have a way to
 concisely
  describe when this has been done.
 
  Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
  cases I've seen, if someone else says I looked at this very quickly and
  didn't see any glaring problems, it doesn't add any value for subsequent
  reviewers (someone still needs to take a thorough look).
 
  -Kay
 
  On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote:
 
   Yeah, the ASF +1 has become partly overloaded to mean both I would
 like
   to see this feature and this patch should be committed, although, at
   least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
 vote)
   should unambiguously mean the latter unless qualified in some other
 way.
  
   I don't have any opinion on the specific characters, but I agree with
   Aaron that it would be nice to have some sort of abbreviation for both
  the
   strong and weak forms of approval.
  
   -Sandy
  
On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   
I think the ASF +1 is *slightly* different than Google's LGTM,
 because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind it's
technical contents. For instance, I've seen people pile on +1's to
 try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.
   
There is definitely much overlap though in the meaning, though, and
it's largely because Spark had it's own culture around reviews before
it was donated to the ASF, so there is a mix of two styles.
   
Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (unlike the one Sandy proposed, e.g.). This is
what I've seen every project using the LGTM convention do (Google,
 and
some open source projects such as Impala) to indicate technical
sign-off.
   
- Patrick
   
On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com
 
   wrote:
I think I've seen something like +2 = strong LGTM and +1 = weak
  LGTM;
someone else should review before. It's nice to have a shortcut
 which
   isn't
a sentence when talking about weaker forms of LGTM.
   
On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote:
   
I think clarifying these semantics is definitely worthwhile. Maybe
  this
complicates the process with additional terminology, but the way
 I've
   used
these has been:
   
+1 - I think this is safe to merge and, barring objections from
  others,
would merge it immediately.
   
LGTM - I have no concerns about this patch, but I don't necessarily
   feel
qualified to make a final call about it.  The TM part acknowledges
  the
judgment as a little more subjective.
   
I think having some concise way to express both of these is useful.
   
-Sandy
   
On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com
   wrote:
   
Hey All,
   
Just wanted to ping about a minor issue - but one that ends up
  having
consequence given Spark's volume of reviews and commits. As much
 as
possible, I think that we should try and gear towards Google
 Style
LGTM on reviews. What I mean by this is that LGTM has the
 following
semantics:
   
I know this code well, or I've looked at it close enough to feel
confident it should be merged. If there are issues/bugs with this
  code
later on, I feel confident I can help with them.
   
Here is an alternative semantic:
   
Based on what I know about this part of the code, I don't see any
show-stopper problems with this patch.
   
The issue with the latter is that it ultimately erodes the
significance of LGTM, since subsequent reviewers need to reason
  about
what the person meant by saying LGTM. In contrast, having strong
semantics around LGTM can help

Re: sbt publish-local fails, missing spark-network-common

2014-11-22 Thread Prashant Sharma
Can you update to the latest master and see if this issue still exists?
On Nov 21, 2014 10:58 PM, pedrorodriguez ski.rodrig...@gmail.com wrote:

 Haven't found one yet, but work in AMPlab/at ampcamp so I will see if I can
 find someone who would know more about this (maybe reynold since he rolled
 out networking improvements for the PB sort). Good to have confirmation at
 least one other person is having problems with this rather than something
 isolated.

 -pedro



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/sbt-publish-local-fails-missing-spark-network-common-tp9471p9478.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Prashant Sharma
For Scala 2.11.4, there are minor changes needed in the REPL code. I can do
that if it is a high priority.

Prashant Sharma



On Thu, Nov 13, 2014 at 11:59 AM, Prashant Sharma scrapco...@gmail.com
wrote:

 Thanks Patrick. I have one suggestion: we should make passing -Pscala-2.10
 mandatory for Maven users. I am sorry for not mentioning this before. There
 is no way around passing that option for Maven users (only). However, this is
 unnecessary for sbt users, because it is added automatically if -Pscala-2.11
 is absent.


 Prashant Sharma



 On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen so...@cloudera.com wrote:

 - Tip: when you rebase, IntelliJ will temporarily think things like the
 Kafka module are being removed. Say 'no' when it asks if you want to
 remove
 them.
 - Can we go straight to Scala 2.11.4?

 On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey All,
 
  I've just merged a patch that adds support for Scala 2.11 which will
  have some minor implications for the build. These are due to the
  complexities of supporting two versions of Scala in a single project.
 
  1. The JDBC server will now require a special flag to build
  -Phive-thriftserver on top of the existing flag -Phive. This is
  because some build permutations (only in Scala 2.11) won't support the
  JDBC server yet due to transitive dependency conflicts.
 
  2. The build now uses non-standard source layouts in a few additional
  places (we already did this for the Hive project) - the repl and the
  examples modules. This is just fine for maven/sbt, but it may affect
  users who import the build in IDE's that are using these projects and
  want to build Spark from the IDE. I'm going to update our wiki to
  include full instructions for making this work well in IntelliJ.
 
  If there are any other build related issues please respond to this
  thread and we'll make sure they get sorted out. Thanks to Prashant
  Sharma who is the author of this feature!
 
  - Patrick
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 





Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Prashant Sharma
One thing we can do is print a helpful error and break. I don't know exactly
how this can be done, but since we can now write Groovy inside the Maven
build, we have more control. (Yay!!)

Prashant Sharma



On Thu, Nov 13, 2014 at 12:05 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Yeah Sandy and I were chatting about this today and didn't realize
 -Pscala-2.10 was mandatory. This is a fairly invasive change, so I was
 thinking maybe we could try to remove that. Also if someone doesn't
 give -Pscala-2.10 it fails in a way that is initially silent, which is
 bad because most people won't know to do this.

 https://issues.apache.org/jira/browse/SPARK-4375

 On Wed, Nov 12, 2014 at 10:29 PM, Prashant Sharma scrapco...@gmail.com
 wrote:
  Thanks Patrick, I have one suggestion that we should make passing
  -Pscala-2.10 mandatory for maven users. I am sorry for not mentioning
 this
  before. There is no way around not passing that option for maven
  users(only). However, this is unnecessary for sbt users because it is
 added
  automatically if -Pscala-2.11 is absent.
 
 
  Prashant Sharma
 
 
 
  On Wed, Nov 12, 2014 at 3:53 PM, Sean Owen so...@cloudera.com wrote:
 
  - Tip: when you rebase, IntelliJ will temporarily think things like the
  Kafka module are being removed. Say 'no' when it asks if you want to
 remove
  them.
  - Can we go straight to Scala 2.11.4?
 
  On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Hey All,
  
   I've just merged a patch that adds support for Scala 2.11 which will
   have some minor implications for the build. These are due to the
   complexities of supporting two versions of Scala in a single project.
  
   1. The JDBC server will now require a special flag to build
   -Phive-thriftserver on top of the existing flag -Phive. This is
   because some build permutations (only in Scala 2.11) won't support the
   JDBC server yet due to transitive dependency conflicts.
  
   2. The build now uses non-standard source layouts in a few additional
   places (we already did this for the Hive project) - the repl and the
   examples modules. This is just fine for maven/sbt, but it may affect
   users who import the build in IDE's that are using these projects and
   want to build Spark from the IDE. I'm going to update our wiki to
   include full instructions for making this work well in IntelliJ.
  
   If there are any other build related issues please respond to this
   thread and we'll make sure they get sorted out. Thanks to Prashant
   Sharma who is the author of this feature!
  
   - Patrick
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 



Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Prashant Sharma
+1, Sounds good.

Now I know whom to ping for what, even if I did not follow the whole
history of the project very carefully.

Prashant Sharma



On Thu, Nov 6, 2014 at 7:01 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi all,

 I wanted to share a discussion we've been having on the PMC list, as well
 as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed implementing a
 maintainer model for some of these components, similar to other large
 projects.

 As background on this, Spark has grown a lot since joining Apache. We've
 had over 80 contributors/month for the past 3 months, which I believe makes
 us the most active project in contributors/month at Apache, as well as over
 500 patches/month. The codebase has also grown significantly, with new
 libraries for SQL, ML, graphs and more.

 In this kind of large project, one common way to scale development is to
 assign maintainers to oversee key components, where each patch to that
 component needs to get sign-off from at least one of its maintainers. Most
 existing large projects do this -- at Apache, some large ones with this
 model are CloudStack (the second-most active project overall), Subversion,
 and Kafka, and other examples include Linux and Python. This is also
 by-and-large how Spark operates today -- most components have a de-facto
 maintainer.

 IMO, adopting this model would have two benefits:

 1) Consistent oversight of design for that component, especially regarding
 architecture and API. This process would ensure that the component's
 maintainers see all proposed changes and consider them to fit together in a
 good way.

 2) More structure for new contributors and committers -- in particular, it
 would be easy to look up who’s responsible for each module and ask them for
 reviews, etc, rather than having patches slip between the cracks.

 We'd like to start in a light-weight manner, where the model only
 applies to certain key components (e.g. scheduler, shuffle) and user-facing
 APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
 it if we deem it useful. The specific mechanics would be as follows:

 - Some components in Spark will have maintainers assigned to them, where
 one of the maintainers needs to sign off on each patch to the component.
 - Each component with maintainers will have at least 2 maintainers.
 - Maintainers will be assigned from the most active and knowledgeable
 committers on that component by the PMC. The PMC can vote to add / remove
 maintainers, and maintained components, through consensus.
 - Maintainers are expected to be active in responding to patches for their
 components, though they do not need to be the main reviewers for them (e.g.
 they might just sign off on architecture / API). To prevent inactive
 maintainers from blocking the project, if a maintainer isn't responding in
 a reasonable time period (say 2 weeks), other committers can merge the
 patch, and the PMC will want to discuss adding another maintainer.

 If you'd like to see examples for this model, check out the following
 projects:
 - CloudStack:
 https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
 - Subversion:
 https://subversion.apache.org/docs/community-guide/roles.html

 Finally, I wanted to list our current proposal for initial components and
 maintainers. It would be good to get feedback on other components we might
 add, but please note that personnel discussions (e.g. I don't think Matei
 should maintain *that* component) should only happen on the private list.
 The initial components were chosen to include all public APIs and the main
 core components, and the maintainers were chosen from the most active
 contributors to those modules.

 - Spark core public API: Matei, Patrick, Reynold
 - Job scheduler: Matei, Kay, Patrick
 - Shuffle and network: Reynold, Aaron, Matei
 - Block manager: Reynold, Aaron
 - YARN: Tom, Andrew Or
 - Python: Josh, Matei
 - MLlib: Xiangrui, Matei
 - SQL: Michael, Reynold
 - Streaming: TD, Matei
 - GraphX: Ankur, Joey, Reynold

 I'd like to formally call a [VOTE] on this model, to last 72 hours. The
 [VOTE] will end on Nov 8, 2014 at 6 PM PST.

 Matei


Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-01 Thread Prashant Sharma
An easier and quicker way to build Spark is

sbt/sbt assembly/assembly
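
In sbt, the <project>/<task> syntax scopes a task to a single sub-project, which
is why the command above builds only the main assembly instead of also packaging
the per-module assemblies such as the examples one. The same pattern works for
other tasks, for example (sub-project name assumed from the build layout):

    sbt/sbt core/compile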

Prashant Sharma




On Mon, Sep 1, 2014 at 8:40 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 If this is not a confirmed regression from 1.0.2, I think it's better to
 report it in a separate thread or JIRA.

 I believe serious regressions are generally the only reason to block a new
 release. Otherwise, if this is an old issue, it should be handled
 separately.

 On Monday, September 1, 2014, chutiumteng@gmail.com wrote:

  I didn't try it with 1.0.2.
 
  It always takes too long to build the Spark assembly jars... more than 20 min
 
  [info] Packaging
 
 
 /mnt/some-nfs/common/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar
  ...
  [info] Packaging
 
 
 /mnt/some-nfs/common/spark/examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar
  ...
  [info] Done packaging.
  [info] Done packaging.
  [success] Total time: 1582 s, completed Sep 1, 2014 1:39:21 PM
 
  Is there an easy way to exclude some modules such as spark/examples or
  spark/external?
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8163.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: [SNAPSHOT] Snapshot1 of Spark 1.1.0 has been posted

2014-08-08 Thread Prashant Sharma
Yeah, this should be changed. You can change the banner in the repl's
printWelcome function. Mind sending a PR?
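
As a rough illustration of the kind of change involved (a sketch only, not the
actual patch; it assumes the banner is printed from the repl's printWelcome and
that a build-time SPARK_VERSION constant is available to substitute in):

    // sketch: replace the hard-coded version in the welcome banner with a
    // build-time constant (org.apache.spark.SPARK_VERSION is assumed here)
    override def printWelcome() {
      echo("""Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version %s
          /_/
    """.format(org.apache.spark.SPARK_VERSION))
    }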

I think, in the future, this should only need to be changed in one place (not
sure how feasible that is). Volunteers?


Prashant Sharma




On Fri, Aug 8, 2014 at 12:48 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi Patrick,

 I am testing the 1.1 branch, but I see a lot of protobuf warnings while
 building the jars:

 [warn] Class com.google.protobuf.Parser not found - continuing with a stub.

 [warn] Class com.google.protobuf.Parser not found - continuing with a stub.

 [warn] Class com.google.protobuf.Parser not found - continuing with a stub.

 [warn] Class com.google.protobuf.Parser not found - continuing with a stub.

 I am compiling for 2.3.0cdh5.0.2... Later, when running the jobs, I am getting
 a protobuf error:

 Exception in thread main java.lang.VerifyError: class

 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$AddBlockRequestProto
 overrides final method
 getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;

 Is there a protobuf issue on this branch ?

 Also on the branch at least I am noticing the following:

 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
       /_/

 Won't it be 1.1.0-SNAPSHOT ?

 Thanks.

 Deb


 On Wed, Aug 6, 2014 at 11:24 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Minor correction: the encoded URL in the staging repo link was wrong.
  The correct repo is:
  https://repository.apache.org/content/repositories/orgapachespark-1025/
 
 
  On Wed, Aug 6, 2014 at 11:23 PM, Patrick Wendell pwend...@gmail.com
  wrote:
  
   Hi All,
  
   I've packaged and published a snapshot release of Spark 1.1 for
 testing.
  This is being distributed to the community for QA and preview purposes.
 It
  is not yet an official RC for voting. Going forward, we'll do preview
  releases like this for testing ahead of official votes.
  
   The tag of this release is v1.1.0-snapshot1 (commit d428d8):
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d428d88418d385d1d04e1b0adcb6b068efe9c7b0
  
   The release files, including signatures, digests, etc can be found at:
   http://people.apache.org/~pwendell/spark-1.1.0-snapshot1/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1025/
  
   NOTE: Due to SPARK-2899, docs are not yet available for this release.
  Docs will be posted ASAP.
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Welcoming two new committers

2014-08-08 Thread Prashant Sharma
Congratulations Andrew and Joey.

Prashant Sharma




On Fri, Aug 8, 2014 at 2:10 PM, Xiangrui Meng men...@gmail.com wrote:

 Congrats, Joey & Andrew!!

 -Xiangrui

 On Fri, Aug 8, 2014 at 12:14 AM, Christopher Nguyen c...@adatao.com
 wrote:
  +1 Joey & Andrew :)
 
  --
  Christopher T. Nguyen
  Co-founder & CEO, Adatao http://adatao.com [ah-'DAY-tao]
  linkedin.com/in/ctnguyen
 
 
 
  On Thu, Aug 7, 2014 at 10:39 PM, Joseph Gonzalez 
 jegon...@eecs.berkeley.edu
  wrote:
 
  Hi Everyone,
 
  Thank you for inviting me to be a committer.  I look forward to working
  with everyone to ensure the continued success of the Spark project.
 
  Thanks!
  Joey
 
 
 
 
  On Thu, Aug 7, 2014 at 9:57 PM, Matei Zaharia ma...@databricks.com
  wrote:
 
   Hi everyone,
  
   The PMC recently voted to add two new committers and PMC members: Joey
   Gonzalez and Andrew Or. Both have been huge contributors in the past
 year
   -- Joey on much of GraphX as well as quite a bit of the initial work
 in
   MLlib, and Andrew on Spark Core. Join me in welcoming them as
 committers!
  
   Matei
  
  
  
  
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: FYI -- javax.servlet dependency issue workaround

2014-05-27 Thread Prashant Sharma
Also, just for the sake of completeness: sometimes the desired dependency is
an older version, in which case, even if you include it as above, it may get
evicted (sbt's default conflict manager chooses the latest version).

To make sure that exact version is kept, we can force it:

libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224" force()
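
A related option in sbt 0.13, noted here only as a sketch, is dependencyOverrides,
which pins the version used during conflict resolution without adding a new
direct dependency:

    dependencyOverrides += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"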

Thanks

Prashant Sharma


On Wed, May 28, 2014 at 10:38 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote:

 Hi all,

 I had some trouble compiling an application (Shark) against Spark 1.0,
 where Shark had a runtime exception (at the bottom of this message) because
 it couldn't find the javax.servlet classes.  SBT seemed to have trouble
 downloading the servlet APIs that are dependencies of Jetty (used by the
 Spark web UI), so I had to manually add them to the application's build
 file:

 libraryDependencies += "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"

 Not exactly sure why this happens but thought it might be useful in case
 others run into the same problem.

 -Kay

 -

 Exception in thread main java.lang.NoClassDefFoundError:
 javax/servlet/FilterRegistration

 at

 org.eclipse.jetty.servlet.ServletContextHandler.init(ServletContextHandler.java:136)

 at

 org.eclipse.jetty.servlet.ServletContextHandler.init(ServletContextHandler.java:129)

 at

 org.eclipse.jetty.servlet.ServletContextHandler.init(ServletContextHandler.java:98)

 at
 org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:98)

 at
 org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:89)

 at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:64)

 at org.apache.spark.ui.WebUI$anonfun$attachTab$1.apply(WebUI.scala:57)

 at org.apache.spark.ui.WebUI$anonfun$attachTab$1.apply(WebUI.scala:57)

 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:57)

 at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:66)

 at org.apache.spark.ui.SparkUI.init(SparkUI.scala:60)

 at org.apache.spark.ui.SparkUI.init(SparkUI.scala:42)

 at org.apache.spark.SparkContext.init(SparkContext.scala:222)

 at org.apache.spark.SparkContext.init(SparkContext.scala:85)

 at shark.SharkContext.init(SharkContext.scala:42)

 at shark.SharkContext.init(SharkContext.scala:61)

 at shark.SharkEnv$.initWithSharkContext(SharkEnv.scala:78)

 at shark.SharkEnv$.init(SharkEnv.scala:38)

 at shark.SharkCliDriver.init(SharkCliDriver.scala:280)

 at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)

 at shark.SharkCliDriver.main(SharkCliDriver.scala)

 Caused by: java.lang.ClassNotFoundException:
 javax.servlet.FilterRegistration

 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)

 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

 at java.security.AccessController.doPrivileged(Native Method)

 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)

 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)

 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)

 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)

 ... 23 more



Re: Apache spark on 27gb wikipedia data

2014-05-05 Thread Prashant Sharma
I just thought maybe we could print a warning whenever that error occurs,
telling the user that they can tune either the memoryFraction or the executor
memory options. The warning would be displayed when the TaskSetManager receives
task failures due to OOM.
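
For context, a minimal sketch of the kind of tuning being suggested (the keys
are the standard spark.executor.memory and spark.storage.memoryFraction
settings; the app name, master URL, and values below are only illustrative):

    // illustrative values; choose them based on the executor heap and workload
    val conf = new org.apache.spark.SparkConf()
      .setAppName("wikipedia-pagerank")
      .setMaster("spark://<master>:7077")          // placeholder master URL
      .set("spark.executor.memory", "6g")          // larger executor heap
      .set("spark.storage.memoryFraction", "0.3")  // more room for task working memory
    val sc = new org.apache.spark.SparkContext(conf)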

Prashant Sharma


On Mon, May 5, 2014 at 2:10 PM, Ajay Nair prodig...@gmail.com wrote:

 Hi,

 I am using 1 master and 3 slave workers to process 27 GB of Wikipedia data.
 The data is tab-separated, and every line contains information about one
 Wikipedia page: the page title and the page contents. I am using the regular
 expression to extract links, as described on the page below:

 http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html#running-pagerank-on-wikipedia

 Although it runs fine for an ~300 MB data set, it runs into issues when I try
 to execute the same code on the 27 GB data from HDFS.
 The error thrown is given below:
 14/05/05 07:15:22 WARN scheduler.TaskSetManager: Loss was due to
 java.lang.OutOfMemoryError
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.util.regex.Matcher.init(Matcher.java:224)

 Is there any way to over come this issue?

 My cluster uses the EC2 m3.large instance type.

 Thanks
 Ajay



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-spark-on-27gb-wikipedia-data-tp6487.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



Re: Fw: Is there any way to make a quick test on some pre-commit code?

2014-04-24 Thread Prashant Sharma
Not sure, but I use sbt/sbt ~compile instead of package. Any reason we use
package instead of compile (which is slightly faster, of course)?


Prashant Sharma


On Thu, Apr 24, 2014 at 1:32 PM, Patrick Wendell pwend...@gmail.com wrote:

 This is already on the wiki:

 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools



 On Wed, Apr 23, 2014 at 6:52 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

   I was just asked the same question by others.
  
   I think Reynold gave a pretty helpful tip on this.
  
   Shall we put this on the Contribute-to-Spark wiki?
 
  --
  Nan Zhu
 
 
  Forwarded message:
 
   From: Reynold Xin r...@databricks.com
   Reply To: d...@spark.incubator.apache.org
    To: d...@spark.incubator.apache.org
   Date: Thursday, February 6, 2014 at 7:50:57 PM
   Subject: Re: Is there any way to make a quick test on some pre-commit
  code?
  
   You can do
  
   sbt/sbt assemble-deps
  
  
   and then just run
  
   sbt/sbt package
  
   each time.
  
  
   You can even do
  
   sbt/sbt ~package
  
   for automatic incremental compilation.
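   
    Putting these tips together, a typical fast-iteration loop might look like
    the sketch below (only commands already mentioned in this thread):
   
        sbt/sbt assemble-deps   # build the dependency assembly once
        sbt/sbt ~package        # then recompile/package incrementally on each save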
  
  
  
    On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
  
Hi, all
   
     Is it always necessary to run sbt assembly when you want to test some code?
    
     Sometimes you just repeatedly change one or two lines for some failed test
     case, and it is really time-consuming to run sbt assembly every time.
   
any faster way?
   
Best,
   
--
Nan Zhu
   
  
  
  
  
 
 
 



Re: Suggest to workaround the org.eclipse.jetty.orbit problem with SBT 0.13.2-RC1

2014-03-25 Thread Prashant Sharma
I think we should upgrade sbt. I have been using sbt since 0.13.2-M1 and have
not spotted any issues, so RC1 should be good; plus, it has the fast
incremental compilation.
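
As a concrete way to try the newer sbt on a single project (a sketch, assuming
the standard sbt launcher, which picks its version from project/build.properties):

    # run in the root of the project you want to build
    echo "sbt.version=0.13.2-RC1" > project/build.properties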

Prashant Sharma


On Wed, Mar 26, 2014 at 10:41 AM, Will Benton wi...@redhat.com wrote:

 - Original Message -

  At last, I worked around this issue by updating my local SBT to
 0.13.2-RC1.
  If any of you are experiencing a similar problem, I suggest you upgrade your
  local SBT version.

 If this issue is causing grief for anyone on Fedora 20, know that you can
 install sbt via yum and get an sbt 0.13.1 that has been patched to use Ivy
 2.3.0 instead of Ivy 2.3.0-rc1.  Obviously, this isn't a solution for
 everyone, but it's certainly cleaner than building your own sbt locally if
 you're using Fedora.  (If you try this out and run into any trouble,
 please let me know off-list and I'll help out.)



 best,
 wb